
12/29/2016 Adventures in Data Profiling (Part 2) OCDQ Blog

OCDQ Blog
Obsessive-Compulsive Data Quality by Jim Harris


Adventures in Data Profiling (Part 2)
August 05, 2009, Jim Harris

Obsessive-Compulsive Data Quality (OCDQ) is a blog offering a vendor-neutral perspective on data quality and its related disciplines.

In Part 1 of this series: The adventures began with the following scenario: You are an external consultant on a new data quality initiative. You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses... ok, maybe not those last two or three things, but the rest is true.

You have no prior knowledge of the data or its expected characteristics. You are performing this analysis without the aid of either business requirements or subject matter experts. Your goal is to learn as much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.

The customer data source was processed by the data profiling tool, which provided the following statistical summaries:

Jim Harris is the OCDQ Blogger-in-Chief and a freelance writer, professional speaker, thought leader, and independent consultant.

http://www.ocdqblog.com/home/adventuresindataprofilingpart2.html 1/15

The Adventures Continue...

In Part 1, we asked if Customer ID was the primary key for this data source. In an attempt to answer this question, let's click on it and drill down to a field summary provided by the data profiling tool:


Please remember that my data profiling tool is fictional (i.e., not modeled after any real product) and therefore all of my screenshots are customized to illustrate series concepts. This screen would not only look different in a real data profiling tool, but it would also contain additional information.

This field summary for Customer ID includes some input metadata, identifying the expected data type and field length. Verifying that data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much-needed reality check for the perceptions and assumptions that we may have about our data.

The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.
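To make the length and value statistics concrete, here is a minimal Python sketch of how such a field summary could be computed. It is not how the fictional tool works; the function name `field_summary`, the treatment of blank strings as missing, and the sample values are all assumptions made for illustration.

```python
def field_summary(values):
    """Summarize one field. None and blank strings are treated as missing."""
    actual = [str(v) for v in values if v is not None and str(v).strip() != ""]
    # Compare as integers when every value is a plain digit string,
    # otherwise "9" would sort after "3338190".
    numeric = all(v.isascii() and v.isdigit() for v in actual)
    ordered = sorted(actual, key=int) if numeric else sorted(actual)
    lengths = [len(v) for v in actual]
    return {
        "count": len(values),
        "missing": len(values) - len(actual),
        "cardinality": len(set(actual)),
        "min_length": min(lengths, default=None),
        "max_length": max(lengths, default=None),
        "min_value": ordered[0] if ordered else None,
        "max_value": ordered[-1] if ordered else None,
    }


# Illustrative sample only, not the real 3,338,190 records.
sample_ids = ["7", "42", "725019", "2232687", "2828666", None]
print(field_summary(sample_ids))
```

Comparing the observed lengths and values against the declared data type and field length is the "reality check" described above.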

We can use drill-downs on the field summary screen to get more details about Customer ID provided by the data profiling tool.

The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS). Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).
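The idea of inferring data types from values can be sketched as follows. The three type names and their ranges are assumptions chosen for illustration (they resemble common RDBMS integer types, but the precision breakdown of the fictional RDBMS in this series is not specified here), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical integer types and ranges; real ranges vary by RDBMS.
INTEGER_TYPES = [
    ("TINYINT", 0, 255),
    ("SMALLINT", -32_768, 32_767),
    ("INTEGER", -2_147_483_648, 2_147_483_647),
]


def smallest_type(value: int) -> str:
    """Return the first (lowest precision) listed type that can hold the value."""
    for name, low, high in INTEGER_TYPES:
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit any listed type")


def observed_types(values):
    """Count how many values fall into each smallest-fitting type."""
    return Counter(smallest_type(v) for v in values)


def collapsed_type(values) -> str:
    """Return the single highest-precision type needed to store every value."""
    order = [name for name, _, _ in INTEGER_TYPES]
    return max((smallest_type(v) for v in values), key=order.index)


sample = [7, 42, 300, 725019, 3338190]  # illustrative values only
print(observed_types(sample))
print(collapsed_type(sample))
```

Reporting `observed_types` corresponds to listing several distinct data types, while `collapsed_type` corresponds to the option of collapsing the list into one type.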

Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values). Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.

The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit. We could also continue drilling down to see the actual field values behind the field formats.
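A common way to derive field formats is to mask each character by its class and then count the resulting patterns. The sketch below assumes a simple mask (digits become `9`, letters become `A`, everything else is kept); real tools may use different symbols, and the function names are hypothetical.

```python
from collections import Counter


def field_format(value: str) -> str:
    """Mask a value: digits become 9, letters become A, other characters are kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def format_distribution(values):
    """Return (format, count) pairs, most frequent first."""
    return Counter(field_format(v) for v in values).most_common()


# Illustrative sample only.
sample_ids = ["7", "42", "99", "725019", "2232687", "2828666"]
for fmt, count in format_distribution(sample_ids):
    print(fmt, count)
```

For a purely numeric field like Customer ID, each format is just a run of 9s, so the distribution amounts to a count of values per digit length.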

Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?
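One part of this question can be tested mechanically: a primary key candidate must be populated on every record and never repeat. The sketch below checks only those two conditions; the function name and the treatment of blank strings as missing are assumptions. Passing the check is necessary but not sufficient, since it says nothing about whether the field was designed as a surrogate key.

```python
def key_candidate_report(values):
    """Check completeness and uniqueness, the minimum bar for a key candidate."""
    total = len(values)
    present = [v for v in values if v is not None and str(v).strip() != ""]
    distinct = len(set(present))
    return {
        "total": total,
        "missing": total - len(present),
        "duplicates": len(present) - distinct,
        "is_candidate": total > 0 and len(present) == total and distinct == total,
    }


print(key_candidate_report(["1", "2", "3"]))
print(key_candidate_report(["1", "2", "2", None]))
```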

In Part 1, we asked why the Gender Code field has 8 distinct values. Cardinality can play a major role in deciding whether or not you want to drill down to field values or field formats, since it is much easier to review all of the field values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields).
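The decision described above could be sketched like this: review every value when cardinality is low, otherwise only the most frequent ones. The thresholds (`max_cardinality=20`, `top_n=5`) and the function name are arbitrary assumptions, not settings from any real tool.

```python
from collections import Counter


def values_to_review(values, max_cardinality=20, top_n=5):
    """Return all (value, count) pairs for low-cardinality fields,
    otherwise only the top_n most frequent pairs."""
    counts = Counter(values)
    if len(counts) <= max_cardinality:
        return counts.most_common()
    return counts.most_common(top_n)


# Illustrative sample of Gender Code values only.
gender_codes = ["F", "M", "F", "1", "2", "0", "Unknown", "F"]
print(values_to_review(gender_codes))
```

A field like Gender Code, with 8 distinct values, would be listed in full, whereas a field like Customer ID would be cut down to its most frequent values.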

We will drill down to this screen to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.

It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male. Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).

However, it is dangerous to assume anything, and what about those numeric values? Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields. For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name, and do the records with an Actual Gender Code value indicate the presence of a personal name?

To attempt to answer these questions, it may be helpful to review records with each of these field values. Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

As is so often the case, data rarely conforms to our assumptions about it. Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?

In Part 3 of this series: We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and Email Address.

Related Posts

Adventures in Data Profiling (Part 1)
Adventures in Data Profiling (Part 3)
Adventures in Data Profiling (Part 4)
Adventures in Data Profiling (Part 5)
Adventures in Data Profiling (Part 6)
Adventures in Data Profiling (Part 7)
Getting Your Data Freq On


Comments (12), Newest First

Vish Agashe 7 years ago

Alright... if dates are not available...

It seems like either 0 or Unknown is being entered when there is no clarity about the name:

0 or Unknown is being used when only initials are in the customer name (0 for T.S. Elliot and Unknown for J.D. Salinger)

0 is used for Huggy Bear Brown (sounds too suspicious to be a real name)

Unknown is used for Peter and Lois Griffin

While this pattern does not have a statistically significant set to conclude, further digging into this would be a good idea...

Look for the Gender Code where "&" or "and" is in the customer name.

Vish Agashe

Jim Harris 7 years ago

Hi Vish,

First of all, I am glad to see that this series is on your reading list :)

I agree that checking the associated MIN/MAX Create and Update Dates would be excellent analysis to perform in order to determine if the gender coding conventions have changed over time.

This information was unfortunately not available for this data source.

However, as I mentioned above, even if we could confirm that (and when) the gender coding conventions changed, it probably still doesn't explain the gender coding of the records we have looked at so far.

Best Regards...

Jim

Vish Agashe 7 years ago

Jim,

Catching up on my reading. So I decided to go sequentially. I have not read part 3 or above.

A quick question: on the table where you have values for gender code and count, would getting dates be possible? Min (Created Date), Max (Created Date), Min (Update Date), Max (Update Date).

If we can get some visibility into these dates, some of the answers to the above discussions can be found (if the Legacy system implemented how genders are entered, or if some of this information is coming from data migration done from other systems, etc.)

Was that information available to the consultant?

Oh, and by the way, if you are a consultant next to me in the line for lunch on my anniversary with my employer, I like to eat brownies with my Pizza... :)

Vish

Rahul 7 years ago

It is very good analysis and discussion on Profiling.

Some additional information that I want to contribute after looking at the profiling report is that, as you see, the Customer Id (Field Format) counts are in proper format. There is not a single number missed in the range (1 to 3,338,190). Just look at the counts for the different formats (say, in single-digit numbers a maximum of 9 values can exist, in two digits 90 values, and so on). So there is no missing Customer Id; whatever random number function is used, they may or may not be sequential.

Apart from that, seven-digit numbers still have 6,661,809 (9,000,000 - 2,338,191) values available, that is 284% of occupied values (i.e. 3,338,190). So right now the DWH is comfortable to accept almost three times more customers than the present count. This can help us to decide strategy for future Customer Ids.

Rahul
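Rahul's arithmetic can be checked with a short sketch. It assumes, as his comment does, that the IDs fill the range 1 to 3,338,190 with no gaps; the function names are made up for illustration. Note that the 284% figure corresponds to the free seven-digit values divided by the occupied seven-digit values (2,338,191); measured against all 3,338,190 records, the ratio is closer to 200%.

```python
def max_ids_with_digits(digits: int) -> int:
    """Number of positive integers with exactly `digits` digits (9, 90, 900, ...)."""
    return 9 * 10 ** (digits - 1)


def expected_format_counts(max_id: int) -> dict:
    """Count of IDs per digit length if every ID from 1 to max_id is present."""
    counts = {}
    digits = 1
    while 10 ** (digits - 1) <= max_id:
        low = 10 ** (digits - 1)
        high = min(10 ** digits - 1, max_id)
        counts[digits] = high - low + 1
        digits += 1
    return counts


record_count = 3_338_190  # record count stated in the post

counts = expected_format_counts(record_count)
used_seven_digit = counts[7]                               # 2338191
free_seven_digit = max_ids_with_digits(7) - used_seven_digit  # 6661809

print(counts)
print(used_seven_digit, free_seven_digit)
# Ratio against the occupied seven-digit IDs, then against all records.
print(round(100 * free_seven_digit / used_seven_digit, 1))
print(round(100 * free_seven_digit / record_count, 1))
```

If the observed field format counts match `expected_format_counts`, and every ID is distinct and no larger than the record count, then no ID in the range is missing, which is the point Rahul makes.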


Kay Rose 7 years ago

Jim,

I am amused. Thanks for the chuckle before heading into my weekend.

:)

Regards back atcha...

Kay

Jim Harris 7 years ago

Kay,

You have a remarkable eye for spotting a potential pattern in the data and you provide an excellent deduction for why this pattern could be occurring.

I am writing this series in part because I truly find data analysis to be a fascinating adventure. I have blogged about what Arkady Maydanchik calls Data Gazing and why this is an essential skill for data quality initiatives (and really all enterprise information initiatives).

So I can't help but make an additional observation of my own about Gender Code: it sometimes appears to be incorrect.

For example, is George Eliot (Gender Code = 2 = Female), as my favorite Johnny Cash song would say, analogous to "A Boy Named Sue"?

Beyond my intentional joke (George Eliot was the pen name used by 19th century English novelist Mary Anne Evans), can we truly determine gender code with absolute certainty, even when it is provided to us directly from the customer?

After all, Mary Anne Evans sadly had to tell her readership that she was a man in order (as she personally felt, whether it was actually true or not) for her works to be taken seriously as literature at the time.

Best Regards

Jim

P.S. Additionally, it may amuse you to know that the Customer ID values in my data were generated using a random number function, but I cannot deny the pattern you noticed.

Kay Rose 7 years ago

Something I found interesting about Gender Code is the Customer ID values associated with them. For example, Patrick Thames has a Customer ID of 725019 and a Gender Code of 1, while Tereza M. Kundera has a Customer ID of 2232687 and a Gender Code of F, and Mary Anne Evans has a Customer ID of 2828666 and a Gender Code of 2. If the Customer ID is truly a sequential numeric key, then it seems that the different Gender Code values (numeric vs alpha) are being entered into the database during the same time periods. I wonder if this implies that the data in this legacy database are coming from different source systems and therefore supplying different code values.

Daragh O Brien 7 years ago

Excellent article Jim.

And you tease us with your striptease-like exposure of the tantalizing data and rules that exist below the surface...

Jim Harris 7 years ago

Excellent analysis, as usual Phil!

The only additional information that I will reveal at this time (count it as another preview provided by Bill's gratitude for us buying his pizza), is that the numeric gender codes were intended to represent the following scheme:

0 = Unknown
1 = Male
2 = Female

When we return to a more detailed analysis of Customer Name later in the series (probably Part 5 of what now is looking like an 8-part series), we may wonder if the input Gender Code can be trusted or if we should recommend that some external reference data is used to make gender recommendations for all of the personal names found on the customer records.

Perhaps I have said too much... this comment will self-destruct in... too late :)

Best Regards

Jim

Phil A 7 years ago

There's only 1 gender code but 2 customer name fields. Which customer field does gender apply to?

On one hand it appears to be customer name 1, since we have Elinor (Female?) and Robert (Male?) Frost recorded as Female (from name 1), but by contrast we have Patrick Thames and George Eliot (name 1) with different genders (1 & 2, whatever they are).

Then there's the Griffins, who are both in the same field; perhaps we need a Family Guy and Family Gal as gender codes...

What can we deduce about the numeric codes? We can see from the distribution that there are more Fs than Ms and more Females than Males, so we might guess that because there are more 1s than 2s the 1s must be Female.

Is it true that this business deals with more Female customers than Male? What business is it? If the distribution ought to be a roughly equal split between genders then we might decide that 1s are male to balance it up.

Oh, and there's some 0s in there as well; we can't assume that 0 is unknown, it might be 0 female, 1 male, 2 unknown...

There's little point speculating; we have to ask the question about the numeric values, or do you have something else up your sleeve?

Jim Harris 7 years ago

Phil,

Thanks for continuing to provide your perspectives and questions; both are greatly appreciated.

I share your concern about Customer ID: definitely a question to include in our report.

I also share your deductive questions about Gender Code. Let's pretend that we fortuitously found ourselves standing in line in the cafeteria behind a business analyst that we know is working on the first draft of the business requirements, and we decide to engage him in some casual conversation:

Us: Hey Bill, how are you today? Looks like everyone is in line for the pizza today. Hey, I was wondering what the source of that customer data I am looking at is. Do you know, by any chance?

Bill: Hey Consultant Dude, I am doing well, thanks for asking. Yes, pizza day is always a big hit in the cafeteria. That customer data we gave you? Oh, it's a full volume snapshot of our legacy system as of last week.

Us: Cool, good to know. How old is the legacy system?

Bill: Well, it's been used as our system of record for as long as I have been here, and that will be 15 years next week.

Us: Congratulations, Bill. I hope you at least get your anniversary as an extra personal day. At the very least, let me buy your pizza today.

We external consultants are wily folk :)

So, perhaps we can assume that since the data source is a legacy system that has been in use for at least 15 years, it is certainly possible that gender coding has changed over time and that older records may not have been updated when the conventions changed. However, that would still be an assumption, and it probably still doesn't explain some of the records we have looked at so far, does it?

Best Regards

Jim

Phil A 7 years ago

It's a fair bet that Customer ID is a sequence, but what happens when it exceeds 10 digits? Will it be increased to accommodate more, or will some old data be culled and the number reused? As a Data Warehouse (DW) we don't want to delete any data; the upstream may not care.

Has the gender coding changed over time, or are all variations still being entered? Is there more than 1 system supplying this data, each having different standards?


2016, Jim Harris.
