OCDQ Blog
Obsessive-Compulsive Data Quality by Jim Harris
Adventures in Data Profiling (Part 2)
August 05, 2009, Jim Harris
Obsessive-Compulsive Data Quality (OCDQ) is a blog offering a vendor-neutral perspective on data quality and its related disciplines. Jim Harris is the OCDQ Blogger-in-Chief and a freelance writer, professional speaker, thought leader, and independent consultant.
In Part 1 of this series: The adventures began with the following scenario: You are an external consultant on a new data quality initiative. You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses... ok, maybe not those last two or three things, but the rest is true.
You have no prior knowledge of the data or its expected characteristics. You are performing this analysis without the aid of either business requirements or subject matter experts. Your goal is to learn as much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.
The customer data source was processed by the data profiling tool, which provided the following statistical summaries:
http://www.ocdqblog.com/home/adventuresindataprofilingpart2.html 1/15
12/29/2016 Adventures in Data Profiling (Part 2) - OCDQ Blog
The Adventures Continue...
In Part 1, we asked if Customer ID was the primary key for this data source. In an attempt to answer this question, let's click on it and drill down to a field summary provided by the data profiling tool:
Please remember that my data profiling tool is fictional (i.e., not modeled after any real product) and therefore all of my screenshots are customized to illustrate series concepts. This screen would not only look different in a real data profiling tool, but it would also contain additional information.
This field summary for Customer ID includes some input metadata, identifying the expected data type and field length. Verifying that data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much-needed reality check for the perceptions and assumptions that we may have about our data.
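The metadata check described above can be sketched in a few lines of Python. This is only an illustration, not how the fictional tool works: the function name `check_against_metadata`, the expected type of `int`, and the field length of 10 are all assumptions chosen for the example.

```python
def check_against_metadata(values, expected_type=int, max_length=10):
    """Return the actual values that violate the declared type or length.

    expected_type and max_length stand in for the input metadata; both
    defaults are assumptions made for this sketch.
    """
    violations = []
    for raw in values:
        text = "" if raw is None else str(raw).strip()
        if not text:
            continue  # NULL/missing values are not checked here
        too_long = len(text) > max_length
        try:
            expected_type(text)
            wrong_type = False
        except ValueError:
            wrong_type = True
        if too_long or wrong_type:
            violations.append(raw)
    return violations


# Hypothetical sample values, not taken from the customer data source.
sample = ["1", "725019", "2232687", "ABC123", "12345678901", None]
print(check_against_metadata(sample))
```

An empty result would only say that the values fit the declared type and length; it would not say the metadata is the right description of the field.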
The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.
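As a rough sketch of how statistics like these could be computed for a single field, consider the following Python function. The name `profile_field`, the dictionary keys, and the format notation (9 for a digit, A for a letter) are assumptions for illustration; a real profiling tool would report more and would handle non-integer types.

```python
def profile_field(values):
    """Summarize one field: counts, lengths, numeric range, and formats."""
    # Treat None and blank strings as NULL/missing rather than actual values.
    actual = [str(v).strip() for v in values if v is not None and str(v).strip()]
    lengths = [len(v) for v in actual]
    numbers = [int(v) for v in actual if v.isdecimal()]
    # Assumed format notation: 9 for a digit, A for a letter, else the character.
    formats = {
        "".join("9" if c.isdecimal() else "A" if c.isalpha() else c for c in v)
        for v in actual
    }
    return {
        "count": len(values),
        "null_or_missing": len(values) - len(actual),
        "distinct_values": len(set(actual)),
        "min_length": min(lengths, default=None),
        "max_length": max(lengths, default=None),
        "min_value": min(numbers, default=None),
        "max_value": max(numbers, default=None),
        "distinct_formats": len(formats),
    }


# Hypothetical sample values.
print(profile_field(["7", "42", "725019", None, ""]))
```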
We can use drill-downs on the field summary screen to get more details about Customer ID provided by the data profiling tool.
The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS). Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).
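One way to picture this is a sketch that assigns each value the narrowest integer type able to hold it, and then collapses the list into the widest type needed. The type names and signed ranges below (8, 16, and 32 bits) are assumptions for illustration only; as noted above, actual integer types and their precision vary by RDBMS.

```python
from collections import Counter

# Assumed signed ranges; real RDBMS integer types and names differ.
INTEGER_TYPES = [
    ("INT8", -2**7, 2**7 - 1),
    ("INT16", -2**15, 2**15 - 1),
    ("INT32", -2**31, 2**31 - 1),
]
TYPE_ORDER = [name for name, _, _ in INTEGER_TYPES] + ["INT64_OR_LARGER"]


def smallest_integer_type(value):
    """Name of the narrowest assumed type whose range contains value."""
    for name, low, high in INTEGER_TYPES:
        if low <= value <= high:
            return name
    return "INT64_OR_LARGER"


def data_type_distribution(values):
    """Count how many values fall into each assumed integer type."""
    return Counter(smallest_integer_type(v) for v in values)


def collapsed_type(values):
    """Widest type needed, i.e. the one that can store every value."""
    return max((smallest_integer_type(v) for v in values), key=TYPE_ORDER.index)


# Hypothetical sample values.
sample = [5, 100, 200, 40000, 3338190]
print(data_type_distribution(sample))
print(collapsed_type(sample))
```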
Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values). Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.
The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit. We could also continue drilling down to see the actual field values behind the field formats.
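A field format distribution of this kind can be approximated by reducing each value to a pattern and counting the patterns. The sketch below assumes a simple notation (9 for a digit, A for a letter, other characters kept as-is); the notation and the function names are hypothetical and not taken from any particular tool.

```python
from collections import Counter


def field_format(value):
    """Reduce a value to a pattern: 9 for a digit, A for a letter."""
    return "".join(
        "9" if c.isdecimal() else "A" if c.isalpha() else c
        for c in str(value)
    )


def format_distribution(values):
    """Frequency of each observed pattern, most common first."""
    return Counter(field_format(v) for v in values).most_common()


# Hypothetical Customer ID values.
for pattern, count in format_distribution([7, 42, 99, 725019, 2232687]):
    print(pattern, count)
```

Drilling down to the values behind a format would then amount to filtering the values whose `field_format` equals the chosen pattern.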
Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?
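Part of this question can be tested mechanically: a primary key candidate must have a value on every record, and every value must be distinct. The following is a minimal sketch of that check (the function name is hypothetical). Passing it is a necessary condition only; it says nothing about whether the key is a surrogate or how it was generated.

```python
def is_primary_key_candidate(values):
    """True if every record has a value and no value repeats."""
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    complete = len(actual) == len(values)    # no NULL/missing values
    unique = len(set(actual)) == len(actual)  # no duplicates
    return complete and unique


# Hypothetical samples.
print(is_primary_key_candidate([1, 2, 3]))
print(is_primary_key_candidate([1, 2, 2]))
print(is_primary_key_candidate([1, None, 3]))
```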
In Part 1, we asked why the Gender Code field has 8 distinct values. Cardinality can play a major role in deciding whether or not you want to drill down to field values or field formats since it is much easier to review all of the field values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields).
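The cardinality rule of thumb above might look like this in Python: list every value when there are only a few distinct ones, and otherwise fall back to the most frequent. The thresholds `max_cardinality=20` and `top_n=5` are arbitrary assumptions for the sketch, as is the function name.

```python
from collections import Counter


def values_to_review(values, max_cardinality=20, top_n=5):
    """Frequency distribution to review for one field.

    Low cardinality: return every distinct value with its count.
    High cardinality: return only the top_n most frequent values.
    """
    counts = Counter(
        v for v in values if v is not None and str(v).strip() != ""
    )
    if len(counts) <= max_cardinality:
        return counts.most_common()
    return counts.most_common(top_n)


# Hypothetical Gender Code values, not the actual distribution.
print(values_to_review(["F", "M", "F", None, "1", "F", "M"]))
```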
We will drill down to this screen to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.
It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male. Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).
However, it is dangerous to assume anything, and what about those numeric values? Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields. For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name, and do the records with an Actual Gender Code value indicate the presence of a personal name?
To attempt to answer these questions, it may be helpful to review records with each of these field values. Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:
As is so often the case, data rarely conforms to our assumptions about it. Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?
In Part 3 of this series: We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.
Related Posts
Adventures in Data Profiling (Part 1)
Adventures in Data Profiling (Part 3)
Adventures in Data Profiling (Part 4)
Adventures in Data Profiling (Part 5)
Adventures in Data Profiling (Part 6)
Adventures in Data Profiling (Part 7)
Getting Your Data Freq On
Data Quality, Methodology
Adventure In Data Profiling, Data Profiling
12 Comments
Vish Agashe 7 years ago
Alright... if dates are not available...
It seems like either 0 or Unknown is being entered when there is no clarity about the name:
0 or Unknown is being used when only initials are in the customer name (0 for T.S. Elliot and Unknown for J.D. Salinger)
0 is used for Huggy Bear Brown (Sounds too suspicious to be a real name)
Unknown is used for Peter and Lois Griffin
While this set of records is not statistically significant enough to conclude that the pattern holds, further digging into this would be a good idea...
Look for the Gender Code where "&" or "and" is in the customer name.
Vish Agashe
Jim Harris 7 years ago
Hi Vish,
First of all, I am glad to see that this series is on your reading list :)
I agree that checking the associated MIN/MAX Create and Update Dates would be excellent analysis to perform in order to determine if the gender coding conventions have changed over time.
This information was unfortunately not available for this data source.
However, as I mentioned above, even if we could confirm that (and when) the gender coding conventions changed, it probably still doesn't explain the gender coding of the records we have looked at so far.
Best Regards...
Jim
Vish Agashe 7 years ago
Jim,
Catching up on my reading. So I decided to go sequentially. I have not read part 3 or above.
A quick question: on the table where you have values for gender code and count, would getting dates be possible? Min(Created Date), Max(Created Date), Min(Update Date), Max(Update Date).
If we can get some visibility into these dates, some of the answers to above discussions can be found (if the Legacy system implemented how genders are entered or if some of this information is coming from data migration done from other systems etc.)
Was that information available to the consultant?
Oh, and by the way, if you are a consultant next to me in the line for lunch on my anniversary with my employer, I like to eat brownies with my Pizza... :)
Vish
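Vish's suggestion can be sketched as a grouped minimum/maximum, assuming each record carried a create date. That is purely an assumption for illustration, since no such dates were available for this data source; the function name and sample records below are hypothetical.

```python
from datetime import date


def date_range_by_code(records):
    """Map each gender code to its (earliest, latest) create date.

    records is an iterable of (gender_code, created_date) pairs.
    """
    ranges = {}
    for code, created in records:
        low, high = ranges.get(code, (created, created))
        ranges[code] = (min(low, created), max(high, created))
    return ranges


# Hypothetical records; the real data source had no create dates.
sample = [
    ("F", date(2001, 1, 5)),
    ("1", date(1995, 3, 1)),
    ("F", date(1999, 7, 9)),
    ("1", date(1996, 2, 2)),
]
for code, (first, last) in date_range_by_code(sample).items():
    print(code, first, last)
```

Non-overlapping date ranges per code would hint that the coding convention changed over time; overlapping ranges would point more toward multiple conventions in use at once.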
Rahul 7 years ago
It is very good analysis and discussion on Profiling.
Some additional information that I want to contribute after looking at the profiling report is that, as you see, the Customer Id (Field Format) counts are in proper format. There is not a single number missed in the range (1-3,338,190). Just look at the counts for the different formats (say, for single-digit numbers a maximum of 9 values can exist, for two digits 90 values, and so on). So there is no missing Customer Id; whatever random number function is used, they may or may not be sequential.
Apart from that, seven-digit numbers still have 6,661,809 (9,000,000 - 2,338,191) values available, that is 284% of the occupied seven-digit values (i.e. 2,338,191). So right now the DWH is comfortable to accept almost three times the present count of customers. This can help us to decide strategy for future Customer Ids.
Rahul
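Rahul's arithmetic can be checked with a short sketch. It assumes, as the comment argues, that every ID from 1 up to the record count of 3,338,190 is in use; the function names are hypothetical.

```python
def max_values_with_digits(digits):
    """How many positive integers have exactly this many digits."""
    return 9 * 10 ** (digits - 1)


def seven_digit_headroom(record_count):
    """Return (used, free) seven-digit IDs, assuming IDs 1..record_count."""
    below_seven = sum(max_values_with_digits(d) for d in range(1, 7))
    used = record_count - below_seven
    free = max_values_with_digits(7) - used
    return used, free


used, free = seven_digit_headroom(3_338_190)
print(used, free)                  # 2338191 6661809
print(100 * free // used)          # 284 (percent of used seven-digit IDs)
print(100 * free // 3_338_190)     # 199 (percent of all IDs in use)
```

The 284% figure is the ratio of free to used seven-digit values; measured against all 3,338,190 IDs in use, the remaining seven-digit space is roughly 200%, which puts total seven-digit capacity at about three times the present count.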
Kay Rose 7 years ago
Jim,
I am amused. Thanks for the chuckle before heading into my weekend. :)
Regards back atcha...
Kay
Jim Harris 7 years ago
Kay,
You have a remarkable eye for spotting a potential pattern in the data and you provide an excellent deduction for why this pattern could be occurring.
I am writing this series in part because I truly find data analysis to be a fascinating adventure. I have blogged about what Arkady Maydanchik calls Data Gazing and why this is an essential skill for data quality initiatives (and really all enterprise information initiatives).
So I can't help but make an additional observation of my own about Gender Code: it sometimes appears to be incorrect.
For example, is George Eliot (Gender Code = 2 = Female), as my favorite Johnny Cash song would say, analogous to A Boy Named Sue?
Beyond my intentional joke (George Eliot was the pen name used by 19th century English novelist Mary Anne Evans), can we truly determine gender code with absolute certainty, even when it is provided to us directly from the customer?
After all, Mary Anne Evans sadly had to tell her readership that she was a man in order (as she personally felt, whether it was actually true or not) for her works to be taken seriously as literature at the time.
Best Regards
Jim
P.S. Additionally, it may amuse you to know that the Customer ID values in my data were generated using a random number function, but I cannot deny the pattern you noticed.
Kay Rose 7 years ago
Something I found interesting about Gender Code is the Customer ID values associated with them. For example, Patrick Thames has a Customer ID of 725019 and a Gender Code of 1, while Tereza M. Kundera has a Customer ID of 2232687 and a Gender Code of F, and Mary Anne Evans has a Customer ID of 2828666 and a Gender Code of 2. If the Customer ID is truly a sequential numeric key, then it seems that the different Gender Code values (numeric vs alpha) are being entered into the database during the same time periods. I wonder if this implies that the data in this legacy database are coming from different source systems and therefore supplying different code values.
Daragh O Brien 7 years ago
Excellent article Jim.
And you tease us with your striptease-like exposure of the tantalizing data and rules that exist below the surface...
Jim Harris 7 years ago
Excellent analysis, as usual Phil!
The only additional information that I will reveal at this time (count it as another preview provided by Bill's gratitude for us buying his pizza), is that the numeric gender codes were intended to represent the following scheme:
0 = Unknown
1 = Male
2 = Female
When we return to a more detailed analysis of Customer Name later in the series (probably Part 5 of what now is looking like an 8-part series), we may wonder if the input Gender Code can be trusted or if we should recommend that some external reference data is used to make gender recommendations for all of the personal names found on the customer records.
Perhaps I have said too much. This comment will self-destruct in... too late :)
Best Regards
Jim
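With the numeric scheme stated in the comment above, the observed codes could be standardized with a simple lookup. The sketch below is only that: the numeric entries come from the stated scheme, while the entries for F, M, Female, Male, and Unknown rest on the assumption that those values mean what they appear to mean. The names `GENDER_SCHEME` and `standardize_gender` are hypothetical.

```python
GENDER_SCHEME = {
    # Numeric scheme as stated in the comment above.
    "0": "Unknown",
    "1": "Male",
    "2": "Female",
    # Assumed meanings for the non-numeric values.
    "F": "Female",
    "M": "Male",
    "FEMALE": "Female",
    "MALE": "Male",
    "UNKNOWN": "Unknown",
}


def standardize_gender(code):
    """Map a raw Gender Code to a standard label.

    Returns None for NULL/missing input and "Unrecognized" for any
    value outside the scheme, so nothing is silently guessed.
    """
    if code is None or not str(code).strip():
        return None
    return GENDER_SCHEME.get(str(code).strip().upper(), "Unrecognized")


for raw in ["2", " f ", "Male", "0", "X", None]:
    print(repr(raw), "->", standardize_gender(raw))
```

Standardizing the representation does not make the codes correct; whether the input Gender Code can be trusted is a separate question that a lookup cannot answer.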
Phil A 7 years ago
There's only 1 gender code but 2 customer name fields. Which customer field does gender apply to?
On one hand it appears to be customer name 1 since we have Elinor (Female?) and Robert (Male?) Frost recorded as Female (from name 1) but by contrast we have Patrick Thames and George Eliot (name 1) with different genders (1 & 2, whatever they are).
Then there's the Griffins who are both in the same field; perhaps we need a Family Guy and Family Gal as gender codes...
What can we deduce about the numeric codes? We can see from the distribution that there are more Fs than Ms and more Females than Males so we might guess that because there are more 1s than 2s the 1s must be Female.
Is it true that this business deals with more Female customers than Male? What business is it? If the distribution ought to be a roughly equal split between genders then we might decide that 1s are male to balance it up.
Oh, and there's some 0s in there as well, we can't assume that 0 is unknown, it might be 0 female, 1 male, 2 unknown...
There's little point speculating, we have to ask the question about the numeric values, or do you have something else up your sleeve?
Jim Harris 7 years ago
Phil,
Thanks for continuing to provide your perspectives and questions; both are greatly appreciated.
I share your concern about Customer ID; definitely a question to include in our report.
I also share your deductive questions about Gender Code. Let's pretend that we fortuitously found ourselves standing in line in the cafeteria behind a business analyst that we know is working on the first draft of the business requirements and we decide to engage him in some casual conversation:
Us: Hey Bill, how are you today? Looks like everyone is in line for the pizza today. Hey, I was wondering what the source of that customer data I am looking at is; do you know by any chance?
Bill: Hey Consultant Dude, I am doing well, thanks for asking. Yes, pizza day is always a big hit in the cafeteria. That customer data we gave you? Oh, it's a full volume snapshot of our legacy system as of last week.
Us: Cool, good to know. How old is the legacy system?
Bill: Well, it's been used as our system of record for as long as I have been here and that will be 15 years next week.
Us: Congratulations, Bill. I hope you at least get your anniversary as an extra personal day. At the very least, let me buy your pizza today.
We external consultants are wily folk :)
So, perhaps we can assume that since the data source is a legacy system that has been in use for at least 15 years, it is certainly possible that gender coding has changed over time and that older records may not have been updated when the conventions changed. However, that would still be an assumption and it probably still doesn't explain some of the records we have looked at so far, does it?
Best Regards
Jim
Phil A 7 years ago
It's a fair bet that Customer ID is a sequence but what happens when it exceeds 10 digits? Will it be increased to accommodate more or will some old data be culled and the number reused?
As a Data Warehouse (DW) we don't want to delete any data; the upstream may not care.
Has the gender coding changed over time, or are all variations still being entered? Is there more than 1 system supplying this data, each having different standards?
© 2016, Jim Harris. Powered by Squarespace