ETL
Extraction
Transformation
Loading
ETL Overview
Extraction, Transformation, Loading (ETL)
ETL gets data out of the source and loads it into the data warehouse; at its simplest, it is a process of copying data from one database to another.
Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.
When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation.
ETL Overview
ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers.
It is not a one-time event, as new data is added to the data warehouse periodically (monthly, daily, hourly).
Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
Automated
Well documented
Easily changeable
ETL Staging Database
ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database.
This creates a logical and physical separation between the source systems and the data warehouse.
It minimizes the impact of the intense periodic ETL activity on the source and data warehouse databases.
Extraction
The integration of all of the disparate systems across the enterprise is the real challenge in getting the data warehouse to a state where it is usable.
Data is extracted from heterogeneous data sources.
Each data source has its own distinct set of characteristics that need to be managed and integrated into the ETL system in order to extract data effectively.
Extraction
The ETL process needs to effectively integrate systems that have different:
DBMS
Operating Systems
Hardware
Communication protocols
A logical data map is needed before the physical data can be transformed.
The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system. It is usually presented in a table or spreadsheet.
Logical data map layout:
Target                               | Source                               | Transformation
Table Name | Column Name | Data Type | Table Name | Column Name | Data Type |
The content of the logical data mapping document has been proven to be the critical element required to plan ETL processes efficiently.
The table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts.
The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process.
The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement.
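The logical data map described above can be sketched as a small in-memory structure. This is only an illustration: the class name MapEntry, the table and column names, and the helper load_order are hypothetical, and the table_type field is assumed from the remark that the table type drives the load order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapEntry:
    """One row of a logical data map (all names here are illustrative)."""
    target_table: str
    target_column: str
    target_type: str
    table_type: str       # "dimension" or "fact"
    source_table: str
    source_column: str
    source_type: str
    transformation: str   # SQL fragment or note; may be empty


LOGICAL_DATA_MAP = [
    MapEntry("fact_sales", "amount", "DECIMAL(12,2)", "fact",
             "orders", "total", "NUMBER", "ROUND(total, 2)"),
    MapEntry("dim_customer", "customer_name", "VARCHAR(100)", "dimension",
             "customers", "name", "VARCHAR(80)", "TRIM(name)"),
    MapEntry("fact_sales", "customer_key", "INTEGER", "fact",
             "orders", "cust_id", "NUMBER", "look up surrogate key"),
    MapEntry("dim_date", "calendar_date", "DATE", "dimension",
             "orders", "order_date", "VARCHAR(10)", ""),
]


def load_order(entries):
    """Return target table names once each: dimensions first, then facts."""
    rank = {"dimension": 0, "fact": 1}
    ordered = sorted(entries, key=lambda e: rank[e.table_type])  # stable sort
    tables = []
    for entry in ordered:
        if entry.target_table not in tables:
            tables.append(entry.target_table)
    return tables


order = load_order(LOGICAL_DATA_MAP)
print(order)
```

Keeping the map as data rather than prose means the same document can drive both the developer's blueprint and the scheduling of load jobs.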
The analysis of the source system is usually broken into two major phases:
The data discovery phase
The anomaly detection phase
Extraction: Data Discovery Phase
A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it.
Once you understand what the target needs to look like, you need to identify and examine the data sources.
Data Discovery Phase
It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse.
Collecting and Documenting Source Systems
Keeping track of source systems
Determining the System of Record (the point where the data originates)
Defining the system of record is important because in most enterprises data is stored redundantly across many different systems.
Enterprises do this to make nonintegrated systems share data. It is very common for the same piece of data to be copied, moved, manipulated, transformed, altered, cleansed, or corrupted throughout the enterprise, resulting in varying versions of the same data.
Extraction: Data Content Analysis
Understanding the content of the data is crucial for determining the best approach for retrieval.
NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns. Inner-joining two or more tables on a column that contains NULL values will cause data loss! Remember, in a relational database NULL is not equal to NULL; that is why those joins fail. Check for NULL values in every foreign key in the source database. When NULL values are present, you must outer join the tables.
Dates in non-date fields. Dates are peculiar elements because they are the only logical elements that can come in various formats, literally containing different values while having the exact same meaning. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format.
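The NULL foreign key problem can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module with made-up customer and orders tables; it is an illustration of the join behavior, not a recommended extraction design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,          -- foreign key that may be NULL
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (100, 1, 25.0), (101, NULL, 40.0), (102, 2, 10.0);
""")

# Step 1: check the foreign key column for NULLs before deciding how to join.
null_fk_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]

# An inner join silently drops the order whose foreign key is NULL.
inner_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# A left outer join keeps every order; the missing customer shows up as NULL.
outer_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    LEFT OUTER JOIN customer c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# NULL = NULL evaluates to NULL (unknown), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(null_fk_count, inner_rows, outer_rows, null_equals_null)
```

Order 101 disappears from the inner join and survives the outer join, which is the data loss the slide warns about.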
During the initial load, capturing changes to data content in the source data is unimportant because you are most likely extracting the entire data source or a portion of it from a predetermined point in time.
Later, the ability to capture data changes in the source system instantly becomes a priority.
The ETL team is responsible for capturing data content changes during the incremental load.
Determining Changed Data
Audit Columns (used by the database and updated by triggers)
Audit columns are appended to the end of each table to store the date and time a record was added or modified.
You must analyze and test each of the columns to ensure that it is a reliable source to indicate changed data. If you find any NULL values, you must find an alternative approach for detecting change, for example using outer joins.
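A minimal sketch of audit-column extraction, assuming a hypothetical customer table with a last_modified column stored as ISO-8601 text (so string comparison matches chronological order). The function names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        last_modified TEXT            -- audit column (ISO-8601 text)
    );
    INSERT INTO customer VALUES
        (1, 'Acme',    '2024-01-05 09:00:00'),
        (2, 'Globex',  '2024-01-09 14:30:00'),
        (3, 'Initech', NULL);
""")


def audit_column_is_reliable(conn):
    """Return True only if no row has a NULL audit timestamp."""
    nulls = conn.execute(
        "SELECT COUNT(*) FROM customer WHERE last_modified IS NULL"
    ).fetchone()[0]
    return nulls == 0


def extract_changed_since(conn, last_run):
    """Return rows whose audit column is later than the previous run."""
    return conn.execute(
        "SELECT customer_id, name FROM customer "
        "WHERE last_modified > ? ORDER BY customer_id",
        (last_run,),
    ).fetchall()


reliable = audit_column_is_reliable(conn)
changed = extract_changed_since(conn, "2024-01-08 00:00:00")
print(reliable, changed)
```

Row 3 has no timestamp, so the date filter can never select it. That is why the reliability check should run first and why a NULL result calls for a different change-detection technique.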
Determining Changed Data
Process of Elimination
Process of elimination preserves exactly one copy of each previous extraction in the staging area for future use.
During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process.
Only differences (deltas) are sent to the data warehouse.
Not the most efficient technique, but the most reliable for capturing changed data.
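The comparison step of process of elimination can be sketched in plain Python. The function compute_deltas and the sample rows are hypothetical; each extract is assumed to fit in memory as a dictionary keyed by the natural key.

```python
def compute_deltas(previous, current):
    """Compare two full extracts keyed by natural key.

    Returns three dicts: rows that are new, rows whose content changed,
    and rows that disappeared from the source.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = {k: v for k, v in previous.items() if k not in current}
    return inserted, updated, deleted


# Retained copy of the last extraction versus today's full extraction.
previous_extract = {1: ("Ann", "Oslo"), 2: ("Bob", "Rome"), 3: ("Cy", "Lima")}
current_extract = {1: ("Ann", "Oslo"), 2: ("Bob", "Paris"), 4: ("Di", "Kyiv")}

inserted, updated, deleted = compute_deltas(previous_extract, current_extract)
print(inserted, updated, deleted)
```

Only the three delta sets would move on to the warehouse; the unchanged row for key 1 is eliminated.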
Determining Changed Data
Initial and Incremental Loads
Create two tables: previous load and current load.
The initial process bulk loads into the current load table. Since change detection is irrelevant during the initial load, the data continues on to be transformed and loaded into the ultimate target fact table.
When the process is complete, it drops the previous load table, renames the current load table to previous load, and creates an empty current load table. Since none of these tasks involve database logging, they are very fast!
The next time the load process is run, the current load table is populated. Select the current load table MINUS the previous load table. Transform and load the result set into the data warehouse.
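The previous-load and current-load rotation can be tried out with sqlite3. Two assumptions in this sketch: SQLite spells the set-difference operator EXCEPT where Oracle uses MINUS, and the table layout and run_load helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE previous_load (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE current_load  (id INTEGER, name TEXT, city TEXT);
""")


def run_load(conn, source_rows):
    """Populate current_load, return the deltas, then rotate the tables."""
    conn.executemany("INSERT INTO current_load VALUES (?, ?, ?)", source_rows)

    # current MINUS previous: rows that are new or changed since the last run.
    deltas = conn.execute("""
        SELECT id, name, city FROM current_load
        EXCEPT
        SELECT id, name, city FROM previous_load
        ORDER BY id
    """).fetchall()

    # Drop the old snapshot, promote the current one, start a fresh table.
    conn.executescript("""
        DROP TABLE previous_load;
        ALTER TABLE current_load RENAME TO previous_load;
        CREATE TABLE current_load (id INTEGER, name TEXT, city TEXT);
    """)
    return deltas


first_run = run_load(conn, [(1, "Ann", "Oslo"), (2, "Bob", "Rome")])
second_run = run_load(conn, [(1, "Ann", "Oslo"), (2, "Bob", "Paris"),
                             (3, "Cy", "Lima")])
print(first_run)
print(second_run)
```

On the first run every row counts as a delta because the previous load table is empty. Note that current MINUS previous surfaces new and changed rows only; rows deleted from the source would need the reverse comparison.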
Transformation
Main step where the ETL adds value
Actually changes data and provides guidance on whether data can be used for its intended purposes
Performed in the staging area
Transformation
Data Quality paradigm
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places: after extraction, and after cleaning and conforming, where additional checks are run.
Transformation: Cleaning Data
Anomaly Detection
Data sampling, e.g. count(*) of the rows for each value of a department column
Column Property Enforcement
NULL values in required columns
Numeric values that fall outside of expected highs and lows
Columns whose lengths are exceptionally short or long
Columns with values outside of discrete valid value sets
Adherence to a required pattern or membership in a set of patterns
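The column property checks listed above might look like this in code. Everything specific here is an assumption made for the example: the column names, the salary bounds, the department set, and the phone pattern.

```python
import re

# Hypothetical rules for a staged employee row.
VALID_DEPARTMENTS = {"SALES", "HR", "IT"}
PHONE_PATTERN = re.compile(r"\d{3}-\d{4}")
SALARY_RANGE = (0, 500_000)
NAME_LENGTH_RANGE = (2, 60)


def check_row(row):
    """Return a list of column property violations for one staged row."""
    errors = []
    # NULL value in a required column
    if row.get("employee_id") is None:
        errors.append("employee_id: NULL in required column")
    # Numeric value outside the expected low and high
    salary = row.get("salary")
    if salary is not None and not SALARY_RANGE[0] <= salary <= SALARY_RANGE[1]:
        errors.append("salary: outside expected range")
    # Exceptionally short or long value
    name = row.get("name") or ""
    if not NAME_LENGTH_RANGE[0] <= len(name) <= NAME_LENGTH_RANGE[1]:
        errors.append("name: unusual length")
    # Value outside the discrete valid value set
    if row.get("department") not in VALID_DEPARTMENTS:
        errors.append("department: not in valid value set")
    # Value that does not follow the required pattern
    phone = row.get("phone")
    if phone is not None and not PHONE_PATTERN.fullmatch(phone):
        errors.append("phone: does not match required pattern")
    return errors


good_row = {"employee_id": 7, "name": "Ann Lee", "salary": 52_000,
            "department": "IT", "phone": "555-1234"}
bad_row = {"employee_id": None, "name": "X", "salary": -5,
           "department": "LEGAL", "phone": "5551234"}

good_errors = check_row(good_row)
bad_errors = check_row(bad_row)
print(good_errors)
print(bad_errors)
```

Returning a list of violations, rather than raising on the first one, lets the caller decide which errors are fatal and which rows can still continue to loading.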
Transformation: Conforming
Structure Enforcement
Tables have proper primary and foreign keys
Tables obey referential integrity
Data and Rule value enforcement
Simple business rules
Logical data checks
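Both kinds of check can be expressed as queries against the staging tables. The sketch below uses sqlite3 with hypothetical stg_department and stg_employee tables; the rule that a leave date cannot precede a hire date is an invented example of a logical data check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE stg_employee (
        emp_id INTEGER PRIMARY KEY,
        dept_id INTEGER,
        hire_date TEXT,               -- ISO-8601 dates stored as text
        leave_date TEXT
    );
    INSERT INTO stg_department VALUES (10, 'Sales'), (20, 'IT');
    INSERT INTO stg_employee VALUES
        (1, 10, '2020-01-01', NULL),
        (2, 30, '2021-06-01', NULL),
        (3, 20, '2022-03-01', '2021-12-31');
""")


def orphaned_employees(conn):
    """Structure enforcement: employees pointing at a missing department."""
    return [r[0] for r in conn.execute("""
        SELECT e.emp_id
        FROM stg_employee e
        LEFT JOIN stg_department d ON d.dept_id = e.dept_id
        WHERE e.dept_id IS NOT NULL AND d.dept_id IS NULL
        ORDER BY e.emp_id
    """)]


def illogical_dates(conn):
    """Logical data check: leave date earlier than hire date."""
    return [r[0] for r in conn.execute("""
        SELECT emp_id FROM stg_employee
        WHERE leave_date IS NOT NULL AND leave_date < hire_date
        ORDER BY emp_id
    """)]


orphans = orphaned_employees(conn)
bad_dates = illogical_dates(conn)
print(orphans, bad_dates)
```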
[Flow diagram: Staged Data -> Cleaning and Conforming -> Fatal Errors? Yes: Stop. No: Loading.]
Loading
Loading Dimensions
Loading Facts
Loading Dimensions
Physically built to have the minimal set of components
The primary key is a single field containing a meaningless unique integer (Surrogate Key)
The DW owns these keys and never allows any other entity to assign them
Denormalized flat tables: all attributes in a dimension must take on a single value in the presence of a dimension primary key
Should possess one or more other fields that compose the natural key of the dimension
The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occurs in this module.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.
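Surrogate key assignment in the loading module can be sketched as a counter owned by the warehouse. The class SurrogateKeyGenerator, the function build_dimension, and the customer columns are hypothetical names chosen for this example.

```python
import itertools


class SurrogateKeyGenerator:
    """Hands out meaningless integer keys; only the warehouse assigns them."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self.key_for = {}             # natural key -> surrogate key

    def assign(self, natural_key):
        """Return the existing key for a natural key, or mint a new one."""
        if natural_key not in self.key_for:
            self.key_for[natural_key] = next(self._counter)
        return self.key_for[natural_key]


def build_dimension(staged_rows, keygen):
    """Attach a surrogate primary key to each staged dimension row."""
    return [
        {"customer_key": keygen.assign(row["customer_id"]), **row}
        for row in staged_rows
    ]


staged = [
    {"customer_id": "C-001", "name": "Acme", "city": "Oslo"},
    {"customer_id": "C-002", "name": "Globex", "city": "Rome"},
]
keygen = SurrogateKeyGenerator()
dim_customer = build_dimension(staged, keygen)
print(dim_customer)
```

Each output row carries both the surrogate primary key (customer_key) and the natural key (customer_id), matching the dimension layout described above.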
Loading Dimensions
When the DW receives notification that an existing row in a dimension has changed, it gives one of three types of response:
Type 1
Type 2
Type 3
Type 1 Dimension
Type 2 Dimension
Type 3 Dimension
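The slides name the three responses without spelling them out. The sketch below assumes the usual slowly changing dimension definitions: Type 1 overwrites the attribute, Type 2 adds a new row with a new surrogate key, and Type 3 keeps the prior value in an extra column. The row layout and function names are hypothetical.

```python
from datetime import date


def new_dim():
    """A one-row customer dimension used as the starting point for each demo."""
    return [{"surrogate_key": 1, "natural_key": "C-001", "city": "Oslo",
             "start_date": date(2020, 1, 1), "end_date": None,
             "is_current": True}]


def apply_type1(dim_rows, natural_key, attr, new_value):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row[attr] = new_value


def apply_type2(dim_rows, natural_key, attr, new_value, change_date, next_key):
    """Type 2: expire the current row and append a new row with a new key."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = change_date
            new_row = dict(row, surrogate_key=next_key, is_current=True,
                           start_date=change_date, end_date=None)
            new_row[attr] = new_value
            dim_rows.append(new_row)
            return new_row
    return None


def apply_type3(dim_rows, natural_key, attr, new_value):
    """Type 3: move the old value to a 'previous_<attr>' column, then overwrite."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value


t1 = new_dim()
apply_type1(t1, "C-001", "city", "Rome")

t2 = new_dim()
apply_type2(t2, "C-001", "city", "Rome", date(2024, 5, 1), next_key=2)

t3 = new_dim()
apply_type3(t3, "C-001", "city", "Rome")

print(len(t1), len(t2), len(t3))
```

Only the Type 2 response changes the row count, which is why it also forces an update to the surrogate key lookup used when loading facts.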
Loading Facts
Facts
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement.
Key Building Process: Facts
When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys.
ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.
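The key building step can be sketched with plain dictionaries standing in for the in-memory lookup tables. The lookup contents, column names, and the function build_fact_keys are assumptions made for the example.

```python
# One in-memory lookup per dimension: natural key -> current surrogate key.
customer_lookup = {"C-001": 1, "C-002": 2}
product_lookup = {"P-10": 501, "P-20": 502}

# Maps each natural-key field to its surrogate-key column and lookup table.
LOOKUPS = {
    "customer_id": ("customer_key", customer_lookup),
    "product_id": ("product_key", product_lookup),
}


def build_fact_keys(fact_record):
    """Replace natural keys with surrogate keys; pass measures through."""
    out = {}
    for field, value in fact_record.items():
        if field in LOOKUPS:
            key_column, table = LOOKUPS[field]
            out[key_column] = table[value]   # raises KeyError if unknown
        else:
            out[field] = value
    return out


incoming = {"customer_id": "C-002", "product_id": "P-10", "amount": 19.5}
fact_row = build_fact_keys(incoming)
print(fact_row)
```

Letting an unknown natural key raise an error is one possible design choice; a real load would have to decide whether to reject the record or create a placeholder dimension row.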
Key Building Process
Loading Fact Tables
Managing Indexes
Indexes are performance killers at load time
Drop all indexes before the load
Segregate updates from inserts
Load updates
Rebuild indexes
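The index steps above can be walked through with sqlite3. This is a sketch under assumptions: the fact_sales table, the index name, and load_fact_rows are invented, and a production load might keep any index that the update step itself relies on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);
    CREATE INDEX ix_fact_sales_date ON fact_sales (date_key);
    INSERT INTO fact_sales VALUES (20240101, 501, 10.0);
""")


def load_fact_rows(conn, updates, inserts):
    """Drop indexes, apply updates and inserts separately, rebuild indexes."""
    # Remember each index definition so it can be recreated afterwards.
    indexes = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'fact_sales' AND sql IS NOT NULL"
    ).fetchall()
    for name, _ in indexes:
        conn.execute(f'DROP INDEX "{name}"')

    # Updates and inserts are kept apart and loaded as separate batches.
    conn.executemany(
        "UPDATE fact_sales SET amount = ? WHERE date_key = ? AND product_key = ?",
        updates,
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", inserts)

    # Rebuild the indexes once, after all rows are in place.
    for _, sql in indexes:
        conn.execute(sql)
    conn.commit()


load_fact_rows(
    conn,
    updates=[(12.5, 20240101, 501)],
    inserts=[(20240102, 502, 7.0)],
)

rows = conn.execute("SELECT * FROM fact_sales ORDER BY date_key").fetchall()
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(rows, index_names)
```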
Managing Partitions
Partitions allow a table (and its indexes) to be physically divided into minitables for administrative purposes and to improve query performance.
The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are.
You need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need to be maintained.
Outwitting the Rollback Log
The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment where all transactions are managed by the ETL process, the rollback log is a superfluous feature that must be dealt with to achieve optimal load performance. Reasons why the data warehouse does not need rollback logging are:
All data is entered by a managed process: the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load process fails.
Each database management system has different logging features and manages its rollback log differently.