
ETL Process in Data Warehouse

G.Lakshmi Priya & Razia Sultana.A


Assistant Professor/IT
Outline

ETL
Extraction
Transformation
Loading
ETL Overview

Extraction, Transformation, Loading (ETL)
To get data out of the source and load it into the data warehouse: simply a process of copying data from one database to another
Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database
Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation
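The three steps can be sketched in a few lines. This is only an illustration, assuming two in-memory SQLite databases stand in for the OLTP source and the warehouse; the table names `orders` and `fact_orders` and their columns are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical OLTP source and warehouse, both in-memory for the sketch.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 300)],
)
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, customer_name TEXT, amount REAL)"
)


def extract(conn):
    """Extract: read raw rows from the source system."""
    return conn.execute(
        "SELECT id, customer, amount_cents FROM orders ORDER BY id"
    ).fetchall()


def transform(rows):
    """Transform: reshape rows to match the warehouse schema."""
    return [(oid, name.strip().title(), cents / 100) for oid, name, cents in rows]


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT order_id, customer_name, amount FROM fact_orders ORDER BY order_id"
).fetchall()
```

Keeping extract, transform, and load as separate functions reflects the point that ETL is a process: each step can later be backed by a different physical implementation.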
ETL Overview

ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers
It is not a one-time event, as new data is added to the data warehouse periodically: monthly, daily, hourly
Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
Automated
Well documented
Easily changeable
ETL Staging Database

ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database
Creates a logical and physical separation between the source systems and the data warehouse
Minimizes the impact of the intense periodic ETL activity on source and data warehouse databases
Extraction

The integration of all of the disparate systems across the enterprise is the real challenge to getting the data warehouse to a state where it is usable
Data is extracted from heterogeneous data sources
Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data.
Extraction

The ETL process needs to effectively integrate systems that have different:
DBMS
Operating systems
Hardware
Communication protocols

Need to have a logical data map before the physical data can be transformed

The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, usually presented in a table or spreadsheet
Target                                | Source                                | Transformation
Table Name | Column Name | Data Type | Table Name | Column Name | Data Type |

The content of the logical data mapping document has been proven to be the critical element required to efficiently plan ETL processes

The table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts.

The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process

The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement
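One way to hold a logical data map in code is a list of records with the target, source, and transformation columns. The sketch below is an assumption for illustration only: the `MapEntry` class, the `table_type` field, and every table and column name are hypothetical. It also shows how the table type can drive load order, dimensions first and then facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapEntry:
    target_table: str
    target_column: str
    target_type: str
    table_type: str        # "dimension" or "fact" (hypothetical field)
    source_table: str
    source_column: str
    source_type: str
    transformation: str    # often a SQL fragment; may be empty


logical_data_map = [
    MapEntry("fact_sales", "amount", "REAL", "fact",
             "orders", "amount_cents", "INTEGER", "amount_cents / 100.0"),
    MapEntry("dim_customer", "name", "TEXT", "dimension",
             "customers", "cust_nm", "TEXT", "TRIM(cust_nm)"),
    MapEntry("dim_date", "date_key", "INTEGER", "dimension",
             "orders", "order_dt", "TEXT", ""),
]


def load_order(entries):
    """Return distinct target tables with dimensions first, then facts."""
    rank = {"dimension": 0, "fact": 1}
    ordered = sorted(entries, key=lambda e: rank[e.table_type])  # stable sort
    tables = []
    for entry in ordered:
        if entry.target_table not in tables:
            tables.append(entry.target_table)
    return tables


order = load_order(logical_data_map)
```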
The analysis of the source system is usually broken into two major phases:
The data discovery phase
The anomaly detection phase
Extraction: Data Discovery Phase
Data Discovery Phase
A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it
Once you understand what the target needs to look like, you need to identify and examine the data sources
Data Discovery Phase

It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse

Collecting and documenting source systems
Keeping track of source systems
Determining the system of record: the point of origin of the data
Definition of the system of record is important because in most enterprises data is stored redundantly across many different systems.
Enterprises do this to make nonintegrated systems share data. It is very common that the same piece of data is copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data
Extraction: Data Content Analysis
Understanding the content of the data is crucial for determining the best approach for retrieval
NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns. Inner joining two or more tables on a column that contains NULL values will cause data loss! Remember, in a relational database NULL is not equal to NULL, so a join condition on a NULL value never matches. That is why those joins fail. Check for NULL values in every foreign key in the source database. When NULL values are present, you must outer join the tables
Dates in nondate fields. Dates are very peculiar elements because they are the only logical elements that can come in various formats, literally containing different values and having the exact same meaning. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format
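The NULL foreign key problem is easy to reproduce. The sketch below assumes an in-memory SQLite database with hypothetical `customer` and `orders` tables; it compares an inner join with an outer join and counts the NULL foreign keys that should be checked for in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1, 25.0);
    INSERT INTO orders VALUES (11, NULL, 40.0);  -- NULL foreign key
""")

# Inner join: the order with the NULL foreign key silently disappears.
inner_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# Outer join: every order survives; the missing customer gets a placeholder.
outer_rows = conn.execute("""
    SELECT o.order_id, COALESCE(c.name, 'Unknown')
    FROM orders o
    LEFT OUTER JOIN customer c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# Profiling check to run against every foreign key column in the source.
null_fk_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]
```

The `'Unknown'` placeholder is one possible convention, not something the slides prescribe.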
During the initial load, capturing changes to data content in the source data is unimportant because you are most likely extracting the entire data source or a portion of it from a predetermined point in time.

Later, the ability to capture data changes in the source system instantly becomes a priority

The ETL team is responsible for capturing data content changes during the incremental load.
Determining Changed Data

Audit columns: used by the database and updated by triggers

Audit columns are appended to the end of each table to store the date and time a record was added or modified

You must analyze and test each of the columns to ensure that it is a reliable source to indicate changed data. If you find any NULL values, you must find an alternative approach for detecting change, for example using outer joins
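A minimal sketch of an audit-column extract, assuming a hypothetical `product` table whose `last_modified` column holds ISO-8601 timestamps as text. It first tests the column for NULL values, then pulls only rows changed since the last run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        last_modified TEXT   -- hypothetical audit column
    );
    INSERT INTO product VALUES (1, 'Widget', '2024-01-01 08:00:00');
    INSERT INTO product VALUES (2, 'Gadget', '2024-01-03 09:30:00');
    INSERT INTO product VALUES (3, 'Gizmo', NULL);
""")


def audit_column_is_reliable(conn):
    """The audit column is only trustworthy if it is never NULL."""
    nulls = conn.execute(
        "SELECT COUNT(*) FROM product WHERE last_modified IS NULL"
    ).fetchone()[0]
    return nulls == 0


def extract_changes(conn, last_run):
    """Return rows whose audit timestamp is later than the previous run."""
    return conn.execute(
        "SELECT product_id, name FROM product "
        "WHERE last_modified > ? ORDER BY product_id",
        (last_run,),
    ).fetchall()


reliable = audit_column_is_reliable(conn)
changed = extract_changes(conn, "2024-01-02 00:00:00")
```

Row 3 never appears in `changed` because its audit value is NULL, which is exactly why the reliability check has to come first.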
Determining Changed Data

Process of Elimination
Process of elimination preserves exactly one copy of each previous extraction in the staging area for future use.
During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process.
Only differences (deltas) are sent to the data warehouse.
Not the most efficient technique, but the most reliable for capturing changed data
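The comparison step can be sketched in plain Python. This assumes each extract fits in memory as a dictionary keyed by the natural key; the sample rows and the function name `diff_snapshots` are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two full extracts keyed by natural key and return the deltas."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    deleted = [k for k in previous if k not in current]
    return inserted, updated, deleted


# Retained copy of the last extraction and the fresh full extraction.
previous_extract = {1: ("Alice", "Paris"), 2: ("Bob", "Rome"), 3: ("Cara", "Oslo")}
current_extract = {1: ("Alice", "Paris"), 2: ("Bob", "Milan"), 4: ("Dan", "Bonn")}

inserted, updated, deleted = diff_snapshots(previous_extract, current_extract)
```

Unlike audit columns, this approach also reveals deleted rows, at the cost of reading the whole source table on every run.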
Determining Changed Data
Initial and Incremental Loads
Create two tables: previous load and current load.
The initial process bulk loads into the current load table. Since change detection is irrelevant during the initial load, the data continues on to be transformed and loaded into the ultimate target fact table.
When the process is complete, it drops the previous load table, renames the current load table to previous load, and creates an empty current load table. Since none of these tasks involve database logging, they are very fast!
The next time the load process is run, the current load table is populated.
Select the current load table MINUS the previous load table. Transform and load the result set into the data warehouse.
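A sketch of the incremental step, assuming SQLite, where the set-difference operator is spelled `EXCEPT` rather than Oracle's `MINUS`. The table names `previous_load` and `current_load` follow the text; the columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE previous_load (id INTEGER, city TEXT);
    CREATE TABLE current_load (id INTEGER, city TEXT);
    INSERT INTO previous_load VALUES (1, 'Paris'), (2, 'Rome');
    INSERT INTO current_load VALUES (1, 'Paris'), (2, 'Milan'), (3, 'Oslo');
""")

# Rows in the current load that were not present, unchanged, in the previous one.
deltas = conn.execute("""
    SELECT id, city FROM current_load
    EXCEPT
    SELECT id, city FROM previous_load
    ORDER BY id
""").fetchall()

# After the deltas are transformed and loaded, rotate the tables for the next run.
conn.executescript("""
    DROP TABLE previous_load;
    ALTER TABLE current_load RENAME TO previous_load;
    CREATE TABLE current_load (id INTEGER, city TEXT);
""")

carried_over = conn.execute("SELECT COUNT(*) FROM previous_load").fetchone()[0]
waiting = conn.execute("SELECT COUNT(*) FROM current_load").fetchone()[0]
```

Note that a set difference in this direction finds new and changed rows only; rows deleted from the source would need the reverse comparison.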
Transformation

Main step where the ETL adds value
Actually changes data and provides guidance on whether data can be used for its intended purposes
Performed in the staging area
Transformation

Data quality paradigm
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at 2 places: after extraction, and after cleaning and conforming, where additional checks are run
Transformation: Cleaning Data

Anomaly detection
Data sampling: count(*) of the rows for a department column
Column property enforcement
NULL values in required columns
Numeric values that fall outside of expected highs and lows
Columns whose lengths are exceptionally short or long
Columns with certain values outside of discrete valid value sets
Adherence to a required pattern or membership in a set of patterns
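The column property checks listed above can be sketched as one row-level screen. Everything specific here is an assumption for illustration: the column names, the age range, the length limits, the valid status set, and the five-digit pattern.

```python
import re

VALID_STATUSES = {"active", "inactive"}   # hypothetical discrete value set
ZIP_PATTERN = re.compile(r"\d{5}")        # hypothetical required pattern


def check_row(row):
    """Return a list of column property violations for one staged row."""
    errors = []
    if row.get("customer_id") is None:
        errors.append("customer_id is NULL")           # NULL in required column
    age = row.get("age")
    if age is not None and not (0 <= age <= 120):
        errors.append("age out of range")              # outside expected high/low
    name = row.get("name") or ""
    if not (2 <= len(name) <= 50):
        errors.append("name length suspicious")        # too short or too long
    if row.get("status") not in VALID_STATUSES:
        errors.append("status not in valid set")       # outside valid value set
    if not ZIP_PATTERN.fullmatch(row.get("zip") or ""):
        errors.append("zip does not match pattern")    # pattern adherence
    return errors


good = {"customer_id": 1, "name": "Alice", "age": 34, "status": "active", "zip": "12345"}
bad = {"customer_id": None, "name": "A", "age": 250, "status": "gone", "zip": "12a45"}

good_errors = check_row(good)
bad_errors = check_row(bad)
```

Returning a list of violations rather than raising on the first one lets the ETL process decide later which errors are fatal.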
Transformation: Conforming

Structure enforcement
Tables have proper primary and foreign keys
Obey referential integrity

Data and rule value enforcement
Simple business rules
Logical data checks
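Structure enforcement can be tested with two queries against the staging area: one for duplicate primary key candidates and one for rows that break referential integrity. The staging tables `stg_department` and `stg_employee` are hypothetical, and SQLite is assumed only for the sake of a runnable sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_department (dept_id INTEGER, name TEXT);
    CREATE TABLE stg_employee (emp_id INTEGER, dept_id INTEGER);
    INSERT INTO stg_department VALUES (1, 'Sales'), (2, 'IT'), (2, 'IT duplicate');
    INSERT INTO stg_employee VALUES (100, 1), (101, 3);
""")

# Primary key check: a key value must not occur more than once.
duplicate_keys = conn.execute("""
    SELECT dept_id FROM stg_department
    GROUP BY dept_id
    HAVING COUNT(*) > 1
""").fetchall()

# Referential integrity check: every employee must point at an existing department.
orphans = conn.execute("""
    SELECT e.emp_id
    FROM stg_employee e
    LEFT JOIN stg_department d ON d.dept_id = e.dept_id
    WHERE d.dept_id IS NULL
""").fetchall()
```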
[Flow diagram: staged data passes through cleaning and conforming; if fatal errors occur, stop; otherwise continue to loading]
Loading

Loading Dimensions
Loading Facts
Loading Dimensions

Physically built to have the minimal set of components
The primary key is a single field containing a meaningless unique integer: the surrogate key
The DW owns these keys and never allows any other entity to assign them
Denormalized flat tables: all attributes in a dimension must take on a single value in the presence of a dimension primary key.
Should possess one or more other fields that compose the natural key of the dimension
The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occurs in this module.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.
Loading Dimensions

When the DW receives notification that an existing row in a dimension has changed, it gives one of 3 types of responses
Type 1
Type 2
Type 3
Type 1 Dimension
Type 2 Dimension
Type 3 Dimension
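The slides name the three responses without defining them. Under the usual definitions (Type 1 overwrites the attribute, Type 2 adds a new row with a new surrogate key, Type 3 keeps the prior value in an extra column), they can be sketched as follows. The row layout, the `city` attribute, and the function names are hypothetical.

```python
from itertools import count

next_key = count(1)   # the warehouse, not the source, hands out surrogate keys


def new_row(natural_key, city):
    """Create a dimension row with a fresh surrogate key."""
    return {"sk": next(next_key), "nk": natural_key, "city": city,
            "prev_city": None, "current": True}


def apply_type1(dim, nk, city):
    """Type 1: overwrite the attribute in place; history is lost."""
    for row in dim:
        if row["nk"] == nk and row["current"]:
            row["city"] = city


def apply_type2(dim, nk, city):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    for row in dim:
        if row["nk"] == nk and row["current"]:
            row["current"] = False
    dim.append(new_row(nk, city))


def apply_type3(dim, nk, city):
    """Type 3: keep the prior value in a separate column of the same row."""
    for row in dim:
        if row["nk"] == nk and row["current"]:
            row["prev_city"], row["city"] = row["city"], city


dim1 = [new_row("C1", "Paris")]
apply_type1(dim1, "C1", "Rome")

dim2 = [new_row("C1", "Paris")]
apply_type2(dim2, "C1", "Rome")

dim3 = [new_row("C1", "Paris")]
apply_type3(dim3, "C1", "Rome")
```

Only the Type 2 response changes the number of rows, which is why it also affects surrogate key lookup when facts are loaded.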
Loading Facts

Facts
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement
Key Building Process: Facts

When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys
ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.
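A sketch of the key building step, assuming each lookup table is small enough to hold in memory as a Python dictionary from natural key to current surrogate key. The lookup contents, record fields, and function names are hypothetical.

```python
# One in-memory lookup per dimension: natural key -> current surrogate key.
customer_lookup = {"C1": 3, "C2": 4}
date_lookup = {"2024-01-03": 20240103}


def build_fact_row(record):
    """Replace the natural keys in an incoming fact record with surrogate keys."""
    return {
        "customer_sk": customer_lookup[record["customer_id"]],
        "date_sk": date_lookup[record["order_date"]],
        "amount": record["amount"],
    }


def register_type2_change(lookup, natural_key, new_surrogate_key):
    """After a Type 2 change, point the natural key at the newest surrogate key."""
    lookup[natural_key] = new_surrogate_key


incoming = {"customer_id": "C1", "order_date": "2024-01-03", "amount": 19.99}

before = build_fact_row(incoming)
register_type2_change(customer_lookup, "C1", 5)
after = build_fact_row(incoming)
```

A missing natural key raises `KeyError` here; a production process would need an explicit policy for such records, which the slides do not cover.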
Key Building Process
Loading Fact Tables

Managing Indexes
Performance killers at load time
Drop all indexes in pre-load time
Segregate updates from inserts
Load updates
Rebuild indexes
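The index steps can be sketched in the order listed, assuming SQLite and a hypothetical `fact_sales` table with a single index. Real bulk loaders and index options differ by DBMS, so this shows only the sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (date_sk INTEGER, customer_sk INTEGER, amount REAL);
    CREATE INDEX ix_fact_sales_date ON fact_sales (date_sk);
    INSERT INTO fact_sales VALUES (20240101, 1, 10.0);
""")

# Incoming rows, already segregated into updates and inserts.
updates = [(12.0, 20240101, 1)]                        # (amount, date_sk, customer_sk)
inserts = [(20240102, 1, 5.0), (20240102, 2, 7.5)]     # (date_sk, customer_sk, amount)

# 1. Drop indexes before the load so they are not maintained row by row.
conn.execute("DROP INDEX ix_fact_sales_date")

# 2. Load updates separately from inserts.
conn.executemany(
    "UPDATE fact_sales SET amount = ? WHERE date_sk = ? AND customer_sk = ?",
    updates,
)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", inserts)

# 3. Rebuild the indexes once, after all rows are in place.
conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_sk)")
conn.commit()

rows = conn.execute(
    "SELECT date_sk, customer_sk, amount FROM fact_sales ORDER BY date_sk, customer_sk"
).fetchall()
index_names = [
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```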
Managing Partitions
Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance
The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are
Need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need to be maintained.
Outwitting the Rollback Log
The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment where all transactions are managed by the ETL process, the rollback log is a superfluous feature that must be dealt with to achieve optimal load performance. Reasons why the data warehouse does not need rollback logging are:
All data is entered by a managed process: the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load process fails.
Each database management system has different logging features and manages its rollback log differently
