R. Marti
Data Warehousing
[Figure: architecture diagram with the labels Source Database (three), Data Warehouse, Data Mart (two), Interactive Analysis, Dashboards]
Focus
Architectural options and variations in data warehouse projects
Design of the single integrated data warehouse, in particular
- how to model temporal aspects
- how to ensure common dimensions (=> Master Data Management)
R. Marti 3-1 DWh 2011: Data Warehouse
Master Data
Since 1980, a lot of research has been conducted in temporal data models, temporal query languages, and temporal database systems. Generic support for temporal data is beginning to emerge in products: Teradata Database 13.10, IBM DB2 V10, Oracle
Notions of Time
Valid Time is the time during which a fact in the real world was, is, or will be true or, more precisely: was / is believed to be true or believed to become true. Note: This time is determined by the user. Sometimes also called effective time, as of time or business time.
Transaction Time is the time during which a fact in the real world was or is (rightly or wrongly) stored in the database. Note: This time is determined by the system (unless the user decides to delay entering the data, of course ...). Sometimes also called system time.
Example of an announcement made (and stored in a DB there and then) on October 1 2010 (= transaction time): David Cole will be Chief Risk Officer as of March 1 2011 (= valid time).
May be impractical as an implementation model, given that it may require lots of space, especially when
- the granularity of time is fine (minutes, seconds, milliseconds, ...)
- represented facts do not change often, i.e. stay the same over a longer interval (usually because they describe states rather than events)
Alternatives:
(1) associate temporal intervals with every tuple
(2) associate temporal intervals with every attribute value (but the 2nd approach requires complex attributes, violating 1NF)
Remarks:
Transformation into 1NF by replacing V_INTERVAL by V_FROM (valid from) and V_TO (valid to).
The symbol ? means unknown, until now, or until further notice. In standard SQL, it is usually represented by null or by the date 9999-12-31, neither of which is entirely satisfactory ...
Remarks:
We assume that ID is the primary key at every point in time (in every snapshot).
Producing a snapshot from a valid time relation is a simple selection in relational algebra:
  select ID, NAME, FNAME, ADDR, SAL
  from EMP
  where :t in V_INTERVAL
  (or: where :t between V_FROM and V_TO)
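The snapshot selection can be tried out with a small sketch, here run through Python's standard sqlite3 module. The rows, the ISO date strings, and the use of 9999-12-31 for "until further notice" are illustrative assumptions, not data from the slides; intervals are taken to be closed, matching the between predicate.

```python
import sqlite3

def snapshot(conn, t):
    """Return the EMP snapshot valid at date t (ISO string 'YYYY-MM-DD')."""
    return conn.execute(
        "select ID, NAME, FNAME, ADDR, SAL from EMP "
        "where ? between V_FROM and V_TO order by ID",
        (t,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table EMP (ID integer, NAME text, FNAME text, ADDR text, "
    "SAL integer, V_FROM text, V_TO text)"
)
# Hypothetical rows; '9999-12-31' stands for "until further notice".
conn.executemany(
    "insert into EMP values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Doe", "Ann", "Zug", 6000, "2005-01-01", "2007-12-31"),
        (1, "Doe", "Ann", "Zug", 6500, "2008-01-01", "9999-12-31"),
        (2, "Roe", "Bob", "Bern", 5000, "2006-01-01", "2006-12-31"),
    ],
)
print(snapshot(conn, "2006-06-30"))
```

ISO date strings compare correctly as text, which is why the sketch can store dates in text columns.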
transformation to 1NF
The first two tuples are called value equivalent since they have the same values in all attributes except the temporal attributes V_FROM and V_TO.
One (of many) possible alternative representations using two (non-maximal) contiguous intervals (assuming a temporal granularity of a day):
Example Query: Who left the company before 2008-01-01 and when?
  select ID, NAME, FNAME, V_TO
  from EMP
  where V_TO < date '2008-01-01'
(Incorrect) Result:
Example: Where did employees live and when (irrespective of salary)?
  select ID, NAME, FNAME, ADDR, V_FROM, V_TO
  from EMP
Result:
Coalesced form:
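Coalescing can be sketched as follows: value-equivalent tuples whose intervals overlap or are adjacent are merged into one tuple with a maximal interval. The sketch assumes closed intervals at a granularity of one day; the function name and the sample rows are hypothetical.

```python
from datetime import date, timedelta
from itertools import groupby

ONE_DAY = timedelta(days=1)

def coalesce(rows):
    """rows: iterable of (key, v_from, v_to) with closed date intervals,
    where key holds all non-temporal attribute values.
    Returns one row per maximal interval of each key."""
    result = []
    for key, group in groupby(sorted(rows), key=lambda r: r[0]):
        cur_from = cur_to = None
        for _, v_from, v_to in group:
            # Overlapping or directly adjacent: extend the current interval.
            if cur_from is not None and v_from - ONE_DAY <= cur_to:
                cur_to = max(cur_to, v_to)
            else:
                if cur_from is not None:
                    result.append((key, cur_from, cur_to))
                cur_from, cur_to = v_from, v_to
        result.append((key, cur_from, cur_to))
    return result

rows = [
    ((676, "Baar"), date(2006, 8, 1), date(2008, 2, 29)),
    ((676, "Bern"), date(2008, 3, 1), date(2009, 12, 31)),
    ((676, "Bern"), date(2010, 1, 1), date(9999, 12, 31)),
]
for row in coalesce(rows):
    print(row)
```

The adjacency test subtracts a day from the start date rather than adding one to the end date, so an end date of 9999-12-31 does not overflow.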
DB2 10 for z/OS and Teradata Database V13.10 support most of the SQL/Temporal proposal.
Undo log table containing changes to produce previous contents of the associated snapshot table (before images).
begin
  -- record when the change happened and which operation undoes it
  l_log.X_TIME := current_timestamp;
  l_log.UNDO_OP_CODE := 'update';
  -- copy the before image of the updated row
  l_log.ID := :old.ID;
  l_log.NAME := :old.NAME;
  l_log.FNAME := :old.FNAME;
  l_log.ADDR := :old.ADDR;
  l_log.SAL := :old.SAL;
  insert into EMP_UNDO_LOG values l_log;
end TR_AU_EMP;
/
Note: the trigger should probably check that ID has not changed and raise an application error if it has.
Bitemporal Relations
Valid time and transaction time can be combined to allow for a complete history of what information was/is believed to be true and when this was stored in the database.
Example: Complete (bitemporal) history of employee 676
1. 2006-07-01 (insert): 676 lives in Baar and earns 7000 as of 2006-08-01.
Example (continued): Complete (bitemporal) history of employee 676
2. 2008-04-01 (update): 676 lives in Bern as of 2008-03-01.
Example (continued): Complete (bitemporal) history of employee 676
3. 2009-11-01 (update): 676 earns 7500 as of 2010-01-01.
Example (continued): Complete (bitemporal) history of employee 676
4. 2009-11-11 (update, correction): 676 earns 7700 as of 2010-01-01.
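The four steps can be replayed with a small sketch that answers the bitemporal question "what did the database believe at transaction time X about valid time Y?". It is a simplification rather than the relational representation used on the slides: each statement is kept as a log entry, and an entry is assumed to hold from its valid-from date until the next later valid-from date recorded for the same attribute. The names LOG_676 and believed are hypothetical.

```python
from datetime import date

# (transaction date, attribute, value, valid from), taken from steps 1 to 4
LOG_676 = [
    (date(2006, 7, 1), "ADDR", "Baar", date(2006, 8, 1)),
    (date(2006, 7, 1), "SAL", 7000, date(2006, 8, 1)),
    (date(2008, 4, 1), "ADDR", "Bern", date(2008, 3, 1)),
    (date(2009, 11, 1), "SAL", 7500, date(2010, 1, 1)),
    (date(2009, 11, 11), "SAL", 7700, date(2010, 1, 1)),
]

def believed(log, known_at, valid_at):
    """Attribute values stored by transaction date known_at
    that apply to valid date valid_at."""
    best = {}
    for t_time, attr, value, v_from in log:
        if t_time > known_at or v_from > valid_at:
            continue  # not yet stored, or not yet valid
        # Prefer the latest valid-from; among equals, the latest correction.
        rank = (v_from, t_time)
        if attr not in best or rank > best[attr][0]:
            best[attr] = (rank, value)
    return {attr: value for attr, (rank, value) in best.items()}

# Before and after the correction of 2009-11-11:
print(believed(LOG_676, date(2009, 11, 5), date(2010, 6, 1)))
print(believed(LOG_676, date(2009, 11, 12), date(2010, 6, 1)))
```

Nothing is ever deleted from the log, which is what allows the state before the correction to be reconstructed.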
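At every point in time when the FK value was valid, the referenced PK value must be valid:
$\forall t \; \big( \tau_t(\mathrm{Works\_On}[\mathrm{Emp\_Id}]) \subseteq \tau_t(\mathrm{Emp}[\mathrm{Id}]) \big)$
where $\tau_t(R)$ denotes the snapshot of relation R at time t.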
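A minimal sketch of such a check, assuming closed day-granularity intervals and in-memory rows: a referencing tuple is accepted only if its valid interval is covered by the union of the valid intervals of the referenced key. The row layouts and the names covered and violations are assumptions made for illustration.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def covered(interval, parent_intervals):
    """True if the closed interval lies within the union of parent_intervals."""
    start, end = interval
    need = start  # first day not yet known to be covered
    for p_from, p_to in sorted(parent_intervals):
        if p_from > need:
            break  # gap: day `need` is not covered by any parent interval
        if p_to >= need:
            if p_to >= end:
                return True
            need = p_to + ONE_DAY
    return False

def violations(works_on, emp):
    """works_on: rows (emp_id, v_from, v_to); emp: rows (id, v_from, v_to).
    Returns the works_on rows that break temporal referential integrity."""
    by_id = {}
    for emp_id, v_from, v_to in emp:
        by_id.setdefault(emp_id, []).append((v_from, v_to))
    return [
        row for row in works_on
        if not covered((row[1], row[2]), by_id.get(row[0], []))
    ]

emp = [
    (676, date(2006, 8, 1), date(2008, 2, 29)),
    (676, date(2008, 3, 1), date(2010, 12, 31)),
]
works_on = [
    (676, date(2007, 1, 1), date(2009, 6, 30)),   # spans two adjacent Emp tuples
    (676, date(2010, 6, 1), date(2011, 6, 30)),   # outlives the employee
    (999, date(2007, 1, 1), date(2007, 12, 31)),  # unknown employee
]
print(violations(works_on, emp))
```

The first row shows why the check has to look at the union of intervals: no single Emp tuple covers it, yet the reference is valid at every point in time.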
Of course, performing temporal coalescing for
- adding tuples to and/or extending time intervals of the referencing relation
- deleting tuples from and/or shrinking time intervals in the referenced relation
would be an expensive proposition.
Recommendation: Track complete lifetimes of objects in a separate relation.
Recommendation: Attributes whose values change independently of other attributes should be put into different relations (somewhat like achieving 4NF in the face of multi-valued dependencies).
For kernel entity types (with objects whose existence is independent of other entities), consider the introduction of an object relation to capture the lifetime of these objects. Main benefits:
- referential integrity checking over time
- home for time-invariant attributes
For relations representing object properties (or relationships between objects) and their history, consider choosing a temporal primary key consisting of the non-temporal primary key attributes plus a (meaningless) sequence number.
For relations representing object properties (or relationships between objects), consider decomposing them into groups of attributes which
- are either time-invariant: this attribute group is moved to the object relation
- or change independently of one another (i.e., potentially at different times): each such attribute group is moved into a separate relation keeping track of the history of the values
Motivating Example
Compare profits over the years
- grouped by business divisions
- grouped by client ratings
What happens if, over time,
- business divisions change (e.g. profit centers are shifted)?
- ratings of clients change?
- two clients merge (e.g., primary insurers in the reinsurance business)?
PRODUCT (PROD_ID)
POLICY_PTF (<ForeignKeys>, PREMIUM_AMT, LOSS_AMT, EXPENSE_AMT, PROFIT_AMT)
PROF_CENTER (PC_ID, PC_NAME, DIV_ID, DIV_NAME)
CLIENT (CL_ID, CL_NAME, CL_RATING)
First impressions
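[Figure: two charts of the measure profit [CHF] over time, 2009 vs. 2010. First chart: changes of +24% (X), -40% (Y), +80% (Z). Second chart: the same comparison broken down into the units X2, X3, Y1, Y2, S, Z1, Z2, with changes labelled +24%, -40%, +0%, +80%, +11%.]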
[Figure: hierarchy with the nodes L&H Lines, Property, Casualty, Life, Health]
ProfitCenter: embedded in hierarchical org structure ProfitCenter -> Division -> Group
Client: hierarchical groupings possible, e.g., grouping by country -> continent,
[Figure: time axis with the capture times tCapture[1], tCapture[2] and the report production time tReport]
Which dimensional structure should reported measures refer to?
- original structures valid at respective capture times (tCapture[i])? => need history + valid times
- structures valid at report time (tReport)? => need succession mapping
- other times?
Running Example
[Figure: schema with the dimension Country (CountryId, CountryName), marked "changes", the dimension Year, and the measure Population (CountryId, Year)]
Key Questions
[Table with the columns Image and Description; the images are not recoverable]
- New value added
- Old value (name) will be replaced by new value
- A value will no longer be available for new contracts
- n old values will be merged into one value
- Old value will be divided into n values
- One value changes position in hierarchy
[Labels in the table: Succession Mapping, Taxonomic Relationship]
[Figure: examples of succession mappings: merge, split, rename]
adapted from Temporal Data Warehousing: Business Cases and Solutions, J. Eder et al.
a unique, invariant, not-to-be-reused identifier for the concept that the value represents
e.g. an identifier for the country first named Zaire and later Kongo
a validity period indicating the overall lifetime of the concept which the value represents
e.g. the lifetime of the country first named Zaire and later Kongo
validity periods indicating the lifetime of the values used to represent the concept
e.g. the lifetimes of the names Zaire and Kongo
Unique Identifier
SQL Statement to do all 3 steps
  SELECT COALESCE(s.CurrId, p.CountryId) AS CountryId
       , p.Year
       , SUM(p.Population) AS Population
  FROM CountryPopulation p
  LEFT OUTER JOIN CountrySuccession s ON s.Id = p.CountryId
  GROUP BY COALESCE(s.CurrId, p.CountryId), p.Year
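The effect of the succession mapping can be tried out with a small sketch run through Python's standard sqlite3 module. The identifiers A1, A2, A, B and all population figures are hypothetical. The sketch groups by the mapped identifier COALESCE(s.CurrId, p.CountryId), so that measures of merged predecessors are summed into a single row for their ultimate successor.

```python
import sqlite3

QUERY = """
SELECT COALESCE(s.CurrId, p.CountryId) AS CountryId,
       p.Year,
       SUM(p.Population) AS Population
FROM CountryPopulation p
LEFT OUTER JOIN CountrySuccession s ON s.Id = p.CountryId
GROUP BY COALESCE(s.CurrId, p.CountryId), p.Year
ORDER BY 1, 2
"""

def population_by_current_country(conn):
    """Population per year, reported under the current country identifiers."""
    return conn.execute(QUERY).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CountryPopulation (CountryId TEXT, Year INTEGER, Population INTEGER);
CREATE TABLE CountrySuccession (Id TEXT, SuccId TEXT, CurrId TEXT);
-- hypothetical data: A1 and A2 merge into A in 2010, B is unchanged
INSERT INTO CountryPopulation VALUES
  ('A1', 2009, 10), ('A2', 2009, 20), ('A', 2010, 31),
  ('B', 2009, 5), ('B', 2010, 6);
INSERT INTO CountrySuccession VALUES ('A1', 'A', 'A'), ('A2', 'A', 'A');
""")
for row in population_by_current_country(conn):
    print(row)
```

Countries without a successor have no row in CountrySuccession; the outer join together with COALESCE keeps them under their own identifier.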
Step 2: Extrapolate
Lifecycle of Concepts
[Figure: state diagram with the states Active, Inactive, Superseded and the transitions "Introduction as Inactive", "Move", "define successor"]
- Active: can be used to book new business and appear on reports
- Inactive: can appear on reports but cannot be used to book new business
- Superseded: cannot appear on reports nor be used to book new business
DB2 Colloquium 2006-10-25 R. Marti 3-1 DWh 2011: Data Warehouse Slide 60
Principle
Add valid times in dimensions in the Data Warehouse using
- an object table (Country)
- a single property table (here: CountryNames)
both with an associated valid time interval.
Let foreign keys in fact tables refer to the unchanging ID in object tables.
Generate standard Data Marts from this data model as needed, most often a history of measures according to the current dimensional structure.
Country (CountryId, VTimBeg, VTimEnd)
CountrySuccession (
  Id      -- original identifier
  SuccId  -- direct successor
  CurrId  -- ultimate successor
)
Population (CountryId, Year)
Year
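The CurrId column can be derived from the direct successors by following each succession chain to its end. The following sketch assumes that every identifier has at most one direct successor (renames and merges, but no splits); the function name and the sample identifiers are hypothetical.

```python
def ultimate_successors(direct):
    """direct: dict mapping Id -> SuccId (direct successor).
    Returns a dict mapping Id -> CurrId (ultimate successor)."""
    current = {}
    for start in direct:
        seen = {start}
        node = start
        # Follow the chain until an identifier without a successor is reached.
        while node in direct:
            node = direct[node]
            if node in seen:
                raise ValueError(f"cycle in succession of {start!r}")
            seen.add(node)
        current[start] = node
    return current

# A1 and A2 merged into A, which was later renamed to Z.
print(ultimate_successors({"A1": "A", "A2": "A", "A": "Z"}))
```

Storing CurrId redundantly next to SuccId lets reports map old identifiers to current ones with a single join instead of a recursive query.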
SCD Type 2: versions of some dimensional attributes are needed
- store new records in the dimension table, with a new DWh identifier (ID), the existing stable source system ID, and the new (changed) values
- e.g. a change in the rating of a client, or the new business division a profit center belongs to
SCD Type 3: current and original (or previous) versions are needed
- introduce a current and original attribute in the dimension table
- e.g. the current rating and the original rating of each client
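The difference between the two types can be illustrated with a minimal in-memory sketch. Dimension rows are plain dictionaries; the column names dwh_id, source_id, current_rating and original_rating, the client ID CL42, and the ratings are hypothetical.

```python
from itertools import count

def scd2_change(dimension, ids, source_id, **new_values):
    """Type 2: append a new version row with a fresh DWh ID,
    keeping the stable source system ID."""
    row = {"dwh_id": next(ids), "source_id": source_id, **new_values}
    dimension.append(row)
    return row

def scd3_change(row, attribute, new_value):
    """Type 3: overwrite the current value in place,
    remembering the original value on first change."""
    row.setdefault(f"original_{attribute}", row[f"current_{attribute}"])
    row[f"current_{attribute}"] = new_value
    return row

# Type 2: the client dimension grows by one row per rating change.
ids = count(1)
client_dim = []
scd2_change(client_dim, ids, "CL42", rating="A")
scd2_change(client_dim, ids, "CL42", rating="BBB")
print(client_dim)

# Type 3: one row per client, holding the current and the original rating.
client = {"source_id": "CL42", "current_rating": "A"}
scd3_change(client, "rating", "BBB")
scd3_change(client, "rating", "BB")
print(client)
```

With Type 2, facts keep pointing to the DWh ID of the version that was valid when they were captured; with Type 3, intermediate values such as BBB are lost.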
Cons
- Simplicity for business users is deceptive
- A change in a dimensional attribute effectively changes the context for all facts captured prior to the change