
3-1 Data Warehouse The Time Dimension

Data Warehousing Spring Semester 2011

R. Marti

The Data Warehouse in the DWh Reference Architecture


[Figure: DWh reference architecture. Labels: Source Database (three), Data Warehouse, Master Data, Data Mart (two), Reports, Interactive Analysis, Dashboards.]

Focus
Architectural options and variations in data warehouse projects
Design of the single integrated data warehouse, in particular
- how to model temporal aspects
- how to ensure common dimensions (=> Master Data Management)

R. Marti 3-1 DWh 2011: Data Warehouse

Recap: Time in Classical Data Mart Designs (1)


Recap: Time in Classical Data Mart Designs (2)


Rows in fact tables are associated with a specific time by the foreign key reference to the time dimension, indicating as of when they are valid.
However, rows in dimension tables are not associated with a time!
- new rows (rows with an unknown source system identifier) are simply added
- usually, no rows are deleted from a dimension table, even if rows with known source system identifiers are missing in a batch upload:
  . existing (old) facts still refer to objects corresponding to these missing rows
  . if sources do not send explicit information on deletions, it is unclear whether the corresponding objects have effectively become invalid or not (Note: Sending this information might mean re-designing the source system!)
- changes in values of dimension rows with known source system identifiers are
  . either simply overwritten,
  . or a new row with a new surrogate (but the old source system id) is added (see topic slowly changing dimensions)

Temporal Database Systems + Languages


For some types of analysis, dimensions should also be historized, especially for comparisons of measures across different time periods.
Example: How did the buying habits of customers change over the last few years, grouped by where they live?
=> The history of customer addresses should also be kept!

Since 1980, a lot of research has been conducted on temporal data models, temporal query languages, and temporal database systems.
Generic support for temporal data is beginning to emerge in products: Teradata Database 13.10, IBM DB2 V10, Oracle

Notions of Time
Valid Time is the time during which a fact in the real world was, is, or will be true or, more precisely: was / is believed to be true or believed to become true. Note: This time is determined by the user. Sometimes also called effective time, as of time or business time.

Transaction Time is the time during which a fact in the real world was or is (rightly or wrongly) stored in the database. Note: This time is determined by the system (unless the user decides to delay entering the data, of course ...). Sometimes also called system time.

Example of an announcement made (and stored in a DB there and then) on October 1 2010 (= transaction time): David Cole will be Chief Risk Officer as of March 1 2011 (= valid time).

Associating Time with Data


Assumption: For each relation, a clock with a given temporal granularity is specified, e.g., a day, a second, or a millisecond.
Conceptually, the extension of a temporal relation R can then be viewed as a sequence of snapshot relations R_t = τ_t(R), one for every time point t of this clock.

[Figure labels: time, attributes, tuples, snapshot at time t]

τ_t is called the snapshot operator (sometimes also the timeslice operator).



Benefits and Pitfalls of Sequence of Snapshots Model


Good for theoretical considerations, in particular
- determining equivalence of different temporal representations
- gauging the expressive power of temporal query languages

May be impractical as an implementation model, given that it may require lots of space, especially when
- granularity of time is fine-grained (minutes, seconds, milliseconds, ...)
- represented facts do not change often, i.e. stay the same over a longer interval (usually because they describe states rather than events)


From Sequence of Snapshots Model to Time Intervals


Remedy: Don't store data again that did not change since the previous clock tick.
Collect identical snapshots of suitable smaller parts of a relation (e.g., tuples or attribute values) and associate them with time intervals rather than time points.

Alternatives:
(1) associate temporal intervals with every tuple
(2) associate temporal intervals with every attribute value (but the 2nd approach requires complex attributes, violating 1NF)
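The step from snapshots to tuple-level intervals (alternative 1) can be sketched as follows. This is a minimal illustration, assuming integer clock ticks and a hypothetical dictionary `snapshots` that maps each tick to the set of tuples valid at that tick; neither the function name nor the sample data comes from the slides.

```python
from itertools import groupby


def snapshots_to_intervals(snapshots):
    """Turn {tick: set_of_tuples} into rows (tuple, t_from, t_to).

    Assumption: ticks are consecutive integers, intervals are inclusive.
    """
    rows = []
    all_tuples = set().union(*snapshots.values())
    for tup in sorted(all_tuples):
        ticks = sorted(t for t, snap in snapshots.items() if tup in snap)
        # Within a run of consecutive ticks, (tick - position) is constant.
        for _, run in groupby(enumerate(ticks), key=lambda p: p[1] - p[0]):
            run = [tick for _, tick in run]
            rows.append((tup, run[0], run[-1]))
    return rows


# Hypothetical sample: employee 676 lives in Baar, then Bern, then Baar again.
snapshots = {
    1: {("676", "Baar")},
    2: {("676", "Baar")},
    3: {("676", "Bern")},
    4: {("676", "Bern")},
    5: {("676", "Baar")},
}
intervals = snapshots_to_intervals(snapshots)
```

Five stored snapshots shrink to three interval-stamped tuples here; the saving grows with finer granularity and with facts that rarely change.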

Valid Time Relations capturing State


Conceptually, every tuple which captures a state is timestamped with a time interval [t_from, t_to] indicating the validity of the (non-temporal) data represented in the tuple.

Remarks:
- Transformation into 1NF by replacing V_INTERVAL by V_FROM (valid from) and V_TO (valid to)
- The symbol ? means unknown, until now, or until further notice. In standard SQL, it is usually represented by null or by the date 9999-12-31, neither of which is entirely satisfactory ...

Typical Queries (1): Snapshot of Valid Time Relation


Snapshots of the previous valid time relation:

Remarks:
- We assume that ID is the primary key at every point in time (in every snapshot).
- Producing a snapshot from a valid time relation is a simple selection in relational algebra:
    select ID, NAME, FNAME, ADDR, SAL
    from EMP
    where :t in V_INTERVAL
  (or: where :t between V_FROM and V_TO)
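The snapshot selection can be tried out with SQLite from Python. This is a sketch only: the rows, the names "Doe" and "Sue", and the helper `snapshot` are hypothetical, and dates are stored as ISO strings so that `between` compares them correctly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table EMP (
    ID integer, NAME text, FNAME text, ADDR text, SAL integer,
    V_FROM text, V_TO text)""")
# Hypothetical valid time history of employee 676 (1NF form with V_FROM, V_TO).
con.executemany("insert into EMP values (?,?,?,?,?,?,?)", [
    (676, "Doe", "Sue", "Baar", 7000, "2006-08-01", "2008-02-29"),
    (676, "Doe", "Sue", "Bern", 7000, "2008-03-01", "2009-12-31"),
    (676, "Doe", "Sue", "Bern", 7500, "2010-01-01", "9999-12-31"),
])


def snapshot(t):
    """Return the non-temporal attributes of all tuples valid at day t."""
    return con.execute(
        "select ID, NAME, FNAME, ADDR, SAL from EMP "
        "where ? between V_FROM and V_TO", (t,)).fetchall()
```

For example, `snapshot("2009-06-15")` returns the single Bern row with salary 7000, and a day before 2006-08-01 yields an empty snapshot.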

Valid Time Relations capturing Recurring States


A specific state of affairs can recur several times (=> several time periods)

transformation to 1NF

The first two tuples are called value equivalent since they have the same values in all attributes except the temporal attributes V_FROM and V_TO.

Options in the Representation of Time


Canonical representation using maximal time intervals (as on previous slide):

One (of many) possible alternative representations using two (non-maximal) contiguous intervals (assuming a temporal granularity of a day):


Issues with non-canonical Representations


Non-canonical representations may lead to incorrect answers:

Example Query: Who left the company before 2008-01-01 and when?
    select ID, NAME, FNAME, V_TO
    from EMP
    where V_TO < date '2008-01-01'
(Incorrect) Result:


Avoiding non-canonical Representations: By Design


Ensure that intervals remain maximal when inserting or updating:
Let
- R be a valid time relation in canonical form (i.e., with maximal time intervals)
- n be a new valid time tuple to be inserted into the relation R
- x1, ..., xk (k >= 0) be all existing valid time tuples in relation R which are value equivalent to n (cf. the slide "Valid Time Relations capturing Recurring States")
Then, for all i, 0 < i <= k, the following must hold (in pseudo-SQL notation):
    not exists (
      select * from R xi
      where xi = n                                        -- value equivalence
        and ( n.V_FROM - 1 between xi.V_FROM and xi.V_TO  -- intervals do not touch or overlap
           or n.V_TO + 1 between xi.V_FROM and xi.V_TO
           or xi.V_FROM between n.V_FROM and n.V_TO )     -- n must not enclose xi either
    )
(This could be specified as a declarative check constraint if the implementation supported it.)
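Since few systems accept such a check constraint, the rule is often enforced in application code. The following sketch assumes a granularity of one day and inclusive intervals; the function name and the tuple layout `(values, v_from, v_to)` are illustrative choices, not part of the slides.

```python
from datetime import date, timedelta

DAY = timedelta(days=1)


def keeps_intervals_maximal(existing, new):
    """Return True if inserting `new` leaves the relation in canonical form.

    existing: list of (values, v_from, v_to); new: one such tuple.
    The insert is rejected when `new` overlaps or touches a value-equivalent tuple.
    """
    values, n_from, n_to = new
    for x_values, x_from, x_to in existing:
        if x_values != values:
            continue  # not value equivalent, so no restriction applies
        # Overlap or adjacency; subtracting avoids overflow at 9999-12-31.
        if n_from - DAY <= x_to and x_from - DAY <= n_to:
            return False
    return True


# Hypothetical relation with one tuple for employee 676.
existing = [(("676", "Baar"), date(2006, 8, 1), date(2008, 2, 29))]
```

A tuple that starts on 2008-03-01 with the same values would touch the existing interval and is rejected; the caller would extend the existing tuple instead.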

Typical Queries (2): Temporal Projection


Unfortunately, (intermediate) query results may be non-canonical, even if applied to a canonical representation:

Example: Where did employees live and when (irrespective of salary)?
    select ID, NAME, FNAME, ADDR, V_FROM, V_TO
    from EMP
Result:


Avoiding non-canonical Representations: By Coalescing


Non-canonical representations can be transformed into the canonical representation by an operation called temporal coalescing which maximizes the length of all intervals by coalescing adjacent and overlapping intervals of value-equivalent tuples.

Coalesced form:


Temporal Coalescing in (Pseudo-) SQL


    with recursive Rclos as (
      -- initial ("anchor") query
      select R.values, R.V_FROM, R.V_TO
      from R
      union
      -- recursive query: executed until no new data is generated
      select R.values, R.V_FROM, Rclos.V_TO
      from R, Rclos
      where Rclos.values = R.values
        and Rclos.V_FROM >= R.V_FROM
        and Rclos.V_FROM - 1 <= R.V_TO
    )
    select Rclos.values, Rclos.V_FROM, Rclos.V_TO
    from Rclos
    where not exists (
      -- keep only maximal intervals: no other closure interval with the same
      -- values may properly contain this one (recurring states are kept)
      select * from Rclos c
      where c.values = Rclos.values
        and c.V_FROM <= Rclos.V_FROM and c.V_TO >= Rclos.V_TO
        and ( c.V_FROM < Rclos.V_FROM or c.V_TO > Rclos.V_TO )
    )
Note: A more efficient implementation uses window functions (see [Zhou et al 2006]).
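Outside the database, coalescing is commonly done with one sort followed by a linear merge pass. The sketch below shows that approach; it is not the recursive query above nor the method of [Zhou et al 2006], and it assumes day granularity, inclusive intervals, and sortable `values`. The sample rows are hypothetical.

```python
from datetime import date, timedelta


def coalesce(rows):
    """Merge adjacent or overlapping intervals of value-equivalent tuples.

    rows: iterable of (values, v_from, v_to) with inclusive date intervals.
    """
    result = []
    for values, v_from, v_to in sorted(rows):
        if result:
            last_values, last_from, last_to = result[-1]
            # Same values and no gap of a full day: extend the previous interval.
            if last_values == values and v_from - timedelta(days=1) <= last_to:
                result[-1] = (values, last_from, max(last_to, v_to))
                continue
        result.append((values, v_from, v_to))
    return result


# Hypothetical projection on (ID, ADDR): Bern appears in two contiguous intervals.
rows = [
    (("676", "Baar"), date(2006, 8, 1), date(2008, 2, 29)),
    (("676", "Bern"), date(2008, 3, 1), date(2009, 12, 31)),
    (("676", "Bern"), date(2010, 1, 1), date(2011, 12, 31)),
    (("676", "Baar"), date(2012, 1, 1), date(9999, 12, 31)),
]
coalesced = coalesce(rows)
```

The two Bern rows collapse into one maximal interval, while the two Baar periods stay separate because they are not contiguous (a recurring state).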

Typical Queries (3): Temporal Join


Sometimes, the history of information stored in two relations is of interest:

Example: Who worked on which projects and when? Result:


Temporal Join in SQL (without temporal coalescing!)


Construct time intervals of result by intersecting time intervals of operands (and keeping rows with non-empty intervals):
    select *
    from ( select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
                  case when e.V_FROM > w.V_FROM then e.V_FROM else w.V_FROM end as V_FROM,
                  case when e.V_TO < w.V_TO then e.V_TO else w.V_TO end as V_TO
           from WORKS_ON w, EMP e
           where e.ID = w.EMP_ID )
    where V_FROM <= V_TO
Note: This gets more tedious when (temporally) joining 3 or more relations.
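The interval intersection can be run as is in SQLite. In this sketch the tables are reduced to a few columns and the rows are hypothetical; the two-argument scalar `max` and `min` functions of SQLite stand in for the `case` expressions, and dates are ISO strings.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table EMP (ID integer, NAME text, ADDR text, V_FROM text, V_TO text);
create table WORKS_ON (EMP_ID integer, PROJ_ID integer, V_FROM text, V_TO text);
-- hypothetical data: 676 moves from Baar to Bern while working on project 1
insert into EMP values (676, 'Doe', 'Baar', '2006-08-01', '2008-02-29');
insert into EMP values (676, 'Doe', 'Bern', '2008-03-01', '9999-12-31');
insert into WORKS_ON values (676, 1, '2007-01-01', '2008-06-30');
insert into WORKS_ON values (676, 2, '2009-01-01', '2009-12-31');
""")

joined = con.execute("""
select * from (
  select w.PROJ_ID, w.EMP_ID, e.ADDR,
         max(e.V_FROM, w.V_FROM) as V_FROM,   -- later of the two start dates
         min(e.V_TO, w.V_TO) as V_TO          -- earlier of the two end dates
  from WORKS_ON w join EMP e on e.ID = w.EMP_ID
)
where V_FROM <= V_TO                          -- drop empty intersections
order by PROJ_ID, V_FROM
""").fetchall()
```

Project 1 is split at the move date into a Baar part and a Bern part, and the combination of project 2 with the Baar row disappears because its intersection is empty.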

Proposals for Temporal Support in SQL


There are proposals to hide this (and more, see following slides) temporal complexity in SQL, e.g., the SQL/Temporal part of a future SQL3 standard.
A temporal join (including temporal coalescing) would look as follows:
    validtime
    select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME
    from WORKS_ON w, EMP e
    where e.ID = w.EMP_ID
See e.g. [Snodgrass 1999] Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann, 1999.
Note: This publication is out of print, but available electronically as PDF at http://www.cs.arizona.edu/people/rts/publications.html

DB2 10 for z/OS and Teradata Database V13.10 support most of the SQL/Temporal proposal.

Transaction Time Relations


Note that transaction time should be automatically determined by the system at insert/update/delete time (or, more precisely, commit time), not by the user; granularity is typically as fine as possible.
Transaction time can be represented exactly like valid time, by associating a time interval with tuples.
Example: Transaction time history of employee 676 (also see the slide "Valid Time Relations capturing State")
1. 2006-07-01: insert: 676 lives in Baar and earns 7000.
2. 2008-04-01: update: 676 lives in Bern.
3. 2009-11-01: update: 676 earns 7500.


Using DBMS Logging to capture Transaction Time


Since transaction time can be automatically determined by the system, the DBMS logging facilities can be used.
This is/was done e.g. in Postgres/PostgreSQL/Illustra (and in Oracle).
Example: Transaction time history of employee 676 (as on the slide "Transaction Time Relations")
1. 2006-07-01: insert: 676 lives in Baar and earns 7000.
2. 2008-04-01: update: 676 lives in Bern.
3. 2009-11-01: update: 676 earns 7500.
Normal (snapshot) table containing current contents.

Undo log table containing changes to produce previous contents of associated snapshot table (before images).

Implementing Logging Using Triggers


Written in Oracle PL/SQL:

    create or replace trigger TR_AU_EMP
    after update on EMP
    for each row
    declare
      l_log EMP_UNDO_LOG%rowtype;
    begin
      -- copy the before image (:old values) into the undo log record
      l_log.X_TIME := current_timestamp;
      l_log.UNDO_OP_CODE := 'update';
      l_log.ID := :old.ID;
      l_log.NAME := :old.NAME;
      l_log.FNAME := :old.FNAME;
      l_log.ADDR := :old.ADDR;
      l_log.SAL := :old.SAL;
      insert into EMP_UNDO_LOG values l_log;
    end TR_AU_EMP;
    /

Notes:
- should probably check that ID has not changed and raise an application error if this were the case
- similar triggers required for inserts and deletes
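The same before-image logging can be tried without an Oracle installation. The sketch below ports the idea to an SQLite trigger driven from Python; the reduced column set and the sample updates are assumptions for illustration, and only the update trigger is shown.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table EMP (ID integer primary key, ADDR text, SAL integer);
create table EMP_UNDO_LOG (
    X_TIME text, UNDO_OP_CODE text, ID integer, ADDR text, SAL integer);

-- after each update, store the old row (before image) with a timestamp
create trigger TR_AU_EMP after update on EMP for each row
begin
    insert into EMP_UNDO_LOG
    values (strftime('%Y-%m-%d %H:%M:%f', 'now'), 'update',
            old.ID, old.ADDR, old.SAL);
end;

insert into EMP values (676, 'Baar', 7000);
update EMP set ADDR = 'Bern' where ID = 676;
update EMP set SAL = 7500 where ID = 676;
""")

current = con.execute("select * from EMP").fetchall()
undo_log = con.execute(
    "select UNDO_OP_CODE, ID, ADDR, SAL from EMP_UNDO_LOG order by rowid"
).fetchall()
```

After the two updates, `EMP` holds only the current row, while `EMP_UNDO_LOG` holds the two before images needed to reconstruct earlier contents.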



Bitemporal Relations
Valid time and transaction time can be combined to allow for a complete history of what information was/is believed to be true and when this was stored in the database.

Example: Complete (bitemporal) history of employee 676
1. 2006-07-01: insert: 676 lives in Baar and earns 7000 as of 2006-08-01.

Bitemporal Relations (2)

Example (continued): Complete (bitemporal) history of employee 676
2. 2008-04-01: update: 676 lives in Bern as of 2008-03-01.

Bitemporal Relations (3)

Example (continued): Complete (bitemporal) history of employee 676
3. 2009-11-01: update: 676 earns 7500 as of 2010-01-01.

Bitemporal Relations (4)

Example (continued): Complete (bitemporal) history of employee 676
4. 2009-11-11: update (correction): 676 earns 7700 as of 2010-01-01.
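The four steps can be written down as bitemporal rows and queried along both time axes. The rows below are a reconstruction from the four announcements, not a copy of the slide tables; they assume day granularity for both times, inclusive intervals, and 9999-12-31 as the open end. The names `EMP_676` and `believed` are hypothetical.

```python
END = "9999-12-31"

# (ADDR, SAL, V_FROM, V_TO, T_FROM, T_TO): valid time, then transaction time
EMP_676 = [
    ("Baar", 7000, "2006-08-01", END,          "2006-07-01", "2008-03-31"),
    ("Baar", 7000, "2006-08-01", "2008-02-29", "2008-04-01", END),
    ("Bern", 7000, "2008-03-01", END,          "2008-04-01", "2009-10-31"),
    ("Bern", 7000, "2008-03-01", "2009-12-31", "2009-11-01", END),
    ("Bern", 7500, "2010-01-01", END,          "2009-11-01", "2009-11-10"),
    ("Bern", 7700, "2010-01-01", END,          "2009-11-11", END),
]


def believed(valid_day, known_on):
    """What did the database say on `known_on` about the state at `valid_day`?"""
    return [(addr, sal)
            for addr, sal, v_from, v_to, t_from, t_to in EMP_676
            if v_from <= valid_day <= v_to and t_from <= known_on <= t_to]
```

For instance, asked on 2009-11-05 about mid-2010, the database reports a salary of 7500; asked after the correction of 2009-11-11, it reports 7700. No row is ever deleted; superseded beliefs are closed on the transaction time axis.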


Design of Temporal Databases


Basic idea
- Do non-temporal database design
- Annotate which tables / attributes need to be historized (especially valid time) and how (state-based vs. event-based)
- Generate temporal data structures ... but how?
Questions:
- Entity integrity (implemented by primary keys) => temporal entity integrity
- Referential integrity (implemented by foreign keys) => temporal referential integrity
Arbiter: sequence of snapshots model

Temporal Entity Integrity (1)


Temporal entity integrity = for every snapshot, entity integrity should hold.
Pro memoria:
- primary keys should consist of a minimal number of attributes which uniquely identify each tuple
- these attributes should ideally not change over time
Options for the primary key of a valid time relation (e.g. for table EMP):
(1) ID, V_FROM
(2) ID, V_TO
(3) ID, V_FROM, V_TO (non-minimal primary key!)
(4) ID, SEQ_NO (where SEQ_NO is a sequence number or counter)
Since all attributes except ID (and SEQ_NO) can change over the lifetime of the identified tuple
- alternative (4) is probably the best,
- followed by alternative (1), as V_FROM only changes in case of an error (and should not be referenced by foreign keys, as we'll see)

Temporal Entity Integrity (2)


In addition, it might be desirable to enforce other constraints, including
- Time intervals must not be empty
- Time intervals should be maximal (unless e.g. queries like "what was the case before or after a specific point in time" are not of importance)

    create table EMP (
      ID integer not null,
      SEQ_NO integer not null,
      NAME varchar(20) not null,
      ...
      V_FROM date not null,
      V_TO date default date '9999-12-31',
      primary key (ID, SEQ_NO),
      check ( V_FROM <= V_TO ),
      -- pseudo-SQL: no other value-equivalent row may touch or overlap this one
      check ( not exists (
        select * from EMP other
        where other.ID = ID and other.SEQ_NO <> SEQ_NO
          and other.NAME = NAME and ...
          and ( other.V_FROM between V_FROM-1 and V_TO+1
             or other.V_TO between V_FROM-1 and V_TO+1 ) ) )
    )

Referential Integrity between Snapshot Relations


The foreign key (FK) attribute value(s) in the referencing relation must exist as primary key (PK) values in the referenced relation.
Example: Works_On[Emp_Id] ⊆ Emp[Id]
Note: In relational theory, this is sometimes also called an inclusion dependency.


Temporal Referential Integrity (1)


Temporal referential integrity = for every snapshot, referential integrity must hold.
Problem:
- primary keys now have a temporal part (on top of the non-temporal part)
- valid time periods in the foreign key (referencing) relation are not necessarily the same as those of the primary key (referenced) relation

At every point in time when the FK value was valid, the referenced PK value must be valid:
∀t ( τ_t(Works_On[Emp_Id]) ⊆ τ_t(Emp[Id]) )
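The condition can be checked without materializing every snapshot: each referencing interval must be covered by the union of the referenced intervals for the same key. The sketch below assumes integer clock ticks and inclusive intervals; the function names and the tuple layouts are hypothetical.

```python
def covered(fk_interval, pk_intervals):
    """True if [f_from, f_to] lies inside the union of the pk intervals."""
    f_from, f_to = fk_interval
    need = f_from                      # first tick that still has to be covered
    for p_from, p_to in sorted(pk_intervals):
        if p_from > need:
            break                      # gap: tick `need` is covered by no interval
        if p_to >= need:
            need = p_to + 1
        if need > f_to:
            return True
    return need > f_to


def temporal_ri_holds(works_on, emp):
    """works_on: list of (emp_id, v_from, v_to); emp: list of (id, v_from, v_to)."""
    return all(
        covered((v_from, v_to),
                [(p_from, p_to) for pid, p_from, p_to in emp if pid == emp_id])
        for emp_id, v_from, v_to in works_on)


# Hypothetical data: employee 676 has two contiguous tuples (e.g. after a move).
emp = [(676, 1, 10), (676, 11, 30)]
```

A project assignment over ticks 5 to 20 is valid although no single `emp` tuple covers it, which is exactly why a naive per-tuple containment test is not sufficient.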


Temporal Referential Integrity (2)


∀t ( τ_t(Works_On[Emp_Id]) ⊆ τ_t(Emp[Id]) ) holds for employee 676 because projection followed by temporal coalescing would result in:

Of course, performing temporal coalescing for
- adding tuples to and/or extending time intervals of the referencing relation
- deleting tuples from and/or shrinking time intervals in the referenced relation
would be an expensive proposition.
Recommendation: Track complete lifetimes of objects in a separate relation.

Temporal Referential Integrity (3)


Split valid time relation on referenced (PK) side into an object relation and a property relation. Add a referential integrity constraint from property relation to object relation. Re-route non-temporal referential integrity constraints from other relations to the object relation.


Temporal Referential Integrity (4)


In referencing relations, it might be desirable to enforce referential integrity
- non-temporal part: as usual
- temporal part: time interval contained in time interval of referenced object

    create table WORKS_ON (
      EMP_ID integer not null,
      PROJ_ID integer not null,
      SEQ_NO integer not null,
      V_FROM date not null,
      V_TO date default date '9999-12-31',
      primary key (EMP_ID, PROJ_ID, SEQ_NO),
      check ( V_FROM <= V_TO ),
      foreign key (EMP_ID) references EMP_OBJ(ID),
      -- pseudo-SQL: the interval must lie within the lifetime of the referenced object
      check ( exists (
        select * from EMP_OBJ ref
        where ref.ID = EMP_ID
          and ref.V_FROM <= V_FROM
          and ref.V_TO >= V_TO ) )
      ... -- e.g. temporal FK to a table PROJ_OBJ
    )

Temporal Normalization (1): Time-invariant Attributes


Assume that attribute FName cannot change over the lifetime of an Emp (except to correct mistakes). In other words, the functional dependency (FD) Id -> FName holds.
=> relation Emp_Prop below is not in 2NF (attribute depends on part of PK)
=> relation Emp_Prop exhibits update anomalies when having to fix a mistake in Sue's first name (e.g. change to Susan)


Temporal Normalization (2): Time-invariant Attributes


Recommendation: Consider moving time-invariant attributes (e.g. FName) from the property relation (e.g. Emp_Prop) to the object relation (e.g. Emp_Obj).
- In Emp_Obj, the FD Id -> FName still holds (and is enforced by the PK), so the relation does not exhibit update anomalies.
- In Emp_Prop, all attributes are now fully dependent on the PK,
but there is still an issue ...


Temporal Normalization (3): Asynchronous Changes


Example: After having inserted the salary raise for employee 676 as of the beginning of 2010, we learn that she actually moved to Aarau as of Dec 1 2009.
=> update anomaly: several tuples need to be changed (in addition to an insert)!

Recommendation: Attributes whose values change independently of other attributes should be put into different relations (somewhat like achieving 4NF in the face of multi-valued dependencies).

Temporal Normalization (4): Asynchronous Changes


Example: Since address and salary of an employee may change independently (and asynchronously), these attributes should be put into different relations.
=> no update anomaly: one tuple needs to be changed (in addition to an insert)!

Employee salaries remain untouched:


Summary of Design Recommendations

Remember: Following them is no free lunch!

For kernel entity types (with objects whose existence is independent of other entities), consider the introduction of an object relation to capture the lifetime of these objects.
Main benefits:
- referential integrity checking over time
- home for time-invariant attributes
For relations representing object properties (or relationships between objects) and their history, consider choosing a temporal primary key consisting of the non-temporal primary key attributes plus a (meaningless) sequence number.
For relations representing object properties (or relationships between objects), consider decomposing them into groups of attributes which
- are either time-invariant => this attribute group is moved to the object relation
- or change independently of one another (i.e., potentially at different times) => each such attribute group is moved into a separate relation keeping track of the history of the values

Return to (Valid) Time in Warehousing


Motivating Example
Compare profits over the years
- grouped by business divisions
- grouped by client ratings
What happens if, over time,
- business divisions change (e.g. profit centers are shifted)?
- ratings of clients change?
- two clients merge (e.g., primary insurers in the reinsurance business)?

[Figure: star schema]
Fact table POLICY_PTF: <ForeignKeys>, PREMIUM_AMT, LOSS_AMT, EXPENSE_AMT, PROFIT_AMT
Dimension tables:
- TIME
- PRODUCT: PROD_ID
- PROF_CENTER: PC_ID, PC_NAME, DIV_ID, DIV_NAME
- CLIENT: CL_ID, CL_NAME, CL_RATING

First impressions
[Figure: bar chart of the measure profit [CHF] over time (2009, 2010), broken down by dimensional values (e.g., names of business divisions): X +24%, Y -40%, Z +80%]

First impressions can be deceiving


[Figure: the same chart of profit [CHF] for 2009 and 2010, broken down by profit centers X1, X2, X3, Y1, Y2, S, Z1, Z2, with a "Profit Center Shift" between the two years. Percentage labels: +24%, -40%, +0%, +80%, +11%]

Terminology and Concepts: Dimensional Hierarchies


Dimensions often have a hierarchical structure, e.g., in previous example:
- Product: hierarchical LineOfBusiness
    All Lines
      P&C Lines
        Property
        Casualty
      Special Lines
      L&H Lines
        Life
        Health
- ProfitCenter: embedded in hierarchical org structure ProfitCenter -> Division -> Group
- Client: hierarchical groupings possible, e.g., grouping by country -> continent, ...

Coping with Business Change


[Figure: time axis with tCapture[1], tCapture[2], and tReport]
- tCapture[i]: successful completion of business transaction; captured measures refer to dimensional structures valid at this time
- between capture and report: changes to referenced dimensional structures
- tReport: report production

Which dimensional structure should reported measures refer to?
- original structures valid at respective capture times (tCapture[i])?
- structures valid at report time (tReport)?
- other times?
Needed: history + valid times, succession mapping

Running Example
[Figure: running example schema. Annotations on the figure: changes, dimension, measure]
Country: CountryId, CountryName
Population: CountryId, Year
Year

Changes to Dimensional Structures


Type          Description
1 add         New value added
2 rename      Old value (name) will be replaced by new value
3 invalidate  A value will no longer be available for new contracts
4 merge       n old values will be merged into one value
5 split       Old value will be divided into n values
6 move        One value changes position in hierarchy

Key Questions: Succession Mapping, Taxonomic Relationship
[Image column with before/after sketches of the values A, A1, A2, B, C, D not recoverable]

Examples of Changes to Dimensional Structures


[Figure: examples of invalidate, add, merge, split, and rename; adapted from "Temporal Data Warehousing: Business Cases and Solutions", J. Eder et al.]

Issues: History, Validity and Succession of Values


Dimensional values to be tracked over time must have
- a unique, invariant, not-to-be-reused identifier for the concept that the value represents
  e.g. an identifier for the country first named Zaire and later Congo
- a validity period indicating the overall lifetime of the concept which the value represents
  e.g. the lifetime of the country first named Zaire and later Congo
- validity periods indicating the lifetime of the values used to represent the concept
  e.g. the lifetimes of the names Zaire and Congo

Invalid dimensional values must have another dimensional value as successor
  e.g., East Germany is succeeded by Germany

Unique Identifier

DB2 Colloquium 2006-10-25


Succession of Dimensional Values


Succession of Dimensional Values


Step 1: Find countries which have a successor

Step 2: Aggregate if two or more countries have the same successor


Succession of Dimensional Values

Step 3: Reassemble parts


Succession of Dimensional Values

SQL Statement to do all 3 steps:
    SELECT COALESCE(s.CurrId, p.CountryId) AS CountryId
         , p.Year
         , SUM(p.Population) AS Population
    FROM CountryPopulation p
    LEFT OUTER JOIN CountrySuccession s ON s.Id = p.CountryId
    -- group by the mapped identifier so that merged countries are summed up
    GROUP BY COALESCE(s.CurrId, p.CountryId), p.Year
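The statement can be exercised on a toy data set in SQLite. Everything in the sample is made up for illustration: the identifiers GDR, FRG, GER and CH as well as the population figures are not real statistics, and `CountrySuccession` contains only the columns listed in the schema of this deck.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table CountryPopulation (CountryId text, Year integer, Population integer);
create table CountrySuccession (Id text, SuccId text, CurrId text);
-- invented figures (in millions) for two merging countries and one stable country
insert into CountryPopulation values
    ('GDR', 1989, 16), ('FRG', 1989, 62), ('CH', 1989, 7),
    ('GER', 1991, 80), ('CH', 1991, 7);
insert into CountrySuccession values ('GDR', 'GER', 'GER'), ('FRG', 'GER', 'GER');
""")

mapped = con.execute("""
select coalesce(s.CurrId, p.CountryId) as CountryId,
       p.Year,
       sum(p.Population) as Population
from CountryPopulation p
left outer join CountrySuccession s on s.Id = p.CountryId
group by coalesce(s.CurrId, p.CountryId), p.Year
order by 1, 2
""").fetchall()
```

The two 1989 rows of the predecessors are reported as one row under the current identifier, so the history reads as if today's structure had always existed. Grouping by the mapped identifier rather than by `p.CountryId` is what produces this single row.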


Side Issue: Difficulties with the Split Operation


Example
- measures population and GNP (gross national product) have been collected for Czechoslovakia up to 1992
- as of 1993, the same measures are collected for the Czech Republic and Slovakia
Possible solutions
- from 1993 on, keep Czechoslovakia and compute its population and GNP figures by summing the figures of the Czech Republic and Slovakia
- up to 1992, compute approximate percentages of the population and GNP figures from Czechoslovakia for the Czech Republic and Slovakia
  note: in general, the percentages of the various measures are not identical
- leave countries as is and perform no mapping in either direction
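The second option, back-casting the old figures onto the successors, amounts to multiplying each historical value by an estimated share. The sketch below uses placeholder successor names and invented shares and figures; as noted above, a separate set of shares would be needed per measure.

```python
def allocate_split(parent_history, shares):
    """Distribute the history of a split-up value over its successors.

    parent_history: {year: value} of the old dimensional value
    shares: {successor: fraction}; the fractions are assumed to sum to 1
    """
    return {
        successor: {year: value * fraction
                    for year, value in parent_history.items()}
        for successor, fraction in shares.items()
    }


# Invented numbers: one measure of the old value and estimated shares for it.
old_value_history = {1991: 100, 1992: 200}
estimated_shares = {"successor_a": 0.75, "successor_b": 0.25}
back_cast = allocate_split(old_value_history, estimated_shares)
```

The result is an approximation by construction, so reports built on it should mark such figures as estimated.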


Handling Splits (Sketch)

Step 1: Aggregate over Taxonomy

Step 2: Extrapolate


Lifecycle of Concepts

[Figure: state diagram with the states Active, Inactive, and Superseded; transition labels: Start of validity, Introduction as Inactive, Move, define successor]

- Active: can be used to book new business and appear on reports
- Inactive: can appear on reports but cannot be used to book new business
- Superseded: cannot appear on reports nor be used to book new business

Validity (Lifetime) of Concepts


Validity (Lifetime) of Names of Concepts


Modified Star Schema Design


Principle
Add valid times in dimensions in the Data Warehouse using
- an object table (Country)
- a single property table (here: CountryNames)
both with an associated valid time interval.
Let foreign keys in fact tables refer to the unchanging ID in object tables.
Generate standard Data Marts from this data model as needed, most often a history of measures according to the current dimensional structure.

[Figure: schema]
Country: CountryId, VTimBeg, VTimEnd
CountryNames: CountryId, VTimBeg, VTimEnd, CountryName
CountrySuccession: Id (original identifier), SuccId (direct successor), CurrId (ultimate successor)
Population: CountryId, Year
Year

Coping with a Distributed Environment (Teaser)


Note: Of course, in a global enterprise, all of this happens in a distributed environment.

[Figure: layers of data stores, connected by the flow of master data (e.g. dimension attributes + values) and the flow of transactional data]
- Analytical Data Stores
- Integration Data Stores: History Stores (DWh), Exchange Stores (ODS)
- Transactional Data Stores (additional identifiers, measures tied to ref data), e.g., Claims and Underwriting Systems
- Master Data Stores (identifiers, dimensional attributes), e.g., MDM, CRM, ForEx, Geo DB

Kimball's Types of Slowly Changing Dimensions


Ralph Kimball proposed 3 (well, actually only 2) "poor man's solutions" to the historization of dimensions, called slowly changing dimensions (SCD), in the context of the Star Schema.

SCD Type 1: no history of the dimensional attribute is needed => simply overwrite the value
  e.g. the correction of mistakes in names, birthdays etc.

SCD Type 2: versions of some dimensional attributes are needed => store new records in the dimension table, with a new DWh identifier (ID), the existing stable source system ID, and the new (changed) values
  e.g. a change in the rating of a client, or the new business division a profit center belongs to

SCD Type 3: current and original (or previous) versions are needed => introduce a current and original attribute in the dimension table
  e.g. the current rating and the original rating of each client
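The difference between Type 1 and Type 2 shows up directly in the dimension table. The following SQLite sketch reuses the CLIENT columns from the motivating example and adds a surrogate key `DWH_ID`; that column name, the helper functions, and the sample client are assumptions for illustration, and Type 3 is left out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table CLIENT (
    DWH_ID integer primary key autoincrement,  -- DWh identifier (surrogate)
    CL_ID text,                                -- stable source system ID
    CL_NAME text,
    CL_RATING text);
insert into CLIENT (CL_ID, CL_NAME, CL_RATING) values ('C1', 'Acme Re', 'AA');
""")


def scd_type1_rename(cl_id, new_name):
    """Type 1: overwrite the value in place, no history is kept."""
    con.execute("update CLIENT set CL_NAME = ? where CL_ID = ?",
                (new_name, cl_id))


def scd_type2_rerate(cl_id, new_rating):
    """Type 2: add a new row with a new DWH_ID and the old source system ID."""
    con.execute("""insert into CLIENT (CL_ID, CL_NAME, CL_RATING)
                   select CL_ID, CL_NAME, ? from CLIENT
                   where CL_ID = ? order by DWH_ID desc limit 1""",
                (new_rating, cl_id))


scd_type1_rename("C1", "ACME Re")   # correction of a spelling mistake
scd_type2_rerate("C1", "A")         # rating change that should be versioned
clients = con.execute(
    "select DWH_ID, CL_ID, CL_NAME, CL_RATING from CLIENT order by DWH_ID"
).fetchall()
```

After both changes the table holds two rows for the same source system ID. Nothing in the rows says when the rating changed, which is the gap the valid time design of the earlier slides closes.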

Slowly Changing Dimensions Type 1


Pros
- Simple to understand for business users and simple to implement (especially when using MOLAP tools)
- Requires the least space and has the best response time

Cons
- Simplicity for business users is deceiving
- A change in a dimensional attribute effectively changes the context for all facts captured prior to the change

Slowly Changing Dimensions Type 2


Pros
- Reasonably understandable and simple to implement (regardless of MOLAP / ROLAP tool)
- Captures parts of the history

Cons
- The time of a change in a dimension is not captured
- Requires more space since a single dimensional object is possibly represented in several rows (but this is usually not an issue)
- Can be confusing since changed dimensional data objects appear more than once, with identical source system IDs, but at least one changed attribute value
- Checking when it is ok to refer to which DWh IDs is not possible

Slowly Changing Dimensions Type 3


Pros
- Reasonably simple to implement (regardless of MOLAP / ROLAP tool)
- Captures parts of the history

Cons
- Can only have 2 versions of any attribute (usually original and current)
- Each historized attribute A must be represented by 2 or 3 attributes (namely, A_Original, A_Current and possibly A_Valid_From)
  . This requires more space
  . This can also be confusing to some users
- Checking when it is ok to refer to which DWh IDs is not possible

Literature
General Temporal Database Concepts
[Snodgrass 1999] Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann, 1999. (see http://www.cs.arizona.edu/people/rts/publications.html)
[Zhou et al 2006] Xin Zhou, Fusheng Wang, Carlo Zaniolo: Efficient Temporal Coalescing Query Support in Relational Database Systems. Proc. 17th International Conference on Database and Expert Systems Applications - DEXA '06, 2006.
[Johnston & Weis 2010] Tom Johnston, Randall Weis: Managing Time in Relational Databases: How to Design, Update and Query Temporal Data. Morgan Kaufmann, 2010.

Data Warehouse Design
[Kimball & Ross 2002] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition. John Wiley, 2002.
[Imhoff et al 2003] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger: Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003.
[Golfarelli & Rizzi 2009] Matteo Golfarelli, Stefano Rizzi: Data Warehouse Design: Modern Principles and Methodologies. McGraw Hill, 2009.
[Adamson 2010] Christopher Adamson: Star Schema: The Complete Reference. McGraw Hill, 2010.
