
Database/OLTP (On-Line Transaction Processing) systems

- A database is a collection of data associated with some organization or enterprise.
- A DBMS is the software used to access a database.
- The ER model is often used to view the data of a DBMS abstractly.
- A relational database is a collection of tables, each consisting of a set of attributes (fields) and a large set of tuples (records).
- Relational algebra provides a set of operations that can be performed on relations.
- The database query language SQL is an excellent tool for extracting shallow knowledge from data, while data mining discovers hidden knowledge.
- If we know what we are looking for, a query language can be used; this amounts to manual data mining.
4/30/2012 Data Warehousing 1

Data Warehousing/Dimensional modeling/OLAP systems

- A decision support database that is maintained separately from the organization's operational database.
- Supports information processing by providing a solid platform of consolidated, historical data for analysis.
- A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

Subject-oriented
- Organized around major subjects such as customer, supplier, product, sales, etc.
- Provides a simple and concise view around a particular subject by excluding data that is not useful in the decision process.
- Modeled in accordance with decision makers' needs, not transaction processing needs.

Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
- Data cleaning and data integration techniques are applied.
- Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
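As a rough illustration of what reconciling naming conventions, encoding structures, and attribute measures can look like, here is a minimal Python sketch. The two source systems ("crm" and "billing"), their field names, the gender codes, and the unit conversion are all hypothetical and chosen only for illustration.

```python
# Hypothetical encodings used by different sources for the same attribute.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

# Hypothetical mapping from source system to the field holding the customer name.
NAME_FIELD = {"crm": "cust_name", "billing": "customerName"}

def to_centimetres(value, unit):
    """Convert a length given in 'cm' or 'in' to centimetres."""
    factors = {"cm": 1.0, "in": 2.54}
    return round(value * factors[unit], 2)

def integrate(record, source):
    """Map one source record onto a single warehouse naming and encoding convention."""
    return {
        "customer_name": record[NAME_FIELD[source]].strip().title(),
        "gender": GENDER_CODES[str(record["gender"]).strip().lower()],
        "height_cm": to_centimetres(record["height"], record["height_unit"]),
    }

crm_row = {"cust_name": " alice rao ", "gender": "F", "height": 165, "height_unit": "cm"}
billing_row = {"customerName": "BOB DAS", "gender": 1, "height": 70, "height_unit": "in"}

warehouse_rows = [integrate(crm_row, "crm"), integrate(billing_row, "billing")]
for row in warehouse_rows:
    print(row)
```

After this step both rows use the same attribute names, the same gender codes, and the same unit, whichever source they came from.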

Time-variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
    - Operational database: current value data
    - Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
- The key of operational data, by contrast, may or may not contain a time element.


Non-volatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
    - Does not require transaction processing, recovery, and concurrency control mechanisms.
    - Requires only two operations in data accessing: initial loading of data and access of data.


Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration: a query-driven approach
- Build wrappers/mappings on top of heterogeneous databases.
- A meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
- Requires complex information filtering, and queries compete for resources at the local sources.

Data warehouse: an update-driven approach, high performance
- Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.
- Query processing in the DW does not interfere with the processing at local sources.

Advantages of having a data warehouse
- Provides a competitive advantage by allowing performance measurement and critical adjustments, which help in winning over competitors.
- Enhances business productivity, since quickly available information can be used to take corrective action.
- Facilitates customer relationship management by using information across all lines of business, all departments, and all markets.
- Brings about cost reduction by tracking trends, patterns, and exceptions in a consistent and reliable manner.

To design a warehouse, one needs to understand and analyze business needs.

OLTP (on-line transaction processing)
- Major task of traditional relational DBMS
- Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)
- Major task of data warehouse system
- Data analysis and decision making

Distinct features (OLTP vs. OLAP):
- User and system orientation: customer vs. market
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries

                    OLTP                                    OLAP
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                    flat relational, isolated               integrated, consolidated
usage               repetitive                              ad-hoc
access              read/write, index/hash on prim. key     lots of scans
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response

Why Separate Data Warehouse?


High performance for both systems
- DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data:
- Missing data: decision support requires historical data, which operational DBs do not typically maintain.
- Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
- Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Note: There are more and more systems which perform OLAP analysis directly on relational databases.

Multidimensional data modeling


- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
- Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
    - e.g., in a sales DW the central theme is sales, recorded with respect to time, item, branch, and location
- Each dimension has a table associated with it, called a dimension table.
    - e.g., the dimension table for item contains the attributes item_name, brand, type

- Theme: represented by the fact table
- Facts are numerical measures, e.g., amt_sold (total sales amount in Rs.), units_sold (total number of units sold)

Note: We may display any n-D data as a series of (n-1)-D cubes. Data cubes are n-dimensional and do not confine data to 3-D.
- In data warehousing literature, an n-D base cube is called a base cuboid (lowest level of summarization).
- The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid (denoted by all).
- The lattice of cuboids forms a data cube. Each cuboid represents a different degree of summarization.

Lattice of Cuboids

0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids:        (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid:  (time, item, location, supplier)
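The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The sketch below is only an illustration and assumes that no concept hierarchies are attached to the dimensions, so an n-dimensional cube has exactly 2^n cuboids; the function name `cuboid_lattice` is made up for this example.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: list of k-D cuboids}, one cuboid per subset of the dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])

for k, cuboids in lattice.items():
    # The empty subset is the apex cuboid, conventionally written "all".
    labels = [", ".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {labels}")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print("total:", total_cuboids)
```

The level sizes follow the binomial coefficients (1, 4, 6, 4, 1), which matches the lattice listed above.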

Schemas for Multidimensional Databases

Star schema:
1. A large central table (fact table) containing the bulk of the data, with no redundancy
2. A set of smaller attendant tables (dimension tables), one for each dimension

[Figure: star schema layout, a central Facts table surrounded by Dimension 1 to Dimension 5]

Example of Star Schema


Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
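To see how such a star schema is queried, here is a small sketch using Python's built-in sqlite3 module. It is a reduced version of the schema above: the branch dimension and several attributes are left out for brevity, and all rows are invented sample data. The fact table's primary key is the combination of its dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (time_key, item_key, location_key)
);
""")

# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO time VALUES (?, ?, ?)", [(1, "Q1", 2012), (2, "Q2", 2012)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [(1, "TV", "Acme"), (2, "Phone", "Acme")])
conn.executemany("INSERT INTO location VALUES (?, ?, ?)",
                 [(1, "Pune", "India"), (2, "Delhi", "India")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 5000.0), (1, 2, 1, 20, 4000.0),
    (2, 1, 2, 5, 2500.0), (2, 2, 1, 8, 1600.0),
])

# A typical star join: the fact table joined directly to two dimension tables.
sales_by_quarter_city = conn.execute("""
    SELECT t.quarter, l.city, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t     ON s.time_key = t.time_key
    JOIN location AS l ON s.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()

for row in sales_by_quarter_city:
    print(row)
```

Every dimension is exactly one join away from the fact table, which is the property the star layout is designed around.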


Snowflake schema: A refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
- In the sales example, the item and location dimensions are normalized, giving rise to two more tables, supplier and city.

Snowflake vs. star
- Normalized dimension tables are easy to maintain and save storage space, but the space saving is negligible in comparison to the typical magnitude of the fact table.
- The snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query; system performance may be adversely impacted.
- The snowflake schema is not as popular as the star schema in DW design.
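The extra-join cost can be seen in a small sqlite3 sketch. It is not taken from the slides: the table names `item_star` and `item_snow` and all rows are hypothetical, and the fact table is cut down to a single key and measure. The same question (units sold per supplier type) needs one join against the star-style item table and two joins against the snowflaked tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star style: supplier_type is stored redundantly in the item dimension.
CREATE TABLE item_star (item_key INTEGER PRIMARY KEY, item_name TEXT, supplier_type TEXT);
-- Snowflake style: supplier_type is moved into its own normalized table.
CREATE TABLE supplier  (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_snow (item_key INTEGER PRIMARY KEY, item_name TEXT,
                        supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE sales     (item_key INTEGER, units_sold INTEGER);

INSERT INTO supplier  VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item_star VALUES (1, 'TV', 'wholesale'), (2, 'Phone', 'retail'), (3, 'Radio', 'wholesale');
INSERT INTO item_snow VALUES (1, 'TV', 1), (2, 'Phone', 2), (3, 'Radio', 1);
INSERT INTO sales     VALUES (1, 10), (2, 20), (3, 5), (1, 7);
""")

# Star: one join from the fact table to the dimension table.
star_rows = conn.execute("""
    SELECT i.supplier_type, SUM(s.units_sold)
    FROM sales AS s
    JOIN item_star AS i ON s.item_key = i.item_key
    GROUP BY i.supplier_type
    ORDER BY i.supplier_type
""").fetchall()

# Snowflake: a second join is needed to reach supplier_type.
snowflake_rows = conn.execute("""
    SELECT p.supplier_type, SUM(s.units_sold)
    FROM sales AS s
    JOIN item_snow AS i ON s.item_key = i.item_key
    JOIN supplier AS p  ON i.supplier_key = p.supplier_key
    GROUP BY p.supplier_type
    ORDER BY p.supplier_type
""").fetchall()

print(star_rows)
print(snowflake_rows)
```

Both queries return the same totals; the difference is only in how many tables the query has to traverse, and in the redundancy of supplier_type in `item_star`.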

Example of Snowflake Schema


Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, state_or_province, country

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Fact constellations:
- Sophisticated applications may require multiple fact tables to share dimension tables.
- Such a schema can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.
- For a DW, the fact constellation schema is commonly used, since it can model multiple inter-related subjects.
- For a data mart, the star or snowflake schema is commonly used (the star schema is more popular and efficient).

Example of Fact Constellation


Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  Keys:     time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped

Advantages of Star Schema
- The star schema reflects exactly the way decision makers think in terms of business metrics.
- Users understand the structures very easily.
- It optimizes navigation through the database.
- It is most suitable for query processing.
- It allows query processor software to use better execution plans.

Disadvantages of Star Schema
- Dimension tables are not normalized, which leads to redundancy and possible inconsistency.

In a snowflake schema, the dimension tables are normalized.

Advantages of Snowflake Schema
- Small savings in storage space.
- Normalized structures are easier to update and maintain.

Disadvantages of Snowflake Schema
- The schema is less intuitive, and end users are affected by the complexity.
- It is difficult to browse through the contents.
- Query performance is degraded because of the additional joins.


Examples for defining Star, Snowflake, and Fact Constellation Schemas


Data mining query language (DMQL): 2 language primitives

1. Cube definition
   Syntax:
   define cube <cube_name> [<dimension_list>] : <measure_list>

2. Dimension definition
   Syntax:
   define dimension <dimension_name> as (<attribute_or_dimension_list>)

Star Schema definition

define cube sales_star [time, item, branch, location] :
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)


Snowflake Schema definition

define cube sales_snowflake [time, item, branch, location] :
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, city, province_or_state, country))

Here the item and location tables are normalized; supplier_key and city_key are implicitly included in the item and location dimensions, respectively.

Fact constellation Schema definition

define cube sales [time, item, branch, location] :
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)


define cube shipping [time, item, shipper, from_location, to_location] :
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

The time, item, and location dimensions of the sales cube are shared with the shipping cube.


Inside a dimension table
- Dimension table key: the primary key that uniquely identifies rows in the table.
- Table is wide: it has many columns or attributes.
- Fewer records: far fewer rows than the fact table; a dimension table typically has hundreds of rows while the fact table has millions.
- Textual attributes: attributes are in textual format; they represent textual descriptions of a business dimension.
- Attributes not directly related: some attributes are not related to one another, such as brand and supplier_type in the item dimension table, but both are attributes of item.

- Not normalized: for efficient query performance, it is best if the query takes attributes from the dimension table and goes directly to the fact table. If dimension tables are normalized, the query travels through intermediary tables, which hurts performance. Dimension tables are flattened out, not normalized, in the star schema, while some dimensions are normalized in the snowflake schema.
- Drilling down, rolling up: the attributes in a dimension table provide the ability to move from higher levels of aggregation to lower levels of detail, and back.
    - e.g., get the total sales by city, then drill down to total sales by zip or roll up to total sales by state
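The city, zip, and state example can be sketched in a few lines of Python. This is only an illustration: the location rows and sales amounts are invented, and the helper `total_sales_by` is a hypothetical name. The point is that one flattened dimension row carries every level of the hierarchy, so changing the level of aggregation only changes which attribute is used for grouping.

```python
from collections import defaultdict

# Hypothetical flattened location dimension: zip -> (city, state).
LOCATION = {
    "411001": ("Pune", "Maharashtra"),
    "411045": ("Pune", "Maharashtra"),
    "400001": ("Mumbai", "Maharashtra"),
    "560001": ("Bengaluru", "Karnataka"),
}

# Hypothetical fact rows at the zip level: (zip, dollars_sold).
SALES = [("411001", 100.0), ("411045", 50.0), ("400001", 200.0),
         ("560001", 75.0), ("411001", 25.0)]

# Each hierarchy level maps a zip code to the attribute value at that level.
LEVELS = {
    "zip": lambda z: z,
    "city": lambda z: LOCATION[z][0],
    "state": lambda z: LOCATION[z][1],
}

def total_sales_by(level):
    """Aggregate dollars_sold at the requested level of the location hierarchy."""
    totals = defaultdict(float)
    for zip_code, amount in SALES:
        totals[LEVELS[level](zip_code)] += amount
    return dict(totals)

by_city = total_sales_by("city")
by_zip = total_sales_by("zip")      # drill down from city to zip
by_state = total_sales_by("state")  # roll up from city to state

print(by_city)
print(by_zip)
print(by_state)
```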

- Multiple hierarchies: dimension tables often provide for multiple hierarchies, so that drilling down may be performed along any of them.
    - e.g., the marketing department may classify items by its own item types, while the accounting department groups items by a different set of item types. In this case the item table will contain both attributes, marketing_item_type and accounting_item_type.


Attribute Hierarchy
- The attributes of a dimension may be related by a total order, or in some cases by a partial order.
    - Total order (location): street < city < state < country
    - Partial order (time): day < month < quarter < year, and day < week < year
- Some attribute hierarchies common to many applications are predefined (e.g., time), while others are user-defined.
- Attribute hierarchies may be provided by domain experts or generated automatically by statistical analysis of the data distribution.

Inside the fact table
- Concatenated keys: a row in the fact table relates to a combination of rows from all the dimension tables. The primary key of the fact table is the concatenation of the primary keys of all the dimension tables.
- Data grain: the data grain is the level of detail of the measurements or metrics. The quantity sold can be recorded for a particular item on a given date to a given customer (if a customer dimension is also added), which is a very detailed level. When the measure is instead summarized (for example, quantity sold per month), the data grain is at a higher level.
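A minimal Python sketch of both ideas, under invented data: fact rows are stored in a dictionary whose key is the combination of dimension keys (date, item, customer), and moving to a coarser grain means re-keying and summing. The key values such as "item-1" and "cust-A" and the function name `coarsen_to_month` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Fact rows at a fine grain: key = (date, item_key, customer_key), value = quantity sold.
daily_facts = {
    (date(2012, 4, 1), "item-1", "cust-A"): 3,
    (date(2012, 4, 1), "item-1", "cust-B"): 2,
    (date(2012, 4, 15), "item-1", "cust-A"): 4,
    (date(2012, 5, 2), "item-1", "cust-A"): 1,
}

def coarsen_to_month(facts):
    """Re-key the facts at the (year, month, item) grain by summing quantities."""
    monthly = defaultdict(int)
    for (day, item_key, _customer_key), quantity in facts.items():
        monthly[(day.year, day.month, item_key)] += quantity
    return dict(monthly)

monthly_facts = coarsen_to_month(daily_facts)
print(monthly_facts)
```

Going from the fine grain to the coarse one is always possible by aggregation; the reverse is not, which is why the grain has to be chosen before the fact table is loaded.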


- Fully additive measures: the values can be summed up by simple addition.
- Semi-additive measures: derived attributes, such as dollars earned per quantity, are not additive. Distinguish semi-additive measures from fully additive measures when performing aggregation in queries.
- Table deep, not wide: fewer attributes than a dimension table, but a large number of rows. The fact table is narrow but deep; it is spread vertically, while a dimension table is spread horizontally.
- Sparse data: there may be combinations of dimension table attributes for which the fact table entry is null.
- Degenerate dimensions: attributes such as sales_order_no that appear in the fact table although they are not measures. They are useful for analysis such as the average number of items per order.
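A short worked example shows why a derived ratio such as dollars per unit must not be added across rows. The two fact rows below are invented; the sketch simply compares adding the per-row ratios with recomputing the ratio from the fully additive totals.

```python
# Hypothetical fact rows: (units_sold, dollars_sold).
rows = [(10, 100.0), (1, 50.0)]

# Fully additive measures can simply be summed.
total_units = sum(units for units, _ in rows)
total_dollars = sum(dollars for _, dollars in rows)

# Derived per-row ratio: dollars per unit.
per_row_ratio = [dollars / units for units, dollars in rows]

# Adding the ratios gives a number with no business meaning.
wrong_sum = sum(per_row_ratio)

# The aggregate ratio has to be recomputed from the additive totals.
correct_ratio = total_dollars / total_units

print(total_units, total_dollars)
print(wrong_sum, round(correct_ratio, 2))
```

A common design consequence is to store the additive components (units and dollars) in the fact table and derive the ratio at query time.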

Typical OLAP Operations


- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
- Drill down (roll down): the reverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
- Slice and dice: project and select.
    - Slice: a selection on one dimension of the cube, resulting in a subcube.
    - Dice: a selection on two or more dimensions.
- Pivot (rotate): reorient the cube for visualization, e.g., 3-D to a series of 2-D planes.
- Other operations
    - Drill across: involves (goes across) more than one fact table.
    - Drill through: goes through the bottom level of the cube to its back-end relational tables (using SQL).
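The core operations can be imitated on a tiny in-memory cube. The sketch below is an illustration, not a description of any OLAP product: the cube is a plain dictionary keyed by (time, item, location), the cell values are invented, and the function names (`dice`, `slice_`, `roll_up`, `pivot`) are made up for this example. Roll-up is shown in its dimension-reduction form only.

```python
from collections import defaultdict

DIMS = ("time", "item", "location")

# Hypothetical base cuboid: (time, item, location) -> units_sold.
cube = {
    ("Q1", "TV", "Pune"): 10, ("Q1", "Phone", "Pune"): 20,
    ("Q1", "TV", "Delhi"): 5, ("Q2", "TV", "Pune"): 7,
    ("Q2", "Phone", "Delhi"): 3,
}

def dice(cells, **allowed):
    """Keep cells whose coordinates lie in the allowed value sets of the named dimensions."""
    return {
        key: value for key, value in cells.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }

def slice_(cells, dim, value):
    """Slice: a selection on a single dimension."""
    return dice(cells, **{dim: {value}})

def roll_up(cells, keep):
    """Roll up by dimension reduction: sum away every dimension not listed in keep."""
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[DIMS.index(dim)] for dim in keep)] += value
    return dict(totals)

def pivot(table_2d):
    """Pivot a 2-D result by swapping its two axes."""
    return {(b, a): value for (a, b), value in table_2d.items()}

q1_slice = slice_(cube, "time", "Q1")
q1_tv_dice = dice(cube, time={"Q1"}, item={"TV"})
by_time_item = roll_up(cube, ("time", "item"))
by_item_time = pivot(by_time_item)
apex = roll_up(cube, ())

print(q1_slice)
print(q1_tv_dice)
print(by_time_item)
print(by_item_time)
print(apex)
```

Rolling up with an empty `keep` tuple aggregates every dimension away and yields the single apex cell.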


Conceptual Modeling of Data Warehouses


Criteria for a dimensional model
- The model should provide the best data access.
- The whole model should be query-centric.
- It must be optimized for queries and analysis.
- It must be structured in such a way that every dimension can interact equally with the fact table.
- The model should allow drilling down or rolling up along dimension hierarchies.

Dimensions: time, location, item, branch
Facts: quantity of sales, sales amount

Conceptual Modeling of Data Warehouses


Design Decisions
- Choosing the process: selecting the subjects for the first set of logical structures to be designed.
- Choosing the grain: determining the level of detail for the data in the data structures.
- Identifying and conforming the dimensions: choosing the dimensions to be included.
- Choosing the facts: selecting the metrics or units of measurement to be included.
- Choosing the duration of the database: determining how far back in time one should go for historical data.
