Data Warehouse - 1

Database/OLTP (On-line transaction processing) systems
- A database is a collection of data associated with some organization or enterprise.
- A DBMS is the software used to access a database.
- The ER model is often used to view the data of a DBMS abstractly.
- A relational database is a collection of tables, each consisting of a set of attributes (fields) and a large set of tuples (records).
- Relational algebra provides a set of operations that can be performed on relations.
- A database query language such as SQL is an excellent tool for extracting shallow knowledge from data, while data mining discovers hidden knowledge.
- If we know what we are looking for, a query language can be used for manual data mining.
4/30/2012 Data Warehousing 1
Data Warehousing/ Dimensional modeling/OLAP systems

- A data warehouse is a decision support database that is maintained separately from the organization's operational database.
- It supports information processing by providing a solid platform of consolidated, historical data for analysis.
- A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

Subject-oriented
- Organized around major subjects such as customer, supplier, product, and sales.
- Provides a simple and concise view around a particular subject by excluding data that is not useful for the decision process.
- Modeled in accordance with decision makers' needs and not transaction processing needs.

Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
- Data cleaning and data integration techniques are applied.
- Consistency in naming conventions, encoding structures, attribute measures, etc. is ensured among the different data sources.
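The integration step can be illustrated with a minimal sketch. The two source layouts (`cust_nm`/`sex` and `name`/`gender_flag`), the code table, and the helper `to_warehouse_row` are all hypothetical, chosen only to show how differing names and encodings might be mapped onto one warehouse convention.

```python
# Hypothetical source encodings; real systems would document their own.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}


def to_warehouse_row(source_row, name_field, gender_field):
    """Map one source record onto the warehouse's naming and encoding conventions."""
    raw = str(source_row[gender_field]).strip().lower()
    return {
        "customer_name": source_row[name_field].strip().title(),
        "gender": GENDER_CODES.get(raw, "UNKNOWN"),
    }


# Two heterogeneous sources describing customers in different ways (made-up data).
crm_rows = [{"cust_nm": "asha rao", "sex": "F"}]
billing_rows = [{"name": "RAVI KUMAR ", "gender_flag": 1}]

integrated = (
    [to_warehouse_row(r, "cust_nm", "sex") for r in crm_rows]
    + [to_warehouse_row(r, "name", "gender_flag") for r in billing_rows]
)
print(integrated)
```

After this step both records share the same attribute names and the same gender encoding, which is the kind of consistency the Integrated property asks for.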
Time-variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
    - Operational database: current value data.
    - Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
- The key of operational data, by contrast, may or may not contain a time element.
Non-volatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
    - It does not require transaction processing, recovery, and concurrency control mechanisms.
    - It requires only two operations in data accessing: initial loading of data and access of data.
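As a rough illustration of the time-variant and non-volatile properties, the sketch below models a store that offers only the two operations named above. The class `WarehouseStore` and the snapshot layout are assumptions made for this example, not part of any real product.

```python
class WarehouseStore:
    """Toy non-volatile store: rows can be loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends snapshots; the time element is part of each row's key.
        self._rows.extend(tuple(r) for r in rows)

    def access(self, predicate=lambda row: True):
        # Read-only access returns a new list of immutable tuples.
        return [row for row in self._rows if predicate(row)]


store = WarehouseStore()
# (snapshot_date, account_id, balance): hypothetical operational snapshots
store.load([("2012-03-31", "A1", 500), ("2012-04-30", "A1", 650)])
history = store.access(lambda row: row[1] == "A1")
print(history)
```

An operational system would overwrite the balance of account A1; here both snapshots survive, so the history stays available for analysis.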
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration: a query-driven approach
- Build wrappers/mappings on top of heterogeneous databases.
- A meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
- Requires complex information filtering, and queries compete for resources at the local sources.
Data warehouse: an update-driven approach, high performance
- Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.
- Query processing in the DW does not interfere with the processing at local sources.
Advantages of having a Data warehouse
- Provides a competitive advantage by allowing performance measurement and critical adjustments, which help in winning over competitors.
- Enhances business productivity, since quickly available information can be used to take corrective action.
- Facilitates customer relationship management by using information across all lines of business, all departments, and all markets.
- Brings about cost reduction by tracking trends, patterns, and exceptions in a consistent and reliable manner.
- To design a warehouse, one needs to understand and analyze the business needs.
OLTP (on-line transaction processing)
- Major task of traditional relational DBMS.
- Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
- Major task of data warehouse system.
- Data analysis and decision making.
Distinct features (OLTP vs. OLAP):
- User and system orientation: customer vs. market
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries
                    OLTP                                                      OLAP
users               clerk, IT professional                                    knowledge worker
function            day to day operations                                     decision support
DB design           application-oriented                                      subject-oriented
data                current, up-to-date, detailed, flat relational, isolated  historical, summarized, multidimensional, integrated, consolidated
usage               repetitive                                                ad-hoc
access              read/write, index/hash on prim. key                       lots of scans
unit of work        short, simple transaction                                 complex query
# records accessed  tens                                                      millions
# users             thousands                                                 hundreds
DB size             100MB-GB                                                  100GB-TB
metric              transaction throughput                                    query throughput, response
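The contrast in access patterns can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration; a real OLTP system and a real warehouse would be separate databases, which this single in-memory database does not attempt to model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "north", 2011, 100.0), (2, "south", 2011, 250.0),
     (3, "north", 2012, 300.0), (4, "south", 2012, 50.0)],
)

# OLTP-style: short read/write transaction touching one record via the primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 25 WHERE order_id = ?", (4,))

# OLAP-style: read-only query that scans and summarizes many records.
totals = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
print(totals)
```

The update touches a single row found through the key, while the summary query has to read every row, which matches the "tens" versus "millions" of records accessed in the comparison.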
Why Separate Data Warehouse?
- High performance for both systems
    - DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
    - Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
    - Missing data: decision support requires historical data, which operational DBs do not typically maintain
    - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
    - Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
- Note: more and more systems perform OLAP analysis directly on relational databases.
Multidimensional data modeling
- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
- Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
    - e.g., sales DW: the central theme is sales, with respect to time, item, branch, and location
- Each dimension has a table associated with it, called a dimension table.
    - e.g., the dimension table for item contains the attributes item_name, brand, type
- The theme is represented by a fact table.
- Facts are numerical measures.
    - e.g., amt_sold (total sales amount in Rs.), units_sold (total no. of units sold)
- Note: data cubes are n-dimensional and do not confine data to 3-D. We may display any n-D data as a series of (n-1)-D cubes.
- In data warehousing literature, an n-D base cube is called a base cuboid (lowest level of summarization).
- The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid (denoted by all).
- The lattice of cuboids forms a data cube. Each cuboid represents a different degree of summarization.
Lattice of Cuboids
- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
- 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
- 4-D (base) cuboid: (time, item, location, supplier)
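The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions. The sketch below assumes no concept hierarchies within a dimension, so n dimensions give 2^n cuboids; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Level k of the lattice holds every k-dimensional cuboid (a subset of k dimensions).
lattice = {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

for k, cuboids in lattice.items():
    labels = [", ".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {labels}")

# Total number of cuboids across all levels of the lattice.
total = sum(len(cuboids) for cuboids in lattice.values())
print(total)
```

For the four dimensions here the levels hold 1, 4, 6, 4, and 1 cuboids, from the apex cuboid down to the base cuboid.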
Schemas for Multidimensional Databases
Star schema:
1. A large central table (fact table) containing the bulk of the data, with no redundancy
2. A set of smaller attendant tables (dimension tables), one for each dimension
[Figure: a central Facts table surrounded by Dimension 1 through Dimension 5]
Example of Star Schema
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, state_or_province, country)
Sales Fact Table:
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales
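A cut-down version of this star schema can be tried out with `sqlite3`. Only the item and location dimensions are created, the avg_sales measure is left out, and all rows are invented for the example, so this is a sketch of the shape rather than a full warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER,
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER,
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
INSERT INTO item VALUES (1, 'phone', 'Acme', 'electronics', 'wholesale'),
                        (2, 'radio', 'Zenit', 'electronics', 'retail');
INSERT INTO location VALUES (10, '1 Main St', 'Pune', 'Maharashtra', 'India'),
                            (11, '9 Lake Rd', 'Toronto', 'Ontario', 'Canada');
INSERT INTO sales_fact VALUES (20120401, 1, 100, 10, 3, 600.0),
                              (20120401, 2, 100, 10, 1, 40.0),
                              (20120402, 1, 100, 11, 2, 400.0);
""")

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT l.country, i.brand, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item i ON i.item_key = f.item_key
    JOIN location l ON l.location_key = f.location_key
    GROUP BY l.country, i.brand
    ORDER BY l.country, i.brand
""").fetchall()
print(rows)
```

The fact table's primary key is the combination of the dimension keys, and the query reaches any dimension attribute with a single join.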
Snowflake schema:
- A refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
- In the example, the item and location dimensions have been normalized to give rise to two more tables, supplier and city.
Snowflake vs. star
- Normalized dimension tables are easy to maintain and save storage space, but the saving is negligible in comparison to the typical magnitude of the fact table.
- The snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query, so system performance may be adversely impacted.
- Snowflake is not as popular as the star schema in DW design.
Example of Snowflake Schema
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_key)
    - supplier (supplier_key, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city_key)
    - city (city_key, city, state_or_province, country)
Sales Fact Table:
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales
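The extra join cost of the snowflake layout can be seen in a small `sqlite3` sketch. Only the location and city tables and a reduced fact table are modeled, with made-up rows; the point is only the join path from a fact to its country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT,
                   state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales_fact (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                         location_key INTEGER, units_sold INTEGER, dollars_sold REAL);
INSERT INTO city VALUES (1, 'Pune', 'Maharashtra', 'India'),
                        (2, 'Toronto', 'Ontario', 'Canada');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '5 Hill Rd', 1), (12, '9 Lake Rd', 2);
INSERT INTO sales_fact VALUES (20120401, 1, 100, 10, 3, 600.0),
                              (20120401, 1, 100, 11, 1, 200.0),
                              (20120402, 1, 100, 12, 2, 400.0);
""")

# Reaching country now takes two joins (fact -> location -> city) instead of one.
by_country = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location l ON l.location_key = f.location_key
    JOIN city c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(by_country)
```

In the star layout, country sits directly in the location table, so the same question needs one join fewer; the city details are stored once per city here instead of once per street address.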
Fact constellations:
- Sophisticated applications may require multiple fact tables to share dimension tables.
- This kind of schema can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.
- For a DW, the fact constellation schema is commonly used, since it can model multiple inter-related subjects.
- For a data mart, the star or snowflake schema is commonly used (the star schema is more popular and efficient).
Example of Fact Constellation
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, state_or_province, country)
- shipper (shipper_key, shipper_name, location_key, shipper_type)
Sales Fact Table:
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table:
- Keys: time_key, item_key, shipper_key, from_location, to_location
- Measures: dollars_cost, units_shipped
Advantages of Star Schema
- The star schema reflects exactly the way decision makers think in terms of business metrics.
- Users understand the structures very easily.
- It optimizes navigation through the database.
- It is most suitable for query processing.
- It allows query processor software to use better execution plans.
Disadvantages of Star Schema
- Dimension tables are not normalized, leading to redundancy and possible inconsistency.
In a snowflake schema, all tables are fully normalized.
Advantages of Snowflake Schema
- Small savings in storage space.
- Normalized structures are easier to update and maintain.
Disadvantages of Snowflake Schema
- The schema is less intuitive, and end users are affected by the complexity.
- It is difficult to browse through the contents.
- Query performance is degraded because of additional joins.
Examples for defining Star, Snowflake, and Fact Constellation Schemas
Data mining query language (DMQL) provides two language primitives:
1. Cube definition
   Syntax: define cube <cube_name> [<dimension_list>] : <measure_list>
2. Dimension definition
   Syntax: define dimension <dimension_name> as (<attribute_or_dimension_list>)
Star Schema definition
define cube sales_star [time, item, branch, location] : dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Snowflake Schema definition
define cube sales_snowflake [time, item, branch, location] : dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city (city_key, city, province_or_state, country))
Here the item and location tables are normalized: supplier_key and city_key are implicitly part of the item and location dimensions, respectively.
Fact constellation Schema definition
define cube sales [time, item, branch, location] : dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location] : dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
The time, item, and location dimensions of the sales cube are shared with the shipping cube.
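To make the structure of a cube definition concrete, the sketch below splits a one-line `define cube` statement into its cube name, dimension list, and measure list. The helper `parse_define_cube` and its regular expressions are hypothetical and cover only the simple form used in these examples; this is not a DMQL implementation.

```python
import re

# Hypothetical helper: handles only the simple one-line "define cube" form.
CUBE_RE = re.compile(r"define\s+cube\s+(\w+)\s*\[([^\]]*)\]\s*:\s*(.+)", re.IGNORECASE)


def parse_define_cube(statement):
    match = CUBE_RE.fullmatch(statement.strip())
    if match is None:
        raise ValueError("not a 'define cube' statement")
    name, dims, measures = match.groups()
    dimension_list = [d.strip() for d in dims.split(",") if d.strip()]
    # Split only on commas that start a new "name =" pair, then split each pair once on "=".
    measure_list = {}
    for part in re.split(r",\s*(?=\w+\s*=)", measures):
        measure_name, expression = part.split("=", 1)
        measure_list[measure_name.strip()] = expression.strip()
    return {"cube": name, "dimensions": dimension_list, "measures": measure_list}


shipping = parse_define_cube(
    "define cube shipping [time, item, shipper, from_location, to_location] : "
    "dollars_cost = sum(cost_in_dollars), units_shipped = count(*)"
)
print(shipping)
```

The bracketed part lists the dimension tables that surround the fact table, and each measure pairs a name with an aggregate over a fact column.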
Inside a dimension table
- Dimension table key: a primary key that uniquely identifies rows in the table.
- Table is wide: it has many columns or attributes.
- Fewer records: a dimension table has fewer rows than the fact table, typically hundreds of rows against millions of facts.
- Textual attributes: attributes are in textual format; they represent textual descriptions of a business dimension.
- Attributes not directly related: some attributes are not related to one another, such as brand and supplier_type in the item dimension table, but both are attributes of item.
- Not normalized: for efficient query performance, it is best if the query takes attributes from the dimension table and goes directly to the fact table. If dimension tables are normalized, the query travels through intermediary tables, affecting performance. Dimension tables are flattened out, not normalized, in the star schema, while some dimensions are normalized in the snowflake schema.
- Drilling down, rolling up: the attributes in a dimension table provide the ability to move between higher levels of aggregation and lower levels of detail.
    - e.g., get the total sales by city and drill down to total sales by zip, or roll up to total sales by state
- Multiple hierarchies: dimension tables often provide for multiple hierarchies, so that drilling down may be performed along any of them.
    - e.g., the marketing department may have its own way of classifying items by defining item types, while the accounting department may group items by defining its own item types. In this case the item table will contain both the attributes marketing_item_type and accounting_item_type.
Attribute Hierarchy
- The attributes of a dimension may be related by a total order, or in some cases by a partial order.
    - Total order (location): street < city < state < country
    - Partial order (time): day < {month < quarter; week} < year
- Some attribute hierarchies common to many applications, such as time, are predefined, while others are user-defined.
- Attribute hierarchies may be provided by domain experts or generated automatically by statistical analysis of the data distribution.
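The difference between a total and a partial order can be checked with a small sketch. The `PARENTS` map encodes the location and time hierarchies as "rolls up into" links; the map and the helper names are illustrative assumptions.

```python
# Each attribute maps to the attributes directly above it in the hierarchy.
PARENTS = {
    "street": ["city"], "city": ["state"], "state": ["country"], "country": [],
    "day": ["month", "week"], "month": ["quarter"], "quarter": ["year"],
    "week": ["year"], "year": [],
}


def is_below(lower, upper):
    """True if `upper` can be reached from `lower` by climbing the hierarchy."""
    stack = list(PARENTS[lower])
    while stack:
        current = stack.pop()
        if current == upper:
            return True
        stack.extend(PARENTS[current])
    return False


def comparable(a, b):
    """True if one attribute rolls up into the other (or they are the same)."""
    return a == b or is_below(a, b) or is_below(b, a)


print(comparable("street", "country"))  # location: every pair is comparable
print(comparable("week", "month"))      # time: weeks do not roll up into months
```

In the location hierarchy every pair of attributes is comparable, which makes it a total order. In the time hierarchy week and month are not comparable, since a week can span two months, so the order is only partial.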
Inside the fact table
- Concatenated keys: a row in the fact table relates to a combination of rows from all the dimension tables. The primary key of the fact table is the concatenation of the primary keys of all the dimension tables.
- Data grain: the data grain is the level of detail for the measurements or metrics. The quantity sold of a particular item on a given date to a given customer (if a customer dimension is also added) is at a very detailed level. If the quantity sold is instead kept per item per month, the data grain is at a higher level.
- Fully additive measures: the values can be summed up by simple addition.
- Semi-additive measures: derived attributes such as dollars earned per quantity are not additive. Distinguish semi-additive measures from fully additive measures when performing aggregation in queries.
- Table deep, not wide: fewer attributes than a dimension table, but a large number of rows. The fact table is narrow but deep; it is spread vertically, while a dimension table is spread horizontally.
- Sparse data: there could be combinations of dimension table attributes for which the fact table entry is null.
- Degenerate dimensions: attributes such as sales_order_no that appear in the fact table although they are not measures. They are useful for analyses such as the average number of items per order.
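A short worked example shows why a derived ratio needs different handling from an additive measure. The two fact rows are invented for illustration.

```python
# (units_sold, dollars_sold) for two hypothetical fact rows of the same item
facts = [(10, 100.0), (1, 30.0)]

# Fully additive measures: simple addition is correct at any level of aggregation.
total_units = sum(units for units, _ in facts)
total_dollars = sum(dollars for _, dollars in facts)

# Derived ratio: adding or averaging the row-level ratios gives a misleading figure.
row_ratios = [dollars / units for units, dollars in facts]
naive_average = sum(row_ratios) / len(row_ratios)

# Recompute the ratio from the aggregated additive measures instead.
dollars_per_unit = total_dollars / total_units
print(total_units, total_dollars, naive_average, dollars_per_unit)
```

Averaging the row ratios gives 20 dollars per unit, while the aggregated figure is 130 dollars over 11 units, a little under 12. Storing the additive components in the fact table and deriving the ratio at query time avoids the discrepancy.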
Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
- Drill down (roll down): the reverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
- Slice and dice: project and select.
    - Slice: selection on one dimension of the cube, resulting in a subcube.
    - Dice: selection on two or more dimensions.
- Pivot (rotate): reorient the cube for visualization, e.g., 3-D to a series of 2-D planes.
- Other operations
    - Drill across: queries involving (across) more than one fact table.
    - Drill through: go through the bottom level of the cube to its back-end relational tables (using SQL).
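The core operations can be sketched on a tiny cube held in a Python dictionary. The cube contents, the city-to-country map, and the function names are assumptions for illustration; OLAP servers implement these operations very differently in practice.

```python
from collections import defaultdict

# Base cuboid keyed by (quarter, city, item); values are units_sold (made-up data).
cube = {
    ("Q1", "Pune", "phone"): 5, ("Q1", "Mumbai", "phone"): 7,
    ("Q1", "Toronto", "radio"): 2, ("Q2", "Pune", "radio"): 4,
}
CITY_TO_COUNTRY = {"Pune": "India", "Mumbai": "India", "Toronto": "Canada"}


def roll_up_city_to_country(cells):
    """Roll up by climbing the location hierarchy: city -> country."""
    out = defaultdict(int)
    for (quarter, city, item), units in cells.items():
        out[(quarter, CITY_TO_COUNTRY[city], item)] += units
    return dict(out)


def roll_up_drop_item(cells):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (quarter, city, _item), units in cells.items():
        out[(quarter, city)] += units
    return dict(out)


def slice_(cells, quarter):
    """Slice: select on one dimension (time)."""
    return {k: v for k, v in cells.items() if k[0] == quarter}


def dice(cells, quarters, cities):
    """Dice: select on two dimensions at once (time and location)."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in cities}


def pivot(cells_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in cells_2d.items()}


by_country = roll_up_city_to_country(cube)
q1 = slice_(cube, "Q1")
pune_only = dice(cube, {"Q1", "Q2"}, {"Pune"})
rotated = pivot(roll_up_drop_item(cube))
print(by_country, q1, pune_only, rotated, sep="\n")
```

Drill down is the reverse direction of the two roll-up functions: it returns from `by_country` to the city-level cells, which is only possible because the more detailed cuboid is still stored.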
Conceptual Modeling of Data Warehouses
Criteria for a dimensional model
- The model should provide the best data access.
- The whole model should be query-centric.
- It must be optimized for queries and analysis.
- It must be structured in such a way that every dimension can interact equally with the fact table.
- The model should allow drilling down or rolling up along dimension hierarchies.
Dimensions: time, location, item, branch
Facts: quantity of sales, sales amount
Design decisions
- Choosing the process: selecting the subjects for the first set of logical structures to be designed.
- Choosing the grain: determining the level of detail for the data in the data structures.
- Identifying and conforming the dimensions: choosing the dimensions to be included.
- Choosing the facts: selecting the metrics or units of measurement to be included.
- Choosing the duration of the database: determining how far back in time one should go for historical data.
