used to access a database
- The ER model is often used to view the data of a DBMS abstractly
- A relational database is a collection of tables, each consisting of a set of attributes (fields) and a large set of tuples (records)
- Relational algebra provides a set of operations that can be performed on relations
- The database query language SQL is an excellent tool for extracting shallow knowledge from data, while data mining discovers hidden knowledge
- If we know what we are looking for, a query language can be used (manual data mining)
4/30/2012 Data Warehousing 1
Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Non-volatile: a physically separate store of data transformed from the operational environment
access of data
Advantages of having a data warehouse
- Provides a competitive advantage by allowing performance measurement and critical adjustments, which helps in winning over competitors
- Enhances business productivity, as quickly available information can be used to take corrective action
- Facilitates customer relationship management by using information across all lines of business, all departments and all markets
- Brings about cost reduction by tracking trends, patterns and exceptions in a consistent and reliable manner
To design a warehouse, one needs to understand and analyze the business needs
OLTP vs. OLAP

                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day to day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date, detailed,    historical, summarized, multidimensional,
                     flat relational, isolated         integrated, consolidated
usage                repetitive                        ad-hoc
access                                                 lots of scans
unit of work                                           complex query
# records accessed                                     millions
# users              thousands                         hundreds
DB size              100MB-GB                          100GB-TB
metric               transaction throughput            query throughput, response
- Theme: represented by the fact table
- Facts are numerical measures
- e.g., amt_sold (total sales amount in Rs.), units_sold (total no. of units sold)
Note: We may display any n-D data as a series of (n-1)-D cubes. Data cubes are n-dimensional and do not confine data to 3-D.
- In data warehousing literature, an n-D base cube is called a base cuboid (lowest level of summarization)
- The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid (denoted by all)
- The lattice of cuboids forms a data cube
- Each cuboid represents a different degree of summarization
Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
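The lattice can be enumerated mechanically: every subset of the dimension list is one cuboid, so n dimensions without concept hierarchies give 2^n cuboids. The following is a minimal sketch in Python; the function name lattice_of_cuboids is illustrative and not part of the slides.

```python
from itertools import combinations

# Dimensions of the example cube, as listed in the lattice.
DIMENSIONS = ("time", "item", "location", "supplier")


def lattice_of_cuboids(dimensions):
    """Return {k: [cuboids with exactly k dimensions]} for k = 0..n."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = lattice_of_cuboids(DIMENSIONS)
for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, denoted "all".
    labels = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {labels}")

# Every subset of the dimensions is a cuboid, so the total is 2**n.
total_cuboids = sum(len(c) for c in lattice.values())
```

For the four dimensions above this yields 1 + 4 + 6 + 4 + 1 cuboids.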
Schemas for Multidimensional Databases
Star schema:
1. A large central table (fact table) containing the bulk of the data, with no redundancy
2. A set of smaller attendant tables (dimension tables), one for each dimension
[Figure: a central Facts table surrounded by Dimension 1, Dimension 2, Dimension 3, Dimension 4 and Dimension 5]
[Figure: example star schema]
Sales Fact Table: time_key, item_key, branch_key
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
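A typical star-schema query joins the fact table directly to each dimension it needs and then aggregates. The sketch below uses Python's built-in sqlite3 module with only the item and location dimensions; the sample rows, and the location_key, units_sold and dollars_sold columns placed in the fact table, are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key),
                    location_key INTEGER REFERENCES location(location_key),
                    units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "TV-40", "Acme", "electronics", "wholesale"),
    (2, "Radio-X", "Acme", "electronics", "retail"),
    (3, "Desk", "Oak&Co", "furniture", "wholesale"),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (10, "1 Main St", "Pune", "Maharashtra", "India"),
    (11, "2 Lake Rd", "Mumbai", "Maharashtra", "India"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, 10, 2, 800.0),
    (2, 10, 5, 250.0),
    (3, 11, 1, 300.0),
    (1, 11, 1, 400.0),
])

# One join per dimension: the fact table sits at the center of the star.
rows = conn.execute("""
    SELECT i.brand, l.city, SUM(s.dollars_sold) AS dollars_sold
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY i.brand, l.city
    ORDER BY i.brand, l.city
""").fetchall()

for brand, city, dollars in rows:
    print(brand, city, dollars)
```

Each dimension is reached in a single hop from the fact table, which is the property the star layout is designed around.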
Snowflake schema: A refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
The item and location dimensions have been normalized to give rise to two more tables, supplier and city.
[Figure: example snowflake schema]
supplier: supplier_key, supplier_type
item: item_key, item_name, brand, type, supplier_key
city: city_key, city, state_or_province, country
branch: branch_key, branch_name, branch_type
Measures: dollars_sold, avg_sales
Fact constellations:
- Sophisticated applications may require multiple fact tables that share dimension tables
- Such a schema can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation
- For a data warehouse, the fact constellation schema is commonly used, since it can model multiple inter-related subjects
- For a data mart, the star or snowflake schema is commonly used (the star schema is more popular and efficient)
[Figure: example fact constellation schema]
shipper: shipper_key, shipper_name, location_key, shipper_type
Sales fact table: branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
location: location_key, street, city, state_or_province, country
branch: branch_key, branch_name, branch_type
Shipping fact table: from_location, to_location; measures: dollars_cost, units_shipped
Advantages of Star Schema
- The star schema reflects exactly the way decision makers think in terms of business metrics
- Users understand the structure very easily
- It optimizes navigation through the database
- It is most suitable for query processing
- It allows query processor software to use better execution plans
Disadvantages of Star Schema
- Dimension tables are not normalized, leading to redundancy and inconsistency
In a snowflake schema, all tables are fully normalized
Advantages of Snowflake Schema
- Small savings in storage space
- Normalized structures are easier to update and maintain
Disadvantages of Snowflake Schema
- The schema is less intuitive and end-users are affected by the complexity
- Difficult to browse through the contents
- Degraded query performance because of additional joins
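The cost of the additional joins can be seen by asking the same question of both layouts. This sketch, using Python's sqlite3 module, totals sales by supplier_type once against a star-style item table and once against a snowflake-style item/supplier pair; the table names item_star and item_snow and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: supplier_type is stored directly in the item dimension.
CREATE TABLE item_star (item_key INTEGER PRIMARY KEY, item_name TEXT,
                        supplier_type TEXT);
-- Snowflake: supplier_type moves into its own normalized table.
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_snow (item_key INTEGER PRIMARY KEY, item_name TEXT,
                        supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE sales (item_key INTEGER, dollars_sold REAL);

INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item_star VALUES (1, 'TV', 'wholesale'), (2, 'Radio', 'retail'),
                             (3, 'Desk', 'wholesale');
INSERT INTO item_snow VALUES (1, 'TV', 1), (2, 'Radio', 2), (3, 'Desk', 1);
INSERT INTO sales VALUES (1, 800.0), (2, 250.0), (3, 300.0);
""")

# Star schema: one join from the fact table to the dimension.
star_rows = conn.execute("""
    SELECT i.supplier_type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item_star AS i ON i.item_key = s.item_key
    GROUP BY i.supplier_type
    ORDER BY i.supplier_type
""").fetchall()

# Snowflake schema: the same question needs a second join through supplier.
snow_rows = conn.execute("""
    SELECT p.supplier_type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item_snow AS i ON i.item_key = s.item_key
    JOIN supplier AS p ON p.supplier_key = i.supplier_key
    GROUP BY p.supplier_type
    ORDER BY p.supplier_type
""").fetchall()

print(star_rows)
print(snow_rows)
```

Both queries return the same totals; the snowflake version simply travels through one more table to reach the attribute.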
1. Cube definition
Syntax: define cube <cube_name> [<dimension_list>] : <measure_list>
2. Dimension definition
Syntax: define dimension <dimension_name> as (<attribute_or_dimension_list>)
define cube sales_star [time, item, branch, location] :
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
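To make the measure list concrete, the sketch below computes dollars_sold = sum(sales_in_dollars) and units_sold = count(*) in plain Python for any chosen set of dimensions. It is an illustration of what the definition means, not an implementation of the definition language; the sample rows and the helper build_cuboid are assumptions.

```python
from collections import defaultdict

# Hypothetical base rows; sales_in_dollars is the column named in the definition.
rows = [
    {"time": "Q1", "item": "TV", "branch": "B1", "location": "Pune", "sales_in_dollars": 400.0},
    {"time": "Q1", "item": "TV", "branch": "B1", "location": "Pune", "sales_in_dollars": 400.0},
    {"time": "Q1", "item": "Radio", "branch": "B2", "location": "Mumbai", "sales_in_dollars": 50.0},
    {"time": "Q2", "item": "TV", "branch": "B2", "location": "Mumbai", "sales_in_dollars": 450.0},
]


def build_cuboid(rows, dims):
    """Group rows by the given dimensions and compute both measures per cell."""
    cells = defaultdict(lambda: {"dollars_sold": 0.0, "units_sold": 0})
    for row in rows:
        key = tuple(row[d] for d in dims)
        cells[key]["dollars_sold"] += row["sales_in_dollars"]  # sum(sales_in_dollars)
        cells[key]["units_sold"] += 1                          # count(*)
    return dict(cells)


base = build_cuboid(rows, ("time", "item", "branch", "location"))  # 4-D base cuboid
by_item = build_cuboid(rows, ("item",))                            # a 1-D cuboid
apex = build_cuboid(rows, ())                                      # 0-D apex cuboid

print(by_item)
print(apex)
```

Passing an empty dimension tuple collapses everything into the single apex cell, keyed by the empty tuple.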
define cube sales_snowflake [time, item, branch, location] :
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define cube sales [time, item, branch, location] :
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define cube shipping [time, item, shipper, from_location, to_location] :
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
The time, item and location dimensions of the sales cube are shared with the shipping cube
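Because the two fact tables share the item dimension, measures from both can be reported side by side. One way to do this, sketched below with Python's sqlite3 module, is to aggregate each fact table separately and then join the results on the shared key, so that rows from one fact table do not multiply rows from the other. The tables are reduced to a single shared dimension and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales (item_key INTEGER, dollars_sold REAL);
CREATE TABLE shipping (item_key INTEGER, dollars_cost REAL);

INSERT INTO item VALUES (1, 'TV'), (2, 'Radio');
INSERT INTO sales VALUES (1, 800.0), (1, 400.0), (2, 250.0);
INSERT INTO shipping VALUES (1, 30.0), (2, 10.0), (2, 5.0);
""")

# Aggregate each fact table on its own, then join on the shared dimension key.
rows = conn.execute("""
    SELECT i.item_name, s.dollars_sold, h.dollars_cost
    FROM item AS i
    JOIN (SELECT item_key, SUM(dollars_sold) AS dollars_sold
          FROM sales GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(dollars_cost) AS dollars_cost
          FROM shipping GROUP BY item_key) AS h ON h.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()

for name, sold, cost in rows:
    print(name, sold, cost)
```

Joining the raw sales and shipping rows directly on item_key would pair every sale with every shipment of the same item and inflate both sums.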
Inside a dimension table
- Dimension table key: Primary key that uniquely identifies rows in the table
- Table is wide: It has many columns or attributes
- Fewer records: It has fewer rows than the fact table; a dimension table has rows in the hundreds while the fact table has rows in the millions
- Textual attributes: Attributes are in textual format; they represent textual descriptions of a business dimension
- Attributes not directly related: Some attributes are not related to one another, such as brand and supplier_type in the item dimension table, but both are attributes of item
- Not normalized: For efficient query performance, it is best if the query takes attributes from the dimension table and goes directly to the fact table. If dimension tables are normalized, the query travels through intermediary tables, which affects performance. Dimension tables are flattened out, not normalized, in a star schema, while some dimensions are normalized in a snowflake schema.
- Drilling down, rolling up: The attributes in a dimension table provide the ability to move from higher levels of aggregation to lower levels of detail and back. For example, get the total sales by city and drill down to total sales by zip, or roll up to total sales by state.
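The city, zip and state example can be sketched in a few lines of Python. The location rows, the sales figures and the helper names roll_up and drill_down are hypothetical; the point is only that the dimension attributes decide which rows are grouped together.

```python
from collections import defaultdict

# Hypothetical location dimension: each zip code carries its city and state.
location = {
    "411001": {"city": "Pune", "state": "Maharashtra"},
    "411002": {"city": "Pune", "state": "Maharashtra"},
    "400001": {"city": "Mumbai", "state": "Maharashtra"},
    "560001": {"city": "Bengaluru", "state": "Karnataka"},
}

# Hypothetical total sales at the most detailed level (zip).
sales_by_zip = {"411001": 100.0, "411002": 150.0, "400001": 300.0, "560001": 200.0}


def roll_up(sales_by_zip, location, level):
    """Aggregate zip-level sales to a coarser attribute ('city' or 'state')."""
    totals = defaultdict(float)
    for zip_code, amount in sales_by_zip.items():
        totals[location[zip_code][level]] += amount
    return dict(totals)


def drill_down(sales_by_zip, location, city):
    """Return the zip-level detail behind one city's total."""
    return {z: a for z, a in sales_by_zip.items() if location[z]["city"] == city}


by_city = roll_up(sales_by_zip, location, "city")
by_state = roll_up(sales_by_zip, location, "state")
pune_detail = drill_down(sales_by_zip, location, "Pune")

print(by_city)
print(by_state)
print(pune_detail)
```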
- Multiple hierarchies: Dimension tables often provide for multiple hierarchies, so that drilling down may be performed along any of them. The marketing department may classify items by its own item types, while the accounting department may group items by a different set of item types. In this case the item table will contain both attributes, marketing_item_type and accounting_item_type.
Attribute Hierarchy
The attributes may be related by a total order or in some cases a partial order may exist between the attributes of a dimension
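[Figure: two attribute hierarchies]
Location (total order): street < city < state < country
Time (partial order): day < month < quarter < year, and day < week < year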
Some attribute hierarchies common to many applications are predefined (e.g., time), while others are user-defined.
Attribute hierarchies may be provided by domain experts or generated automatically by statistical analysis of the data distribution.
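A partial order differs from a total order in that some pairs of attributes cannot be compared. The sketch below models the time hierarchy as a small directed graph, assuming the usual edges day to month, month to quarter, quarter to year, day to week, and week to year, and checks whether one attribute can be rolled up to another. The names ROLLS_UP_TO and can_roll_up are illustrative.

```python
# Edges point from a finer attribute to the coarser attributes it rolls up to.
# Assumed edges for the time dimension: week and month are not comparable.
ROLLS_UP_TO = {
    "day": {"month", "week"},
    "month": {"quarter"},
    "quarter": {"year"},
    "week": {"year"},
    "year": set(),
}


def can_roll_up(hierarchy, finer, coarser):
    """True if `coarser` is reachable from `finer` by following roll-up edges."""
    stack, seen = [finer], set()
    while stack:
        node = stack.pop()
        if node == coarser:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(hierarchy[node])
    return False


print(can_roll_up(ROLLS_UP_TO, "day", "year"))    # reachable via month or via week
print(can_roll_up(ROLLS_UP_TO, "week", "month"))  # weeks do not nest inside months
```

In a total order such as street, city, state, country, every pair of attributes would be comparable in one direction or the other.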
Inside the fact table
- Concatenated keys: A row in the fact table relates to a combination of rows from all the dimension tables. The primary key of the fact table is the concatenation of the primary keys of all the dimension tables.
- Data grain: The data grain is the level of detail of the measurements or metrics. The quantity sold of a particular item on a given date to a given customer (if a customer dimension is also added) is at a very detailed level. If the quantity sold is instead recorded per item per month, the data grain is at a higher level.
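The concatenated key can be expressed as a composite primary key, which then allows at most one fact row per combination of dimension keys. A minimal sketch with Python's sqlite3 module follows; the key values and measures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
        units_sold INTEGER, dollars_sold REAL,
        -- The fact table key is the concatenation of the dimension keys.
        PRIMARY KEY (time_key, item_key, branch_key, location_key)
    )
""")
conn.execute("INSERT INTO sales VALUES (20120430, 1, 7, 10, 2, 800.0)")
conn.execute("INSERT INTO sales VALUES (20120430, 2, 7, 10, 5, 250.0)")

# A second row for the same combination of dimension keys violates the key.
try:
    conn.execute("INSERT INTO sales VALUES (20120430, 1, 7, 10, 1, 400.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(duplicate_rejected, row_count)
```

The choice of key columns also fixes the grain: adding a customer key to the primary key would make each row more detailed.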
- Fully additive measures: The values can be summed up by simple addition
- Semi-additive measures: Derived attributes such as dollars earned per quantity are not additive. Distinguish semi-additive measures from fully additive measures when performing aggregation in queries
- Table deep, not wide: The fact table has fewer attributes than a dimension table but a large number of rows. It is narrow but deep: the fact table is spread vertically while a dimension table is spread horizontally
- Sparse data: There can be combinations of dimension table attributes for which the fact table entry is null
- Degenerate dimensions: Attributes such as sales_order_no that appear in the fact table although they are not measures. They are useful for analysis such as the average number of items per order
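The difference between additive and derived measures shows up as soon as rows are aggregated. In this Python sketch, dollars_sold and units_sold are summed directly, while dollars per unit is recomputed from the two sums rather than added up row by row. The fact rows are hypothetical.

```python
# Hypothetical fact rows with two additive measures.
fact_rows = [
    {"item": "TV", "dollars_sold": 800.0, "units_sold": 2},
    {"item": "TV", "dollars_sold": 300.0, "units_sold": 1},
    {"item": "Radio", "dollars_sold": 100.0, "units_sold": 4},
]

# Fully additive measures: simple addition across rows is meaningful.
total_dollars = sum(r["dollars_sold"] for r in fact_rows)
total_units = sum(r["units_sold"] for r in fact_rows)

# Derived ratio per row: dollars earned per unit.
per_row_ratio = [r["dollars_sold"] / r["units_sold"] for r in fact_rows]

# Adding the ratios gives a number with no business meaning.
naive_sum_of_ratios = sum(per_row_ratio)

# The aggregate ratio must be recomputed from the additive totals.
overall_dollars_per_unit = total_dollars / total_units

print(total_dollars, total_units)
print(naive_sum_of_ratios, overall_dollars_per_unit)
```

A common design consequence is to store the additive components in the fact table and derive the ratio at query time.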