Why integrated?
A data warehouse is usually constructed by integrating multiple heterogeneous data sources.
Why time-variant?
Data are stored to provide information from a historical perspective (5-10 years in a data warehouse vs. 30-90 days in operational databases).
Why nonvolatile?
Updates of data do not occur frequently in a data warehouse (so no transaction processing, recovery, or concurrency control mechanisms are needed).
OLTP vs. OLAP

                      OLTP (operational DB)      OLAP (data warehouse)
  function            day-to-day operations      data analysis and decision making
  unit of work        short, simple transaction  complex query
  # records accessed  tens                       millions
  # users             thousands                  hundreds
  DB size             100MB-GB                   100GB-TB
  metric              transaction throughput     query throughput, response time
A data cube with dimensions time, item, location, and supplier forms a lattice of cuboids:
  0-D (apex) cuboid: all
  1-D cuboids: time; item; location; supplier
  2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
  3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
  4-D (base) cuboid: (time, item, location, supplier)
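The lattice of cuboids can be enumerated mechanically, since each cuboid is just a subset of the dimensions. A minimal Python sketch (the helper name `cuboids` is illustrative):

```python
from itertools import combinations

def cuboids(dimensions):
    """All 2^n cuboids of an n-D cube: every subset of dimensions,
    from the 0-D apex cuboid () to the n-D base cuboid."""
    n = len(dimensions)
    return [c for k in range(n + 1) for c in combinations(dimensions, k)]

lattice = cuboids(["time", "item", "location", "supplier"])
# len(lattice) == 16, i.e. 2^4 cuboids
```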
Concept hierarchies for the location dimension, e.g.:
  city < province_or_state < country
  street < city < country < continent
Concept hierarchy for the time dimension (levels Year, Quarter, Month, Week, Day):
  day < {month < quarter; week} < year
There may be more than one concept hierarchy for a given dimension, based on different user viewpoints
Typical OLAP operations include roll-up, drill-down, slice and dice, and pivot (rotate).
Star schema
A fact table in the middle connected to a set of dimension tables:
  Sales fact table: time_key, item_key, branch_key, location_key; measures: dollars_sold, units_sold
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_type, supplier_name
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city, province_or_state, country
Snowflake schema
A variant of the star schema where some dimension tables are normalized into a set of smaller dimension tables (saves storage, but sacrifices query efficiency because of the extra joins):
  item dimension: item_key, item_name, brand, type, supplier_key
  supplier dimension: supplier_key, supplier_type, supplier_name
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city_key
  city dimension: city_key, city, province_or_state, country
The Sales fact table and its measures are unchanged.
A compromise between the star schema and the snowflake schema is to adopt a mixed schema where only the very large dimension tables are normalized:
  Sales fact table: time_key, item_key, branch_key, location_key, shipper_key; measures: dollars_sold, units_sold
  item dimension: item_key, item_name, brand, type, supplier_key
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city, province_or_state, country
  shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Measures (I)
A measure is a numerical function that can be evaluated at each point in the data space
Measures (II)
Three categories of measures:
  Distributive: obtained by applying a distributive aggregate function, i.e. the measure can be computed by merging partial results computed on partitions of the data
    e.g. count(), sum(), min(), max()
  Algebraic: computed by an algebraic function with a bounded number of arguments, each obtained by a distributive aggregate function
    e.g. avg() = sum()/count(), standard_deviation()
  Holistic: there is no constant bound on the storage needed to describe a subaggregate
    e.g. median(), mode(), rank()
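The distinction matters for cube computation: a distributive measure can be assembled from per-partition partial aggregates without revisiting the raw data. A small illustrative Python sketch (the data values are made up):

```python
# Three partitions of the same fact data (illustrative values).
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]

# Per-partition partial aggregates for the distributive measures.
partials = [(sum(p), len(p), min(p), max(p)) for p in partitions]

# Distributive: merge partials without touching the raw data.
total_sum = sum(s for s, _, _, _ in partials)    # sum of sums
total_cnt = sum(c for _, c, _, _ in partials)    # count of counts
overall_min = min(m for _, _, m, _ in partials)  # min of mins
overall_max = max(m for _, _, _, m in partials)  # max of maxs

# avg() is algebraic: derived from two distributive measures.
overall_avg = total_sum / total_cnt

# A holistic measure such as median() cannot be merged from
# per-partition medians; it needs the complete data.
```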
[Figure: data warehouse architecture — data sources and operational DBs feed the data warehouse and data marts, which in turn serve analysis tools]
Why precomputation?
OLAP engines demand that decision support queries be answered in the order of seconds
Precomputation of a data cube leads to fast response time and avoids some redundant computation
A data cube can be viewed as a lattice of cuboids.
How many cuboids are there in an n-D data cube?
  If there are no hierarchies: T = 2^n
  If each dimension i is associated with L_i levels of hierarchy: T = ∏_{i=1}^{n} (L_i + 1), where the extra 1 accounts for the virtual top level "all"
Full materialization gives the fastest response but requires huge storage; partial materialization is a trade-off between storage space and response time.
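The two counting formulas above can be checked with a few lines of Python (the function name `num_cuboids` is illustrative):

```python
from math import prod

def num_cuboids(levels):
    """T = (L_1 + 1) * ... * (L_n + 1), where levels[i] is the number
    of hierarchy levels L_i of dimension i; the +1 accounts for the
    virtual top level 'all'."""
    return prod(L + 1 for L in levels)

# With no hierarchies every dimension has a single level, so the
# formula reduces to 2^n:
flat = num_cuboids([1, 1, 1, 1])   # 2^4 = 16

# Four dimensions with 4, 4, 3, and 2 hierarchy levels:
deep = num_cuboids([4, 4, 3, 2])   # 5 * 5 * 4 * 3 = 300
```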
[Figure: a 3-D array with dimensions A, B, and C, whose sizes are 40, 400, and 4000 respectively; each dimension is partitioned into four ranges (a0-a3, b0-b3, c0-c3), splitting the array into 64 chunks numbered 1-64]
Fact table:
  RecID  Cust  Region   Type
  1      C1    Asia     Retail
  2      C2    Europe   Dealer
  3      C3    Asia     Dealer
  4      C4    America  Retail
  5      C5    Europe   Dealer

Index on Region:
  RecID  Asia  Europe  America
  1      1     0       0
  2      0     1       0
  3      1     0       0
  4      0     0       1
  5      0     1       0

Index on Type:
  RecID  Retail  Dealer
  1      1       0
  2      0       1
  3      0       1
  4      1       0
  5      0       1
Comparison, join, and aggregation operations are reduced to bit-arithmetic operations; there is a significant reduction in space and I/O, since a string of characters can be represented by a single bit.
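The bit-arithmetic point can be illustrated with a toy bitmap index in Python, using integers as bit vectors (table contents taken from the example above):

```python
records = [("C1", "Asia", "Retail"),
           ("C2", "Europe", "Dealer"),
           ("C3", "Asia", "Dealer"),
           ("C4", "America", "Retail"),
           ("C5", "Europe", "Dealer")]

def build_bitmap(rows, col):
    """One bit vector (stored as a Python int) per distinct column
    value; bit i is set when record i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[col]] = bitmaps.get(row[col], 0) | (1 << i)
    return bitmaps

region = build_bitmap(records, 1)   # e.g. region["Asia"] == 0b00101
rtype = build_bitmap(records, 2)

# The selection "Region = Asia AND Type = Dealer" is one bit-AND:
hits = region["Asia"] & rtype["Dealer"]
matching = [records[i][0] for i in range(len(records)) if (hits >> i) & 1]
# matching == ["C3"]
```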
In the star schema model of data warehouses, a join index relates the values of the dimensions to rows in the fact table.
  e.g. for a fact table sales and two dimensions location and item, a join index on location maintains, for each distinct street, a list of T-ids of the tuples recording the sales in that street
  Join indices can span multiple dimensions
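A join index can be sketched as a mapping from dimension values to fact-table T-ids; the table contents below are invented for illustration:

```python
# location dimension: location_key -> street (illustrative rows)
location = {1: "3511 Main St.", 2: "125 Austin Ave."}

# sales fact table as (T-id, location_key) pairs (illustrative rows)
sales = [(101, 1), (102, 2), (103, 1), (104, 1)]

# Join index on location: each distinct street maps to the list of
# T-ids of the sales tuples recorded in that street.
join_index = {}
for tid, loc_key in sales:
    join_index.setdefault(location[loc_key], []).append(tid)

# join_index == {"3511 Main St.": [101, 103, 104],
#                "125 Austin Ave.": [102]}
```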
Metadata
Metadata are data about data
In a data warehouse, metadata are the data that define warehouse objects.
Function
  OLAP functions are essentially for user-driven data summary and comparison; data mining covers a much broader spectrum of functions
Data
  OLAP is used to analyze data in a data warehouse; data mining is not confined to analyzing data in a data warehouse
An OLAM architecture (four layers, top to bottom):
  Layer 4 (User Interface): user GUI API; accepts mining queries and returns mining results
  Layer 3 (OLAP/OLAM): an OLAM engine and an OLAP engine, both operating on data cubes through a Data Cube API
  Layer 2 (Multidimensional database): data cubes together with metadata
  Layer 1 (Data repository): databases and the data warehouse, accessed through a Database API with filtering; data cleaning and integration populate the warehouse
Concept description
  Characterization: provides a concise and succinct summarization of the given collection of data
  Comparison (Discrimination): provides descriptions comparing two or more collections of data
  The simplest kind of descriptive data mining; can also be regarded as extended OLAP
In its initial proposal, attribute-oriented induction (AOI) was a relational-database, query-oriented, generalization-based, online data analysis technique; now data cubes and offline precomputation can also be used.
Attribute-oriented induction
General idea:
  1. Collect the task-relevant data
  2. Perform generalization by attribute removal or attribute generalization
  3. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Sketch of AOI
Data focusing
the specification of task-relevant data, resulting in the initial working relation
Data generalization
attribute removal
if there is a large set of distinct values for an attribute, but either (1) there is no generalization operator on the attribute, or (2) its higher level concepts are expressed in terms of other attributes
attribute generalization
if there is a large set of distinct values for an attribute, and there exists a set of generalization operators on the attribute
Presentation
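The generalization and aggregation steps above can be sketched in a few lines of Python; the data, the street-to-city concept hierarchy, and the per-attribute decisions are invented for illustration:

```python
from collections import Counter

# Street -> city concept hierarchy (illustrative).
city_of = {"3511 Main St., Richmond": "Richmond",
           "345 1st Ave., Richmond": "Richmond"}

# Per-attribute decision: None means attribute removal, a function
# means attribute generalization (identity = retain as-is).
generalizers = {
    "name": None,                       # removed
    "gender": lambda v: v,              # retained
    "residence": lambda v: city_of[v],  # generalized street -> city
}

def aoi(rows, attrs):
    """Generalize each tuple, then merge identical generalized tuples
    while accumulating their counts."""
    counts = Counter()
    for row in rows:
        gen = tuple(generalizers[a](row[a]) for a in attrs
                    if generalizers[a] is not None)
        counts[gen] += 1
    return counts

rows = [{"name": "Jim", "gender": "M",
         "residence": "3511 Main St., Richmond"},
        {"name": "Scott", "gender": "M",
         "residence": "345 1st Ave., Richmond"}]
result = aoi(rows, ["name", "gender", "residence"])
# result == Counter({("M", "Richmond"): 2})
```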
Example of AOI

Initial working relation (three tuples shown):
  name            gender  major    birth-place             birth_date  residence                 phone #   gpa
  Jim Woodman     M       CS       Vancouver, BC, Canada   8-12-88     3511 Main St., Richmond   687-4598  3.67
  Scott Lachance  M       CS       Montreal, Que, Canada   28-7-89     345 1st Ave., Richmond    253-9106  3.70
  Laura Lee       F       Physics  Seattle, WA, USA        25-8-84     125 Austin Ave., Burnaby  420-5232  3.83

Generalization plan: name removed; gender retained; major generalized to {Sci, Eng, Business}; birth-place generalized to country; birth_date generalized to age_range; residence generalized to city; phone # removed; gpa generalized to a category such as Excellent

Prime generalized relation (fragment):
  gender  count
  M       16
  F       22
A logic rule that is associated with quantitative information is called a quantitative rule.
Quantitative characteristic rule (general form, with t-weights):
  ∀X, target_class(X) ⇒ condition_1(X) [t: w_1] ∨ ... ∨ condition_n(X) [t: w_n]
Example of a quantitative discriminant rule (with a d-weight):
  ∀X, graduate_student(X) ⇐ birth_country(X) = "Canada" ∧ age_range(X) = "25-30" ∧ gpa(X) = "good" [d: 30%]
Discriminant rule
  expresses a sufficient condition of the target class
  condport_1 × d_weight_1 + ... + condport_n × d_weight_n = tclass_port
  where condport_i is the portion of the tuples covered by the i-th antecedent of the rule, and tclass_port is the portion of the tuples belonging to the target class
Example: a crosstab of item sales for the target class Europe and the contrasting class North_America:

  location \ item  |  TV                      |  computer                |  both_items
                   |  count  t-wt     d-wt    |  count  t-wt     d-wt    |  count  t-wt   d-wt
  Europe           |  80     25%      40%     |  240    75%      30%     |  320    100%   32%
  North_America    |  120    17.65%   60%     |  560    82.35%   70%     |  680    100%   68%
  both_regions     |  200    20%      100%    |  800    80%      100%    |  1000   100%   100%

The corresponding quantitative characteristic rule:
  ∀X, Europe(X) ⇒ (item(X) = "TV") [t: 25%] ∨ (item(X) = "computer") [t: 75%]
Incremental mining
  Revision based on newly added data ΔDB:
    generalize ΔDB to the same level of abstraction as the generalized relation R to derive ΔR
    take the union R ∪ ΔR, i.e. merge counts and other statistical information, to produce a new relation R'
  The same idea can also be applied to parallel or distributed mining
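Because counts are distributive, the merge step is plain count accumulation; a minimal sketch with invented counts:

```python
from collections import Counter

# Generalized relation R and the increment dR derived from newly
# added data, keyed by generalized tuple (invented counts).
R  = Counter({("M", "Canada"): 16, ("F", "Canada"): 22})
dR = Counter({("M", "Canada"): 3, ("F", "USA"): 1})

# Union with count accumulation yields the revised relation R'.
R_new = R + dR
# R_new[("M", "Canada")] == 19; R_new[("F", "USA")] == 1
```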