Data Warehousing Notes

Dimensional Modeling Concept
Dimensional modeling is a logical design technique that seeks to present the data in a standard, intuitive
framework that allows for high-performance access. It is inherently dimensional, and it adheres to a
discipline that uses the relational model with some important restrictions. Every dimensional model is
composed of one table with a multi-part key, called the fact table, and a set of smaller tables called
dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the
components of the multi-part key in the fact table. (See Figure) This characteristic 'star-like' structure is often
called a star join.
A fact table, because it has a multi-part primary key made up of two or more foreign keys, always
expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical
measures, or 'facts,' that occur for the combination of keys that define each record. In Figure, the facts are
Units_Sold, Dollars_Sold, and Avg_sales. The most useful facts in a fact table are numeric and additive.
Additivity is crucial because data warehouse applications almost never retrieve a single fact table record;
rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only
useful thing to do with so many records is to add them up.
Dimension tables, by contrast, most often contain descriptive textual information: the attributes (also
called classification attributes) that are used for analysis. Dimension attributes are the source of
most of the interesting constraints in data warehouse queries, and they are virtually always the source of the
row headers in the SQL answer set.
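The idea that facts are added up while dimension attributes supply the row headers can be sketched in a few lines of Python. This is only an illustration under assumptions: the tables, keys, and values below are hypothetical and do not come from the figure.

```python
from collections import defaultdict

# Hypothetical dimension tables: single-part primary key -> descriptive attributes.
product_dim = {1: {"name": "Desk", "category": "Furniture"},
               2: {"name": "Lamp", "category": "Lighting"}}
store_dim = {10: {"city": "NY"}, 20: {"city": "Boston"}}

# Hypothetical fact table: each row carries the multi-part key
# (product_key, store_key) plus additive numeric facts.
sales_fact = [
    {"product_key": 1, "store_key": 10, "units_sold": 5, "dollars_sold": 500.0},
    {"product_key": 1, "store_key": 20, "units_sold": 3, "dollars_sold": 300.0},
    {"product_key": 2, "store_key": 10, "units_sold": 4, "dollars_sold": 80.0},
]

def total_by(facts, dim, key_name, attribute, fact_name):
    """Sum one additive fact, using a dimension attribute as the row header."""
    totals = defaultdict(float)
    for row in facts:
        label = dim[row[key_name]][attribute]
        totals[label] += row[fact_name]
    return dict(totals)

by_category = total_by(sales_fact, product_dim, "product_key", "category", "dollars_sold")
by_city = total_by(sales_fact, store_dim, "store_key", "city", "units_sold")
```

The same function works for any dimension, which reflects the point that every dimension is an equally valid way into the fact table.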
Fact Table and Dimension Tables in a Dimensional Model Schema

Let's consider a data warehouse cube. This cube has four dimensions and two measures. This means that
for every combination of values of these four dimensions there will be two measure values at that coordinate. For example:
Co-ordinate [City (X), Product (Y), Channel (Z), Month] = [Sales (Quantity), Sales (Value)]
or [NY, Standard Desk-top, Mail, September 2005] = [2000 units, $15000]
In the dimensional modeling schema, the FACT table contains the value of coordinates against the lowest
granularity of all the possible combinations of dimensions. The dimension tables contain the details of the
dimensions, which include the attributes of dimensions including all the higher-level hierarchies. The link
between the fact table and all the associated dimension tables is through a dimension key, which is the
lowest level granularity primary key of the dimension tables.
Fact Table- The central linkage in Dimensional Modeling
A fact table contains the value of all the measures linked to the set of dimensions linked to the FACT
table. It contains the measure values for the combination of lowest level of granularity of dimensions. The
measures are typically numeric, which can undergo mathematical aggregation and analysis.
Families of FACT Tables
● Chains and Circles.
● Heterogeneous products.
● Transactions and snapshots.
● Aggregates
Dimension Table- What does and should it contain

The dimension table contains all the information on the dimension. This includes:
a. The primary key (Equivalent foreign key in the Fact Table).
b. All attributes of the dimension. These include:
● The hierarchy attributes- Consider a business hierarchy for a location dimension: pin-code to city
to district to state to country. Each hierarchy element will be an
attribute.
● Textual as well as the code attributes- Location code as well as the name of the location. Both are
required because different users could use them for different reasons. A power user could
be looking for the location code (NY01), whereas an end user could be looking for a more explicit header
(New York).
● Include all parallel hierarchies – A product could have different hierarchies, depending upon
whether the CFO or the Head of Sales is looking at it. This enables analysis to be done on all hierarchies as well as across
hierarchies.
● Surrogate primary key as the link to the FACT table, in place of the production primary key– Surrogate keys are
used because the production keys could change or could be reused. For example, a bill number
could be reused after 5 years, or a part number (especially in FMCG) could be reused after a few
years.
● Production or source system key- This is required for auditability and for the link to the extraction
data and source systems.
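The reason for surrogate keys can be illustrated with a small sketch. The class name, the `version` argument, and the bill number are hypothetical; a real warehouse would usually generate these keys in the database or the ETL tool.

```python
import itertools

class SurrogateKeyMap:
    """Assigns warehouse surrogate keys to (production key, version) pairs."""

    def __init__(self):
        self._next = itertools.count(1)   # simple integer sequence
        self._keys = {}                   # (production_key, version) -> surrogate key

    def key_for(self, production_key, version=0):
        ident = (production_key, version)
        if ident not in self._keys:
            self._keys[ident] = next(self._next)
        return self._keys[ident]

keys = SurrogateKeyMap()
first = keys.key_for("BILL-0001")
again = keys.key_for("BILL-0001")               # same business entity, same key
reused = keys.key_for("BILL-0001", version=1)   # the number was reused years later
```

Because the fact table stores only the surrogate key, a reused bill number gets a new key and old facts stay attached to the old entity. The production key is kept as an ordinary attribute for audit purposes.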
Dimensional Model Schemas- Star, Snow-Flake and Constellation

A dimensional model can be organized as a star schema or a snowflake schema.
Dimensional Model Star Schema using Star Query

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram
resembles a star, with points radiating from a central table. The center of the star consists of a large fact table and the points of the star are the
dimension tables.
A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of
much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary
key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates
efficient execution plans for them.
A typical fact table contains keys and measures. For example, in the sample schema, the fact table, sales, contain the measu
amount, and average, and the keys time_key, item-key, branch_key, and location_key. The dimension tables are time, branch, item
A star join is a primary key to foreign key join of the dimension tables to a fact table.
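A star query of this shape can be tried out with Python's built-in sqlite3 module. The sketch below is a minimal illustration, not the sample schema itself: it keeps only the sales fact table with the item and branch dimensions, and the attribute columns (item_name, branch_name) and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item   (item_key   INTEGER PRIMARY KEY, item_name   TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE sales (
    item_key   INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    amount     REAL
);
INSERT INTO item   VALUES (1, 'desk'), (2, 'lamp');
INSERT INTO branch VALUES (1, 'north'), (2, 'south');
INSERT INTO sales  VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 20.0), (1, 1, 25.0);
""")

# Each dimension joins to the fact table on its key;
# the dimensions are never joined to each other.
star_query = """
SELECT i.item_name, b.branch_name, SUM(s.amount)
FROM sales s
JOIN item   i ON s.item_key   = i.item_key
JOIN branch b ON s.branch_key = b.branch_key
GROUP BY i.item_name, b.branch_name
ORDER BY i.item_name, b.branch_name
"""
rows = conn.execute(star_query).fetchall()
```

The dimension attributes appear as the row headers of the answer set and the additive measure is summed, which is the typical pattern for a star query.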
The main advantages of star schemas are that they:
● Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
● Provide highly optimized performance for typical star queries.
● Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the
schema contains dimension tables
Snow-Flake Schema in Dimensional Modeling

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema
because the diagram of the schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one
large table. For example, a location dimension table in a star schema might be normalized into a location table and city table in a snowflake
schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex
queries and reduced query performance. Figure above presents a graphical representation of a snowflake schema.
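The normalization step can be sketched as follows. The column names (street, city, state, city_key) and the rows are hypothetical; the point is only that repeated city attributes move into their own table and are reached through one more foreign key.

```python
# Star form: city attributes are repeated on every location row.
location_star = [
    {"location_key": 1, "street": "1 Main St", "city": "Albany", "state": "NY"},
    {"location_key": 2, "street": "9 Elm St",  "city": "Albany", "state": "NY"},
    {"location_key": 3, "street": "4 Oak Ave", "city": "Newark", "state": "NJ"},
]

def snowflake(rows):
    """Split the dimension into a city table and a location table that references it."""
    city_keys = {}                         # (city, state) -> city_key
    city_table, location_table = [], []
    for row in rows:
        ident = (row["city"], row["state"])
        if ident not in city_keys:
            city_keys[ident] = len(city_keys) + 1
            city_table.append({"city_key": city_keys[ident],
                               "city": row["city"], "state": row["state"]})
        location_table.append({"location_key": row["location_key"],
                               "street": row["street"],
                               "city_key": city_keys[ident]})
    return location_table, city_table

location_table, city_table = snowflake(location_star)
```

Three location rows now share two city rows, which saves space, but any query that filters on state has to join location to city before it can join to the fact table.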
Fact Constellation Schema

This schema is used mainly for the aggregate fact tables, or where we want to split a fact table for better comprehension. The split is
done only when we want to focus on aggregation over a few facts and dimensions.
Dimensional Modeling vs. Relational Modeling

Dimensional modeling is different from the OLTP normalized modeling to enable analysis and querying through massive a
queries, something that a relational model is ill-equipped to handle.
How is a dimensional model different from an E-R diagram?

● An E-R diagram (used in OLTP or transactional systems) has a highly normalized model (even at a logical level), whereas a dimensional model
aggregates most of the attributes and hierarchies of a dimension into a single entity.
● An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas the Dimensional model has logi
star-schemas.
● The E-R diagram is split as per the entities. A dimension model is split as per the dimensions and facts.
● In an E-R diagram all attributes for an entity, textual as well as numeric, belong to the entity table, whereas a 'dimension' entity in a
dimension model has mostly the textual attributes, and the 'fact' entity has mostly numeric attributes.
Dimensional modeling is a better approach for Data warehouse compared to standard Data Model.
The dimensional model has a number of important data warehouse advantages that the ER model lacks.
The first advantage of the dimensional model is that there are standard types of joins and a standard framework. All dimensions can be thought of as
equal entry points into the fact table. The logical design can be done independent of expected query patterns. The user interfaces are symmetrical,
the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical. In other words,
● You will never find attributes in fact tables and facts in dimension tables.
● If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table
The second advantage of the dimensional model is that it is smoothly extensible to accommodate unexpected new data elements and new design
decisions. First, all existing tables (both fact and dimension) can be changed in place by simply adding new data rows in the table. Data should not
have to be reloaded. Typically, no query tool or reporting tool needs to be reprogrammed to accommodate the change. All old applications continue
to run without yielding different results. You can make the following graceful changes to the design after the data warehouse is up and
running by:
● Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the
grain of the existing fact table.
● Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record.
● Adding new, unanticipated dimensional attributes.
● Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.
The third advantage of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business
world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other
user interfaces. These modeling situations include:
● Slowly changing dimensions, where a 'constant' dimension such as Product or Customer actually evolves slowly and asynchronously.
Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.
● Heterogeneous products, where a business such as a bank needs to:
○ track a number of different lines of business together within a single common set of attributes and facts, but at the same time
○ describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.
Data Warehousing - Dimensions & Measures and Related Concepts
Each data warehouse consists of dimensions and measures. Dimensions allow data analysis from
various perspectives. For example, time dimension could show you the breakdown of sales by
year, quarter, month, day and hour. Product dimension could help you see which products bring in
the most revenue. Supplier dimension could help you choose those business partners who always
deliver their goods on time. Customer dimension could help you pick the strategic set of
consumers to whom you'd like to extend your very special offers.
Measures are numeric representations of a set of facts that have occurred. Examples of measures
include dollars of sales, number of credit hours, store profit percentage, dollars of operating
expenses, number of past-due accounts and so forth.
Additivity of Measures-Facts
Additivity and the correct application of aggregation methods are fundamental to the success of Business
Intelligence. The most common mistakes that modelers and designers make are in setting the right
hierarchies and in establishing the right additivity and aggregation rules. You need to go through the chapter on
business dimensional hierarchies before you go through this chapter. A measure is additive when you
are able to apply the sum operator across all the dimensions. Other aggregations on measures-facts
use operators like Average, Maximum and Minimum.
Non-Additive Measures-Facts
Non-additivity is when you cannot use a sum operator to generate the needed aggregation.
Semi-Additive Measures-Facts
Semi-additivity is when you can aggregate a measure along certain dimensions, but not all
dimensions. Another way to describe semi-additivity is summarization with an index of
inaccuracy.
Additive measures are measures that can be added across all dimensions. For example dollars of
sales can be added across all dimensions within a retail store warehouse.
Semi-additive measures are measures that can be added across some, but not all dimensions. For
example the bank account balance is simply a snapshot in time and cannot be summed over time.
However you could add multiple accounts of the same customer to get the total balance for that
customer.
Non-additive measures are measures that cannot be added across any dimension. Ratios and
percentages, such as store profit percentage, are the usual examples. An inventory level is a
snapshot in time and cannot be summed over time; if products are counted in incompatible units,
it cannot be combined across products either.
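The bank balance example can be made concrete with a short sketch. The snapshot rows and function names are hypothetical; they only show which aggregations are meaningful for a semi-additive measure.

```python
# Hypothetical month-end balance snapshots: (customer, account, month, balance).
balances = [
    ("ann", "checking", "2005-08", 100.0),
    ("ann", "savings",  "2005-08", 400.0),
    ("ann", "checking", "2005-09", 150.0),
    ("ann", "savings",  "2005-09", 450.0),
]

def customer_balance(rows, customer, month):
    """Valid: sum across accounts at a single point in time."""
    return sum(b for c, _, m, b in rows if c == customer and m == month)

def closing_balance(rows, customer):
    """Across time, take the latest snapshot instead of summing."""
    latest = max(m for c, _, m, _ in rows if c == customer)
    return customer_balance(rows, customer, latest)

def average_balance(rows, customer):
    """Across time, an average of the per-month totals is also meaningful."""
    months = sorted({m for c, _, m, _ in rows if c == customer})
    return sum(customer_balance(rows, customer, m) for m in months) / len(months)

naive_total = sum(b for _, _, _, b in balances)   # sums over time: not a real balance
```

Summing every row gives 1100.0, an amount the customer never held. The closing balance (600.0) or the average balance (550.0) is what a report over time should show.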
Hierarchy defines parent-child relationships among various levels within a single dimension. For
instance in a time dimension, year level is parent of four quarters, each of which is a parent of
three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly in a
geography dimension a continent is a parent of countries, country could be a parent of states, and
state could be a parent of cities.
Level is a column within a dimension table that could be used for aggregating data. For example,
product dimension could have levels of product type (beverage), product category (alcoholic
beverage), product class (beer), product name (Miller Lite, Bud Light, Corona, etc.).
Member is a value within a dimension level that can be used for aggregating and reporting data.
For example each product category such as beverage, non-consumable, food, clothing, etc is a
member. Each product class such as beer, wine, coke, bottled water would represent a member.
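Levels and members can be illustrated with a small roll-up. In this sketch each level is a column of a hypothetical product dimension and each distinct value in that column is a member; the product rows and sales figures are made up for the example.

```python
from collections import defaultdict

# Hypothetical product dimension rows; each level of the hierarchy is a column.
product_dim = {
    1: {"type": "beverage", "category": "alcoholic beverage", "class": "beer", "name": "Corona"},
    2: {"type": "beverage", "category": "alcoholic beverage", "class": "wine", "name": "House Red"},
    3: {"type": "beverage", "category": "soft drink", "class": "bottled water", "name": "Spring Water"},
}
sales = [(1, 10.0), (2, 30.0), (3, 5.0), (1, 20.0)]   # (product_key, dollars)

def roll_up(facts, dim, level):
    """Aggregate dollars to the members of one level of the hierarchy."""
    totals = defaultdict(float)
    for key, dollars in facts:
        totals[dim[key][level]] += dollars
    return dict(totals)

by_class = roll_up(sales, product_dim, "class")
by_category = roll_up(sales, product_dim, "category")
by_type = roll_up(sales, product_dim, "type")
```

Moving from class to category to type only changes which column supplies the grouping value, and the grand total stays the same at every level.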
Data Mart is a subset of the data warehouse typically serving a functional area such as marketing
or finance, or particular location of the business (for instance mid-Western division).
Data Warehousing - Fact and Dimension Tables

Data warehouses are built using dimensional data models which consist of fact and dimension tables.
Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. For
example, the time dimension would contain every hour, day, week, month, quarter and year that has
occurred since you started your business operations. Product dimension could contain a name and
description of products you sell, their unit price, color, weight and other attributes as applicable.
Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally dimensions
can grow fairly large, however. For example, a large credit card company could have a customer dimension
with millions of rows. Dimension table structure is typically very lean; for example, a customer dimension could
look like the following:
Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country
Although there might be other attributes that you store in the relational database, data warehouses
might not need all of those attributes. For example, customer telephone numbers, email addresses
and other contact information would not be necessary for the warehouse. Keep in mind that data
warehouses are used to make strategic decisions by analyzing trends. They are not meant to be tools
for daily business operations. On the other hand, you might have some reports that do include
data elements that aren't necessary for data analysis.
Most data warehouses will have one or multiple time dimensions. Since the warehouse will be
used for finding and examining trends, data analysts will need to know when each fact has
occurred. The most common time dimension is calendar time. However, your business might also
need a fiscal time dimension if your fiscal year does not start on January 1st as the calendar
year does.
Most data warehouses will also contain product or service dimensions since each business
typically operates by offering either products or services to others. Geographically dispersed
businesses are likely to have a location dimension.
Fact tables contain keys to dimension tables as well as measurable facts that data analysts would
want to examine. For example, a store selling automotive parts might have a fact table recording a
sale of each item. The fact table of an educational entity could track credit hours awarded to
students. A bakery could have a fact table that records manufacturing of various baked goods.
Fact tables can grow very large, with millions or even billions of rows. It is important to identify the
lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact
table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues
by month; daily and hourly data might not exist or might not be relevant. On the other hand,
assembly line warehouse analysts might be very concerned with the number of defective goods that
were manufactured each hour. Similarly, a marketing data warehouse might be concerned with the
activity of a consumer group with a specific income level rather than purchases made by each
individual.
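The choice of grain matters because facts can be rolled up from the grain but never broken down below it. A minimal sketch, assuming a hypothetical hourly defect fact table for the assembly line example:

```python
from collections import defaultdict

# Hypothetical fact rows at hourly grain: (day, hour, defective_units).
hourly_defects = [
    ("2005-09-01", 8, 3),
    ("2005-09-01", 9, 1),
    ("2005-09-02", 8, 2),
]

def to_daily(rows):
    """Roll hourly facts up to a coarser daily grain."""
    daily = defaultdict(int)
    for day, _hour, defects in rows:
        daily[day] += defects
    return dict(daily)

daily_defects = to_daily(hourly_defects)
```

If the warehouse had stored only the daily totals, the hourly pattern could not be recovered later, so the grain should be the finest level the analysts will need.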
Data Warehousing - Star and Snowflake Schemas

The foundation of each data warehouse is a relational database built using a dimensional model. A
dimensional model consists of dimension and fact tables and is typically described as star or snowflake
schema.
Star schema resembles a star; one or more fact tables are surrounded by the dimension tables. Dimension
tables aren't normalized - that means even if you have repeating fields such as name or category no extra
table is added to remove the redundancy. For example, in a car dealership scenario you might have a
product dimension that might look like this:
Product_key
Product_category
Product_subcategory
Product_brand
Product_make
Product_model
Product_year
In a relational system such design would be clearly unacceptable because product category (car, van, truck)
can be repeated for multiple vehicles and so could product brand (Toyota, Ford, Nissan), product make
(Camry, Corolla, Maxima) and model (LE, XLE, SE and so forth). So a vehicle table in a relational system is
likely to have foreign keys relating to vehicle category, vehicle brand, vehicle make and vehicle model.
However in the dimensional star schema model you simply list out the names of each vehicle attribute.
Star schema also contains the entire dimension hierarchy within a single table. Dimension hierarchy
provides a way of aggregating data from the lowest to highest levels within a dimension. For example,
Camry LE and Camry XLE sales roll up to Camry make, Toyota brand and cars category. Here is what a star
schema diagram could look like:
File:ASDW3 138.gif
Notice that each dimension table has a primary key. The fact table has foreign keys to each dimension table.
Although a data warehouse does not require creating primary and foreign keys, it is highly recommended to do
so for two reasons:
1. Dimensional models that have primary and foreign keys provide superior performance, especially for
processing Analysis Services cubes.
2. Analysis Services requires creating either physical or logical relationships between fact and dimension
tables. Physical relationships are implemented through primary and foreign keys. Therefore if the keys
exist you save a step when building cubes.
Snowflake schema resembles a snowflake because dimension tables are further normalized or have parent
tables. For example we could extend the product dimension in the dealership warehouse to have a
product_category and product_subcategory tables. Product categories could include trucks, vans, sport
utility vehicles, etc. Product subcategory tables could contain subcategories such as leisure vehicles,
recreational vehicles, luxury vehicles, industrial trucks and so forth. Here is what the snowflake schema
would look like with extended product dimension:
File:ASDW3 139.gif
Snowflake schema generates more joins than a star schema during cube processing, which translates into
longer-running queries. Therefore it is normally recommended to choose the star schema design over the snowflake
schema for optimal performance. Snowflake schema does have the advantage of providing more flexibility,
however. For example, if you were working for an auto parts store chain you might wish to report on car
parts (car doors, hoods, engines) as well as subparts (door knobs, hood covers, timing belts and so forth). In
such cases you could have both part and subpart dimensions; however, some attributes of subparts might
not apply to parts and vice versa. For example, the tread size attribute would apply to a
tire but not to the nuts and bolts that go on the tire. If you wish to aggregate your sales by part you will need to
know which subparts should roll up to each part, as in the following:
Dim_subpart
    subpart_key
    subpart_name
    subpart_SKU
    subpart_size
    subpart_weight
    subpart_color
    part_key
Dim_part
    part_key
    part_name
    part_SKU
With such a design you could create reports that show you a breakdown of your sales by each type of
engine, as well as each part that makes up the engine.
Data Warehousing - Extraction, Transformation and Loading
A data warehouse does not generate any data; instead it is populated from various transactional or
operational data stores. The process of importing and manipulating transactional data into the warehouse is
referred to as Extraction, Transformation and Loading (ETL). SQL Server supplies an excellent ETL tool
known as Data Transformation Services (DTS) in version 2000 and SQL Server Integration Services (SSIS)
in version 2005.
ETL resolves the inconsistencies in entity and attribute naming across multiple data sources. For example
the same entity could be called customers, clients, prospects or consumers in various data stores.
Furthermore attributes such as address might be stored as three or more different columns (address line1,
address line2, city, state, county, postal code, country and so forth). Each column can also be abbreviated or
spelled out completely, depending on the data source. Similarly there might be differences in data types, such
as storing date and time as a string, number or date. During the ETL process data is imported from various
sources and is given a common shape.
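Giving records a common shape usually means renaming columns and converting types per source. The sketch below is only an illustration: the source names, column mappings, and accepted date formats are all assumptions.

```python
from datetime import date, datetime

# Hypothetical per-source mappings from source column names to warehouse names.
COLUMN_MAP = {
    "crm":     {"client_name": "customer_name", "signup": "signup_date"},
    "billing": {"consumer": "customer_name", "created": "signup_date"},
}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def parse_date(value):
    """Accept a date, a string, or a number such as 20050901."""
    if isinstance(value, date):
        return value
    text = str(value)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def to_common_shape(source, record):
    """Rename source columns and normalize types for one record."""
    renamed = {COLUMN_MAP[source][k]: v for k, v in record.items()}
    renamed["customer_name"] = renamed["customer_name"].strip()
    renamed["signup_date"] = parse_date(renamed["signup_date"])
    return renamed

from_crm = to_common_shape("crm", {"client_name": " Ann Lee ", "signup": "2005-09-01"})
from_billing = to_common_shape("billing", {"consumer": "Ann Lee", "created": 20050901})
```

Both sources end up with the same column names and a real date value, so later steps do not need to know where a row came from.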
In addition to the changes that you can manage in ETL relatively easily, there are some data inconsistencies
that you might have to fix manually. For example, examine the following data values:
Dr. Jimmy Smith

James L. Smith, Jr.
Jim L Smith, M.D.
James Smith MD
Jim Smith, JR - M.D.
A human eye can easily suspect that all of these values could represent the same person. However, unless
you work with James Smith or his accounts you cannot be certain. Should you show each of these values as
a separate person on your reports? Writing a program that can fix such data inconsistencies could be a
challenge, whereas the data entry clerk who created these values might be able to change them to a single,
correct value with minimal effort.
Data inconsistencies are commonplace in operational data sources that allow free form data entry. A data
warehouse cannot fix problems with poorly designed operational systems, but it is likely to make such issues
known to data analysts and business managers. Even if you design smart ETL logic to correct the existing
issues, predicting all future variations of "Doctor Jim L Smith Junior" is a daunting task. Instead you should
attempt to fix the data entry applications to limit human error.
In addition to importing data from various sources, ETL is also responsible for transforming data into a
dimensional model. Depending on your data sources the import process can be relatively simple or very
complicated. For example, some organizations keep all of their data in a single relational engine, such as
SQL Server. Others could have numerous systems that might not be easily accessible. In some cases you
might have to rely on scanned documents or scrape report screens to get the data for your warehouse. In
such situations you should bring all data into a common staging area first and then transform it into a
dimensional model.
The need for a staging database isn't limited to those warehouses that have inaccessible data sources. A
staging area also provides a good place for assuring that your ETL is working correctly before data is loaded
into dimension and fact tables. So your ETL could be made up of multiple stages:
1. Import data from various data sources into the staging area.
2. Cleanse data from inconsistencies (could be either automated or manual effort).
3. Ensure that row counts of imported data in the staging area match the counts in the original data
source.
4. Load data from the staging area into the dimensional model.
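The four stages above can be sketched as one small function. The callables passed in (extract, source_count, cleanse, load) are hypothetical stand-ins for whatever your ETL tool provides; the sketch assumes cleansing fixes rows without dropping or merging them, so the row counts stay comparable.

```python
def run_etl(extract, source_count, cleanse, load):
    """Run the staged ETL flow for one source and return the number of rows loaded."""
    staged = list(extract())                        # 1. import into the staging area
    cleansed = [cleanse(row) for row in staged]     # 2. cleanse inconsistencies
    expected = source_count()
    if len(cleansed) != expected:                   # 3. verify row counts
        raise ValueError(f"staged {len(cleansed)} rows, source reports {expected}")
    load(cleansed)                                  # 4. load the dimensional model
    return len(cleansed)

# Hypothetical source rows and an in-memory stand-in for the warehouse table.
source = [{"name": " Jim Smith "}, {"name": "Ann Lee"}]
warehouse = []
loaded = run_etl(
    extract=lambda: source,
    source_count=lambda: len(source),
    cleanse=lambda row: {"name": row["name"].strip()},
    load=warehouse.extend,
)
```

Nothing reaches the warehouse unless the count check passes, which is the main benefit of verifying in the staging area first.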
Retrieved from "http://sqlserverpedia.com/wiki/Data_Warehousing_-
_Extraction,_Transformation_and_Loading"
Definitions: Fact table, Dimension table
Fact table
A fact table consists of the measurements, metrics or facts of a business
process. It is often located at the centre of a star schema, surrounded by
dimension tables.
Fact tables provide the (usually) additive values that act as independent
variables by which dimensional attributes are analyzed. Fact tables are often
defined by their grain. The grain of a fact table represents the most atomic level
by which the facts may be defined.
● Additive - Measures that can be added across all dimensions.
● Non Additive - Measures that cannot be added across any dimension.
● Semi Additive - Measures that can be added across some dimensions but not
others.
A fact table might contain either detail level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often instead called
summary tables).
In the real world, it is possible to have a fact table that contains no measures or
facts. These tables are called "Factless Fact tables".
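A factless fact table records only that a combination of dimension keys occurred, so the usual analysis is counting rows. The attendance example below is hypothetical and is not taken from the definition above.

```python
from collections import Counter

# Hypothetical factless fact table: only dimension keys, no measures.
# Each row is (date_key, student_key, class_key).
attendance_fact = [
    (20050901, 1, 100),
    (20050901, 2, 100),
    (20050902, 1, 100),
]

# The "measure" is simply the number of rows per dimension member.
attendance_per_student = Counter(student for _, student, _ in attendance_fact)
attendance_per_day = Counter(day for day, _, _ in attendance_fact)
```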
Dimension table
A dimension table is one of the set of companion tables to a fact table. The fact
table contains business facts or measures and foreign keys which refer to
candidate keys (normally primary keys) in the dimension tables. The dimension
tables contain attributes (or fields) used to constrain and group data when
performing data warehousing queries.
Over time, the attributes of a given row in a dimension table may change. For
example, the shipping address for a company may change. Kimball refers to this
phenomenon as Slowly Changing Dimensions. Strategies for dealing with this
kind of change are divided into three categories:
● Type One - Simply overwrite the old value(s).
● Type Two - Add a new row containing the new value(s), and distinguish between
the rows using Tuple-versioning techniques.
● Type Three - Add a new attribute to the existing row.
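The three strategies can be compared side by side in a short sketch. The row layout (surrogate_key, natural_key, valid_from, valid_to) is one common way to do tuple versioning and is an assumption here, as are the function names and the sample company.

```python
from datetime import date

def scd_type1(rows, natural_key, attr, value):
    """Type One: overwrite in place; the old value is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attr] = value

def scd_type2(rows, natural_key, attr, value, as_of):
    """Type Two: expire the current row and append a new version."""
    current = next(r for r in rows
                   if r["natural_key"] == natural_key and r["valid_to"] is None)
    current["valid_to"] = as_of
    new_row = dict(current, valid_from=as_of, valid_to=None)
    new_row[attr] = value
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in rows) + 1
    rows.append(new_row)

def scd_type3(rows, natural_key, attr, value):
    """Type Three: keep the prior value in an extra attribute on the same row."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attr] = row[attr]
            row[attr] = value

customers = [{"surrogate_key": 1, "natural_key": "ACME", "address": "1 Main St",
              "valid_from": date(2000, 1, 1), "valid_to": None}]
scd_type2(customers, "ACME", "address", "9 Elm St", date(2005, 9, 1))
```

After the Type Two change the dimension holds two rows for the same company. Old facts keep pointing at surrogate key 1 and new facts use key 2, so history is preserved without touching the fact table.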
