About Slowly Changing Dimensions SAS

About Slowly Changing Dimensions
Slowly Changing Dimensions Defined

Slowly changing dimensions (SCD) is the name of a process that loads data into dimension tables. This data changes
slowly, rather than changing on a time-based, regular schedule. The dimension tables are structured so that they
retain a history of changes to their data. This record of data changes provides a basis for analysis.
As shown in the following diagram, dimension tables combine with fact tables to form star schemas. Fact tables store
numeric events. Dimension tables store the detail data that describes the events. Key columns in the tables connect
events to details. For example, a star schema might store product sales numbers in a fact table, and use dimension
tables to store information about customers, suppliers, and retail locations.
You can use SAS Data Integration Studio to load data into star schemas and analyze data to extract knowledge from
the star schema.
[Figure: The Star Schema and SAS Data Integration Studio]

In SAS Data Integration Studio, the process of loading dimension tables takes place in the SCD Type 1 Loader and
SCD Type 2 Loader transformations. Fact tables are loaded with the Lookup transformation.
Types of Slowly Changing Dimensions
The three most common types of slowly changing dimensions are defined as follows:

Type 1 SCD: no history of data changes
overwrites specified columns in dimension tables without retaining a history of changes. Type 1 SCD is useful for
maintaining less-significant columns that are not used in historical analysis. In SAS Data Integration Studio, the SCD
Type 1 Loader transformation performs Type 1 updates. You can use the SCD Type 2 Loader transformation to
combine Type 1 and Type 2 updates in a single operation.

Type 2 SCD: full history of data changes
maintains multiple records for each business key in the dimension table. The latest entry is the current entry for that
business key. Other rows comprise the historical record of data changes. New entries create new current rows. This
comprehensive record of data changes is the primary purpose of the SCD Type 2 Loader transformation.

Type 3 SCD: limited history of data changes
maintains a limited history of changes using multiple columns for selected
variables. For example, a Type 3 dimension table containing customer information has columns named New Postal
Code, Old Postal Code, and Oldest Postal Code. Data is moved from column to column during the loading process.
Type 3 SCD has less analytical value than Type 2 SCD.
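
The column-to-column movement in a Type 3 load can be illustrated with a minimal Python sketch. It is not taken from SAS Data Integration Studio; the function name and the lowercase column names are hypothetical stand-ins for the New, Old, and Oldest Postal Code columns mentioned above.

```python
def shift_postal_code(row, new_code):
    """Type 3 update: move each value one column toward 'oldest'.

    The column names are hypothetical; they mirror the New/Old/Oldest
    Postal Code example in the text.
    """
    if new_code == row["new_postal_code"]:
        return row  # no change, so nothing moves
    return {
        **row,
        "oldest_postal_code": row["old_postal_code"],
        "old_postal_code": row["new_postal_code"],
        "new_postal_code": new_code,
    }


customer = {
    "customer_id": 7,
    "new_postal_code": "27513",
    "old_postal_code": "27601",
    "oldest_postal_code": None,
}
customer = shift_postal_code(customer, "10001")
```

After the call, the previous "oldest" value is gone, which is why Type 3 keeps only a limited history.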

Transformations That Support Slowly Changing Dimensions
SAS Data Integration Studio provides the following transformations that you can use to implement slowly changing
dimensions:

SCD Type 1 Loader inserts new rows, updates existing rows, and generates surrogate key values in a dimension
table without maintaining a history of data changes. Each business key is represented by a single row in the
dimension table.

SCD Type 2 Loader inserts new and Type 2 rows, updates existing rows, and generates surrogate key values in a
dimension table. At the same time, it maintains a full history of data changes. Each business key is represented by a
current row and zero or more closed-out rows. The closed-out rows enable change analysis over time.

Compare Tables detects differences between matching rows in specified columns in two tables. Outputs include
changed, new, unchanged, and missing records tables. These outputs can be used as the basis for performing Type 1
or Type 2 updates in a dimension table.

Lookup loads source data into fact tables and loads foreign keys from dimension tables, with configurable exception
handling. The lookup process accesses dimension tables by using hash objects for optimal performance.

Key Effective Date updates dimension tables based on changes to the business key, when change detection is
unnecessary.

Surrogate Key Generator generates unique key numbers for dimension tables in a manner that is similar to, but less
feature-rich than, the SCD Type 2 Loader transformation. Use the Surrogate Key Generator when key generation is
the sole task that is required at that point in the job.

SCD Project Stages
The process for loading a star schema for slowly changing dimensions follows these general steps:
1. Stage operational data. In this initial step you capture data and validate the quality of that data. Your staging jobs
make use of the Data Validation transformation, along with other data quality transformations and processes.
2. Load dimension tables. Data from the staging area is moved into the dimension tables of the star schema. Dimension
tables are loaded before the fact table in order to generate the primary key values that are needed in the fact table.
3. Load the fact table. In this final step, you run a job that includes the Lookup transformation. This job loads numerical
columns from the staging area into the fact table. Then the Lookup transformation captures foreign key values from
the dimension tables.

About Dimension Tables
About Change Tracking
Dimension tables that are loaded with the SCD Type 2 Loader consist of a surrogate key column, a business key
column, change tracking columns, and any number of detail data columns. The surrogate key column is often loaded
with values that are generated by the transformation. The business keys are supplied in the source data. Both the
business key and the surrogate key can be defined to consist of more than one column, as determined by the
structure of the source data. A surrogate key is typically a system-generated value that contains no semantic
meaning. It is almost always a numeric value that you can use to improve join performance between fact and
dimension tables.
Change tracking columns can consist of begin and end datetime columns, a version number column, or a current-row
indicator column. You can combine tracking methods as needed to optimize your analyses. Using a current-row
indicator column improves the performance of the SCD Type 2 Loader.
Begin and end datetime values specify the period of time in which each row was the current row for that member. The
following diagram shows how data is added to begin and end datetime columns. The begin datetime for the new
current row is one second greater than the end datetime of the former current row. The end value for the current row
is a placeholder future date.

[Figure: Structure of an SCD Dimension Table]

Tracking changes by version number increments a counter when a new row is added. The current row has the highest
version number for that business key. The version number for a new row of an existing business key is that key's
current version number + 1. Tracking changes using a current-row indicator column loads a 1 for the current row and 0s for all of the other rows
that apply to that same member.
The preceding diagram shows a surrogate key column, the values for which are generated by the SCD Type 2
Loader. The generated surrogate key is necessary in order to uniquely identify individual rows in the dimension table.
The generated surrogate key values are loaded into the star schema's fact table as foreign keys, to connect factual or
numerical events to the detail data that describes those events.
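
The three change tracking methods can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the code that the SCD Type 2 Loader generates: the column names, the `add_current_row` helper, and the year-9999 placeholder date are all hypothetical.

```python
from datetime import datetime, timedelta

# Assumed placeholder "future" end value for the current row.
OPEN_END = datetime(9999, 12, 31, 23, 59, 59)


def add_current_row(rows, business_key, begin_dt):
    """Close out the current row for business_key, if any, and append a new one."""
    version = 1
    for row in rows:
        if row["business_key"] == business_key and row["current_flag"] == 1:
            # The old row ends one second before the new row begins.
            row["end_dt"] = begin_dt - timedelta(seconds=1)
            row["current_flag"] = 0
            version = row["version"] + 1
    rows.append({
        "business_key": business_key,
        "begin_dt": begin_dt,
        "end_dt": OPEN_END,
        "version": version,
        "current_flag": 1,
    })
    return rows


history = []
add_current_row(history, "C100", datetime(2024, 1, 1))
add_current_row(history, "C100", datetime(2024, 6, 15, 8, 30))
```

Here `history` ends with two rows for member C100: a closed-out version 1 and a current version 2 whose end value is the placeholder date.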

About Change Detection and Loading for SCD

In jobs that run the SCD Type 2 Loader transformation, the dimension table loading process repeats the following
process for each source row:
1. Compare the business key of the source row to the business keys of all of the current rows in the dimension table. If
no match is found, then the source row represents a new member. The source row is written to the target as the new
current member for that business key. The current member contains the latest information. The loading process
moves to the next source row.
2. If the business key in the source matches a business key in the target, then specified detail data columns are
compared between the matching rows. If no differences in data are detected, then the source row is a duplicate of the
target row. The source row is not loaded into the target as the new current row for that business key. The loading
process moves on to the next source row.
3. If business keys match and data differences are detected in the columns specified for Type 2 SCD, then the source
row represents a new current row for that member. The source row is written to the target, and the previous current
row for that member is closed out. To close out a row, the change tracking column or columns are updated as
specified, depending on the selected method of change tracking. If Type 1 updates are enabled and changes are
detected in the Type 1 columns, the source data overwrites the target data in the current row. The data is overwritten
even when data differences are not detected in the Type 2 columns.
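
The per-row decision described in the steps above can be sketched as follows. This is a simplified Python illustration, assuming an in-memory list of dictionaries and a current-row indicator as the only tracking method; `load_row` and the column names are hypothetical and do not reflect the transformation's generated code.

```python
def load_row(dim, src, key, type2_cols, type1_cols):
    """Apply one source row to the dimension and return the action taken."""
    current = next(
        (r for r in dim if r[key] == src[key] and r["current"] == 1), None
    )
    if current is None:
        dim.append({**src, "current": 1})      # step 1: new member
        return "new member"
    if any(src[c] != current[c] for c in type2_cols):
        current["current"] = 0                 # step 3: close out old row
        dim.append({**src, "current": 1})
        return "type 2 change"
    if any(src[c] != current[c] for c in type1_cols):
        for c in type1_cols:                   # Type 1: overwrite in place
            current[c] = src[c]
        return "type 1 overwrite"
    return "duplicate"                         # step 2: nothing to load


dim = []
actions = [
    load_row(dim, row, "cust_id", ["city"], ["phone"])
    for row in [
        {"cust_id": "C100", "city": "Cary", "phone": "555-0100"},
        {"cust_id": "C100", "city": "Cary", "phone": "555-0100"},
        {"cust_id": "C100", "city": "Raleigh", "phone": "555-0100"},
        {"cust_id": "C100", "city": "Raleigh", "phone": "555-0199"},
    ]
]
```

The four source rows exercise each branch in turn: a new member, a duplicate, a Type 2 change that adds a row, and a Type 1 change that overwrites the current row without adding one.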

About Generated Keys

The SCD Type 2 Loader enables you to generate surrogate key values when you load a dimension table. The
generated surrogate key values replace the business key as the primary key. This is because the business key from
the source table identifies the member, not the unique row in the dimension table.
You can configure a simple surrogate key in the Generated Keys tab of the SCD Type 2 Loader. This surrogate key
increments the highest existing value in a specified column for each new row. You can also use an expression to
generate key values in other increments. To specify a unique starting point for the keys that are generated in each
load, you can specify a lookup column. The initial key value is the highest value in the lookup column.
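
As a rough illustration of the simple surrogate key described above, the following Python sketch starts from the highest existing value and assigns one key per new row. The function name and the `increment` parameter are hypothetical; the `increment` argument stands in for the expression-based option.

```python
def next_surrogate_keys(existing_keys, n_new, increment=1):
    """Return n_new key values that continue from the highest existing key."""
    start = max(existing_keys, default=0)  # an empty table starts from 0
    return [start + increment * i for i in range(1, n_new + 1)]


new_keys = next_surrogate_keys([1, 2, 5], 3)
```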

Note: When loading a fact table instead of a dimension table, you can generate simple surrogate keys using the
Lookup transformation.
In addition to surrogate keys, you can also generate retained keys. Retained keys provide a primary key value that
consists of two columns, the begin datetime change tracking column and a numeric column that receives generated
values. The combination of the two columns uniquely identifies each row in the table.
The generated value is retained because a single generated value is applied to all of the rows that apply to a given
member. When a new row is added to an existing member, it receives the same generated value as the other rows
that apply to that member.
As with surrogate keys, you can generate retained key values using expressions and lookup columns.
In order to generate unique retained keys, begin and end datetime change tracking is required.
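
A minimal Python sketch of retained keys follows. It assumes that a new member receives the highest existing generated value plus 1, which is one plausible scheme rather than the documented behavior; the `add_retained_row` helper and column names are hypothetical.

```python
from datetime import datetime


def add_retained_row(dim, business_key, begin_dt):
    """Append a row and return its two-column key (retained_key, begin_dt)."""
    existing = [r["retained_key"] for r in dim if r["business_key"] == business_key]
    if existing:
        retained = existing[0]  # reuse the value already given to this member
    else:
        # Assumption: new members continue from the highest value so far.
        retained = max((r["retained_key"] for r in dim), default=0) + 1
    dim.append({
        "retained_key": retained,
        "begin_dt": begin_dt,
        "business_key": business_key,
    })
    return (retained, begin_dt)


dim = []
k1 = add_retained_row(dim, "C100", datetime(2024, 1, 1))
k2 = add_retained_row(dim, "C200", datetime(2024, 1, 1))
k3 = add_retained_row(dim, "C100", datetime(2024, 6, 15))
```

Both rows for member C100 share one generated value, and only the begin datetime distinguishes them, which is why begin and end datetime tracking is required for uniqueness.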
To enhance performance, you should create an index for your generated key column. If you identify your generated
key column as the primary key of the table, then the index is created automatically. Surrogate keys should receive a
unique or simple index that consists of one column. Retained keys should receive a complex index that includes the
generated key column and the beginning datetime column.
To create an index, open the Properties dialog box for the table and use the Index and Keys tabs.

About Cross-Reference Tables
During the process of loading an SCD dimension table, the comparison of incoming source rows to the current rows in
the target is facilitated by a cross-reference table. The cross-reference table consists of all of the current rows in the
dimension table, one row for each member. The columns consist of the generated key, the business key, and a digest
column named DIGEST_VALUE.

The digest column is used to detect changes in data between the source row and the target row that has a matching
business key. DIGEST_VALUE is a character column with a length of 32. Each value in this column is an MD5 hash of
the concatenated data columns that were selected for change detection. MD5 is a message-digest (hashing) algorithm
rather than an encryption method; it is described in detail at http://www.faqs.org/rfcs/rfc1321.html.
If a cross-reference table exists and has been identified, it is used and updated. If a cross-reference table has not
been identified, then a new temporary table is created each time you run the job.
To increase performance in large jobs, enable change tracking by current row indicator. This method of change
tracking can be combined with the other change tracking methods (begin and end datetime and version number). The
current row indicator speeds up the process of creating or updating the digest file. The performance improvement is
provided by a WHERE clause that efficiently separates current rows from closed-out rows.
Cross-reference tables are identified on the Options tabs of the following transformations: SCD Type 2 Loader and
Key Effective Date, in the field Cross-Reference Table Name.
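
The role of the cross-reference table can be illustrated with Python's standard `hashlib` module. This is a sketch under assumptions: the row layout, the helper names, and the plain concatenation of column values are illustrative and are not the exact method used by the transformations.

```python
import hashlib


def digest_value(row, columns):
    """Return a 32-character hexadecimal MD5 digest of the selected columns."""
    joined = "".join(str(row[col]) for col in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def build_cross_reference(dim_rows, columns):
    """Map business key -> (generated key, digest) for current rows only."""
    return {
        r["business_key"]: (r["generated_key"], digest_value(r, columns))
        for r in dim_rows
        if r["current_flag"] == 1  # the current-row indicator filters cheaply
    }


def has_changed(xref, source_row, columns):
    """True when the source row's digest differs from the stored digest."""
    _, stored = xref[source_row["business_key"]]
    return digest_value(source_row, columns) != stored


dim = [
    {"generated_key": 1, "business_key": "C100", "city": "Cary",
     "state": "NC", "current_flag": 0},
    {"generated_key": 2, "business_key": "C100", "city": "Raleigh",
     "state": "NC", "current_flag": 1},
]
xref = build_cross_reference(dim, ["city", "state"])
```

Comparing one fixed-length digest per member avoids comparing every tracked column for every source row.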

Two Methods for Generating the Change Digest Column
The SCD Type 2 Loader supports two methods for generating the change digest column (DIGEST_VALUE column) in
a cross-reference table. To specify one of these methods, open the properties window for that transformation and
select Options > SCD Options. Select v1.1 or v2.1 in the Change digest version field.
The v1.1 method is the default. The v2.1 method uses a different method to concatenate the data columns that were
selected for change detection. Try v2.1 if the SCD Type 2 transformation does not detect changes in certain
scenarios.
For example, suppose that two consecutive Type 2 columns with data type char are loaded on day X with the
following values:
col_1="AB" , col_2="C"
The following delta record might contain these values:
col_1="A" , col_2="BC"
The v1.1 method uses string handling functions that would concatenate these values into an intermediate string value
of "ABC" for both the original record and the delta record. As a result, identical DIGEST_VALUEs would be incorrectly
generated for both records, and the change will not be detected. If you encounter this problem, try the v2.1 method,
which uses different string handling functions.
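
The collision in this example can be reproduced in Python. The sketch below shows the general problem with plain concatenation and one common remedy, a separator between values; whether v2.1 works this way is not stated in the text, so the delimiter approach and the function names are assumptions.

```python
import hashlib


def digest_plain(values):
    """MD5 of the values joined with no separator (ambiguous boundaries)."""
    return hashlib.md5("".join(values).encode("utf-8")).hexdigest()


def digest_delimited(values, sep="\x1f"):
    """MD5 of the values joined with a separator (assumed remedy, not v2.1 itself)."""
    return hashlib.md5(sep.join(values).encode("utf-8")).hexdigest()


original = ["AB", "C"]   # col_1="AB", col_2="C"
delta = ["A", "BC"]      # col_1="A",  col_2="BC"

collides = digest_plain(original) == digest_plain(delta)
detected = digest_delimited(original) != digest_delimited(delta)
```

Both records reduce to the string "ABC" without a separator, so `collides` is true, while the delimited digests differ and the change is detected.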

About Type 1 Updates
Type 1 updates are defined as overwrites of existing data in specified columns. When you run a Type 1 update with
the SCD Type 2 Loader transformation, digest values containing the Type 1 columns are created for the source and
target. The digest values are then compared to determine the target rows that need to be updated. When the rows are
updated, the number of writes is optimized.
You can combine Type 2 and Type 1 updates in the same job. Use Type 2 updates to maintain a history of changes
for important columns. Use Type 1 updates to maintain accurate and complete information in your dimension table,
without generating new target rows for each change.

About Fact Tables
Overview
Fact tables are combined with dimension tables to make up star schemas. Fact tables describe events using numeric
data. Dimension tables provide detail data that describe the events. Examples of factual events include the sale of an
item or a transaction in a bank account. Each such event is represented by a single row in a fact table.
The columns in a fact table consist of one or more numeric columns that relate to an event and a series of foreign key
columns that connect the event to the detail data in the dimension tables.

About the Loading of Fact Tables with the Lookup Transformation
To load data into a fact table, use the Lookup transformation in a SAS Data Integration Studio job. The Lookup
transformation generates primary key values, loads numeric fact data from a source table, and loads foreign keys from
dimension tables using a lookup process.
The lookup process runs separately for each dimension table that contributes foreign keys. The process compares
business key values between the source table and a dimension table. If a match is found, an expression (a WHERE
clause) is evaluated to identify the specific dimension table row in that business key. In general, the values that are
loaded from the dimension table are the primary key columns. Loading these foreign keys into the fact table allows
each event to contain references to all of the detail data that describes that event.
If no match is found in a dimension table, or if a value is missing, then the numeric data in the source row is not
loaded into the fact table and the exception condition is processed by the Lookup transformation. Each exception
condition triggers one or more available actions, including the termination of the job, the loading of source data into an
error table, and the loading of information into an exception table.
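
The lookup and exception handling described above can be sketched in Python, with a dictionary standing in for the hash object. The `load_facts` function, the column names, and the single exception list are hypothetical simplifications; the actual Lookup transformation offers several configurable actions.

```python
def load_facts(source_rows, customer_dim):
    """Resolve business keys to surrogate keys; collect rows that fail."""
    # In-memory lookup of current dimension rows, analogous to a hash object.
    lookup = {
        r["customer_id"]: r["customer_sk"]
        for r in customer_dim
        if r["current"] == 1  # expression that selects the row for the key
    }
    facts, exceptions = [], []
    for src in source_rows:
        sk = lookup.get(src.get("customer_id"))
        if sk is None:
            # Missing or unmatched key: do not load the numeric data.
            exceptions.append({"row": src, "reason": "no dimension match"})
            continue
        facts.append({"customer_sk": sk, "amount": src["amount"]})
    return facts, exceptions


customer_dim = [
    {"customer_sk": 1, "customer_id": "C100", "current": 0},
    {"customer_sk": 2, "customer_id": "C100", "current": 1},
]
sales = [
    {"customer_id": "C100", "amount": 25.0},
    {"customer_id": "C999", "amount": 10.0},
    {"customer_id": None, "amount": 5.0},
]
facts, exceptions = load_facts(sales, customer_dim)
```

Only the first sale is loaded, keyed to the current dimension row; the unmatched and missing keys are routed to the exception list instead.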
