Dimensional Modeling
An Overview of Dimensional Modeling by Haitham Salawdeh
Haitham Salawdeh
Page 2
June, 2009
Introduction:
Multi-Dimensional Modeling (DM) stands in contrast to the normalized model (NM) in many ways. First and foremost, a structure modeled dimensionally is easier to navigate and understand. This is a plus, especially when the user of the model is not familiar with database technologies and tools. By contrast, the audience of a normalized model is generally a developer or someone capable of navigating normalized structures effectively. Lastly, the emphasis of a normalized model is on reducing data redundancy and making sure a datum is updated once and in one place. When ease of user navigation and performance are the primary concerns, data normalization becomes an obstacle. DM is generally used in the context of data warehousing, where the data warehouse is loaded once and accessed many times. NM, on the other hand, is well suited for systems where updates, inserts and deletes happen frequently. This overview is meant to introduce the concepts and language of DM. I will not spend a lot of time developing the motivation for DM or warehousing.
Figure 1: A transaction fact table for a bank.
There are times when a star has some of its portions normalized. The design is then said to be snowflaked. An example of a snowflake design is a security dimension table that has a relationship to an industry classification table. To undo the snowflake, the relationships generating it have to be flattened. In this example, the industry assignment would be made part of the security table. Later we will learn about a snowflake that cannot be avoided. Figure 2 illustrates such a situation, where the relationship between accounts and customers cannot be flattened and needs to be normalized.
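As a rough illustration of flattening a snowflake, the sketch below copies the industry attributes onto each security row so that the separate industry table is no longer needed. The table contents, column names and the flatten helper are assumptions made up for this example and do not come from the figures.

```python
# Hypothetical snowflaked tables: each security row references an industry row.
industry = {
    10: {"industry_name": "Banking", "sector": "Financials"},
    20: {"industry_name": "Software", "sector": "Technology"},
}
security_snowflaked = [
    {"security_key": 1, "ticker": "AAA", "industry_key": 10},
    {"security_key": 2, "ticker": "BBB", "industry_key": 20},
]


def flatten(securities, industries):
    """Denormalize: copy the industry attributes onto every security row."""
    flat = []
    for row in securities:
        # Drop the foreign key and pull the referenced attributes in-line.
        merged = {k: v for k, v in row.items() if k != "industry_key"}
        merged.update(industries[row["industry_key"]])
        flat.append(merged)
    return flat


security_dim = flatten(security_snowflaked, industry)
```

After flattening, a user can filter securities by industry or sector without joining to a second table, at the cost of repeating the industry attributes on every security row.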
Figure 2: A simple balance fact of a bank and its dimensions.
It is important to note that when modeling a business process one should reuse what has been modeled previously. Other modeling sessions might have defined dimensions and facts that can be reused. This notion is known as conforming dimensions. When developing a warehouse iteratively this idea becomes paramount. The use of a dimension is conforming if we use an already defined dimension or a subset of it. Conforming dimensions coupled with conforming facts are the foundation of the warehouse bus architecture advocated by Ralph Kimball.
While a transactional snapshot represents a point in time, a periodic snapshot represents a pre-defined interval or period. A daily balance fact table of a bank, as depicted in Figure 2, is an example of a periodic snapshot.
Furthermore, an accumulating snapshot represents business activities over a time period. An accumulating fact could represent the fulfillment process of a mutual fund company. A row in that table will capture a customer's first contact date, then the literature send date and finally the account open date. Because the row is revisited as each milestone occurs, this is one of the few cases in which a fact table is updated in a data warehouse. Otherwise, a fact table row is not updated after it is loaded. Finally, a fact table does not necessarily have to have measures. It could simply be a bridge between dimensions. In that case the fact table is said to be factless.
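The fulfillment example can be sketched as a single row per customer whose milestone dates are filled in over time. This is only an illustration: the column names, the customer key, the dates and the helper functions are assumptions, not part of the original model.

```python
from datetime import date

# Hypothetical accumulating fact: one row per customer, keyed by customer_key.
fulfillment_fact = {}

MILESTONES = ("first_contact_date", "literature_sent_date", "account_open_date")


def record_milestone(customer_key, milestone, when):
    """Update the customer's single row in place as a milestone is reached."""
    if milestone not in MILESTONES:
        raise KeyError(f"unknown milestone: {milestone}")
    row = fulfillment_fact.setdefault(customer_key, dict.fromkeys(MILESTONES))
    row[milestone] = when
    return row


def days_to_open(row):
    """Derived measure: elapsed days from first contact to account opening."""
    return (row["account_open_date"] - row["first_contact_date"]).days


# The same row is revisited three times; no new fact rows are added.
record_milestone(42, "first_contact_date", date(2009, 6, 1))
record_milestone(42, "literature_sent_date", date(2009, 6, 3))
record_milestone(42, "account_open_date", date(2009, 6, 15))
```

Unlike a transactional or periodic snapshot, where each load only appends rows, the three calls above leave exactly one row for the customer.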
Dimension Tables:
The dimensions define a structure around the facts. It is imperative that all dimensions be denormalized to ease the navigation of the model. In some cases dimensions have many-to-many relationships that cannot be flattened. It is also possible that the many-to-many relationship is between a fact and a dimension. It is then appropriate to have a multi-valued dimension to implement the relationship. A multi-valued dimension is simply a bridge table between the entities involved in the many-to-many relationship. Figure 2 shows a multi-valued dimension table depicting account ownership. An account can have multiple relationships to customers, including primary, secondary and custodian. In addition, the number of secondary customers for an account might not be bounded. It is best in such situations to normalize the dimensional model as we have done here.
Furthermore, when attempting to conform dimensions we are sometimes faced with using the same dimension under different names. An example is a transaction fact table that refers to both a settlement date and a transaction date. The two dates should refer to a date dimension that is conforming. A way to do that is to introduce views on top of the date dimension for the settlement date and the transaction date. In this case the date dimension is said to be role-playing. For example, in Figure 1 the TransactionDate dimension can be a view over the Date dimension depicted in Figure 2.
In addition, it is possible to be left with some attributes that do not fit in any of the extracted dimensions and don't group well together. In that case it is not recommended to keep these attributes with the fact table. Instead, they can be pulled out into a Junk Dimension. Figure 3 shows how 2 indicators were grouped into a junk dimension. The TransactionIndicators dimension will have 4 rows, one for each combination of the 2 indicators' possible values. As we add more indicators and multi-valued columns this dimension will grow. At some point we might need to split a junk dimension if the collection of columns is highly unrelated.
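The row count of a junk dimension such as TransactionIndicators follows from enumerating every combination of the indicator values. The sketch below shows this for two two-valued indicators. The indicator names and their Y/N values are hypothetical, since they are not given in the text.

```python
from itertools import product

# Hypothetical indicators; the names and values are assumptions for illustration.
indicators = {
    "is_online": ("Y", "N"),
    "is_reversal": ("Y", "N"),
}


def build_junk_dimension(indicators):
    """Create one dimension row per combination of indicator values."""
    names = list(indicators)
    rows = []
    for key, combo in enumerate(product(*indicators.values()), start=1):
        row = {"transaction_indicators_key": key}
        row.update(zip(names, combo))
        rows.append(row)
    return rows


transaction_indicators = build_junk_dimension(indicators)

# A fact row stores only the surrogate key found through this lookup.
lookup = {
    (r["is_online"], r["is_reversal"]): r["transaction_indicators_key"]
    for r in transaction_indicators
}
```

With two indicators of two values each the dimension has 2 x 2 = 4 rows. Adding a third two-valued indicator would double it to 8, which is why unrelated columns may eventually need to be split into separate junk dimensions.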
Do Nothing:
In the data modeling literature, this is known as a type 1 slowly changing dimension (SCD). In this scenario you simply overwrite the old values with the new ones. History is then lost forever. If there is no requirement to keep history, this might be the approach to take. However, it is important to understand that this approach does come at a cost: all cubes will have to be rebuilt. Otherwise, reports and cubes will have dead paths.
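A type 1 change amounts to an in-place overwrite of the dimension row, as in this minimal sketch. The security row, the placeholder identifier values and the helper name are assumptions for illustration only.

```python
# Hypothetical security dimension keyed by surrogate key.
security_dim = {
    1: {"cusip": "CUSIP-OLD", "name": "Example Corp"},
}
# Existing facts reference the dimension row by its surrogate key.
holding_fact = [{"security_key": 1, "market_value": 1000.0}]


def apply_type1_change(dim, key, **changes):
    """Type 1: overwrite the attributes in place; the old values are not kept."""
    dim[key].update(changes)


apply_type1_change(security_dim, 1, cusip="CUSIP-NEW")
```

Every existing fact row now reports under the new value, and nothing in the dimension records what the old value was.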
for a security if it were to change CUSIP. This allows us to represent history accurately. New holdings will point to the new dimension row and old holdings will continue to point to the old one. The drawback of this approach is that it does not allow us to associate the old facts with the new values. Depending on the requirements, this could be unacceptable.
Hybrid Approach:
I personally like the use of type 2 SCD to capture the change and type 3 to chain-link the dimension rows. In the case of the CUSIP change example earlier, I would create a new security dimension row with the new values and add a new column to the security dimension called previous security. This column will point to the security that just changed. While it is hard for a user to navigate the chain, and while I otherwise refrain from using recursion, I feel this approach meets many requirements. It might require IT or some technical users to assist in predefining navigation paths for less experienced users.
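One way to picture the hybrid approach is a new row per change plus a pointer back to the row it replaced. The sketch below is an assumption-laden illustration: the column name previous_security_key, the placeholder identifier values and both helper functions are invented for this example.

```python
# Hypothetical security dimension with a self-referencing "previous" column.
security_dim = [
    {"security_key": 1, "cusip": "CUSIP-OLD", "name": "Example Corp",
     "previous_security_key": None},
]


def apply_hybrid_change(dim, old_key, **changes):
    """Add a new row for the changed security and link it to the old row."""
    old = next(r for r in dim if r["security_key"] == old_key)
    new_row = dict(old, **changes)
    new_row["security_key"] = max(r["security_key"] for r in dim) + 1
    new_row["previous_security_key"] = old_key
    dim.append(new_row)
    return new_row["security_key"]


def history(dim, key):
    """Walk the chain from the given row back to the original row."""
    by_key = {r["security_key"]: r for r in dim}
    chain = []
    while key is not None:
        chain.append(key)
        key = by_key[key]["previous_security_key"]
    return chain


new_key = apply_hybrid_change(security_dim, 1, cusip="CUSIP-NEW")
```

A helper like history is the kind of predefined navigation path that technical staff could provide, so that less experienced users can relate old facts to the current security without writing the recursion themselves.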
Other considerations:
In some dimensions some fields change much more frequently than others. It is acceptable to break out the fields that change more rapidly from the others. However, the breakup has to be done in a way that makes sense to the business user of the model, who is not aware of the technical motivation for the split.
Steps to a DM:
Now that we have covered the terminology of DM, how do we create a dimensional model? The initial step in creating a DM is to decide which business process(es) to model. As part of that, the business requirements need to be gathered and the available data needs to be understood.
The next step is to decide on the grain of the model. The best and most flexible approach is to go for the lowest available grain. By doing this you give the user of your model a way to summarize data in an ad-hoc manner. It also allows your users to drill from a summary view to supporting details. In addition, keep all the measures of a fact table at the same grain. An example of a grain declaration when modeling holdings for a fund company would state: "We will model account holdings per day." The third step is to define the dimensions. In the above example the dimensions for a holding will include portfolio, holding date and security. Finally, we derive the facts described by the dimensions. The facts in the above example will include holding market value and holding cost basis.
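To illustrate why the lowest grain is the most flexible, the sketch below stores holdings at one row per portfolio, security and holding date, and then rolls the same rows up along different dimensions. All row values and the summarize helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows at the declared grain: portfolio, security, holding date.
holding_fact = [
    {"portfolio": "P1", "security": "AAA", "holding_date": "2009-06-01",
     "market_value": 100.0, "cost_basis": 90.0},
    {"portfolio": "P1", "security": "BBB", "holding_date": "2009-06-01",
     "market_value": 50.0, "cost_basis": 55.0},
    {"portfolio": "P2", "security": "AAA", "holding_date": "2009-06-01",
     "market_value": 200.0, "cost_basis": 180.0},
]


def summarize(facts, by, measure):
    """Sum a measure over any combination of dimensions chosen at query time."""
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[d] for d in by)] += row[measure]
    return dict(totals)


by_portfolio = summarize(holding_fact, ["portfolio"], "market_value")
by_security = summarize(holding_fact, ["security"], "market_value")
```

Because the detail rows are kept, the same table answers both the per-portfolio and the per-security question, and a user can drill from either total back to the rows behind it. Had the table been loaded already summarized by portfolio, the per-security view could not be recovered.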
References:
Ralph Kimball and Margy Ross (2002). The Data Warehouse Toolkit. Second Edition. New York: Wiley Computer Publishing.