Tuesday, February 25, 2003

Data Warehouse Modeling guideline Part 1: Dimension

Here's the first part of my ad-hoc modeling guideline listing general advice as well as more specific ones related to dimension. Please bare with me all this "techie jargon", it is mostly stolen from R. Kimball terminology in dimensional data model. This guideline assumes a good understanding and knowledge on datawarehouse concept and data modeling.


Atomic Data Modeling (in the sense of most granular transactional data)

General

1. Limit the first implementation to a minimum number of star schema and number of source data. Usually a first implementation would involve a single transaction table accompanied with its snapshot state.

2. Avoid the use of generic terms for naming dimension and columns, especially at the start of the implementation process of datamarts.

3. Try to impose as many constraints as possible ("Not Null") for all attributes upfront and not after the creation of the table. Fact measure or Foreign Key should never be equal to Null to facilitate indexing and ETL processing.

4. Avoid the mix of transaction based and snapshot based fact records inside the same fact table. This usually results in a variable time interval fact with two links with timestamp: an Effective date and an End date, the effective is usually triggered by an event while the period defines a state which finish at the End date. This generates a number of limitations such as: 1- difficulty to evolve the model with new dimension, as when we add any changes this affects the previous history of record, 2- time axis does not have a unique definition in reporting tool such as BusinessObjects, 3- query of type EOP which counts number of record having certain status are usually slower and more cumbersome in SQL.



Dimension Entity

1. Junk dimension

1.1. Use junk dimension to gather various types of unrelated and unstructured information that would generate too many mini-dimensions.

1.2. Do not use junk dimension when data is volatile and/or the domain of information covered by these tends to grow.

1.3. Junk dimension can sometimes be used when there is a need to tie multiple dimensions into a large volume star schema that cannot easily accommodate for new dimensions.

1.4. Junk dimension usually needs to be accompanied with database View that make more sense for the business user (can be done at reporting tool level as well)

1.5. Junk dimension can sometimes be used to support a many-to-many relation with the fact. This is only recommended for a dimension having the many-to-many relationship upper bounded with a relatively small number (i.e. the number of columns necessary to have in the dimension is small and static).


2. Big dimension

2.1. Try to split this dimension into smaller dimensions, otherwise the SCD type II is unrealistically implemented.

2.2. (2.1 continued) For example, customer can often be splitted into entities related with: Identity, Account, Demographic/Geographic and Value-based derivable attributes.

2.3. (2.1 continued) The relationship between these new dimensions could be evaluated as fact, i.e. plugged directly into a history fact table (e.g. customer relating to demographic/geographic) or directly linked with snowflakes (usually the link is SCD type I).

2.4. Big dimension often hides multiple smaller dimensions.


3. Interrelated dimensions

3.1. Do not store attributes from external dimension into another dimension (e.g. customer). These usually represent relationship (usually current situation) that should fundamentally be evaluated through a fact.

3.2. (3.1 continued) Storing these relationships into dimension only creates data redundancy and is prone to data inconsistency.

3.3. (3.1 continued) If no history is needed from those relationships and no detail is necessary (i.e. there is no need to evaluate them as fact), then all dimensions should be merged into the most significant one.

3.4. (3.1 continued) Compromises must be taken between the size of the dimension vs the number of smaller dimensions (see Big dimension tips). This recommendation can have exception, especially when relationship between the two dimensions does not change and that the link between the two is difficult to obtain through the fact (e.g. Customer relating to Dealer). For this exception, it may help and improve performance for aggregation building or query restitution to duplicate the link into the most significant dimension (e.g. customer).


4. Date dimension

4.1. Date aggregated dimension (grouping of days) usually encompasses inconsistent time period. These periods should be modeled as independent hierarchy structures. Period not having a fixed number of days (e.g. week period as defined in Oracle) should have a Period_size indicator for averaging.

4.2. (4.1 continued) This offers the advantage of having a single entry point (one dimension to create) for storing all various independent period structures such as week-year, month-quarter-year, holiday-year, etc.


5. Exceptions in dimension

5.1. Dimension-scope exceptions are applicable whenever there is a hierarchy level attribute not applicable for the specifics dimensions member. This attribute must take the exception value.

Martin

No comments: