Tuesday, March 11, 2003

Data Warehouse Modeling guideline Part 3: Aggregate Data Modeling

Here's the third and final part of my guideline... this time I'm covering, very briefly, a complex subject: the aggregation strategy.


General

1. A first level of aggregation should be built on top of the atomic database level, mainly 1) to boost the performance of queries otherwise run against atomic data, 2) to store pre-calculated KPIs, which are usually aggregated data, and optionally 3) to archive historical data not kept at the atomic level (see the sketch after this list).

2. The first-level aggregation is best kept in the same infrastructure as the atomic level (i.e. the relational model). It can be complemented by a second level that uses independent OLAP cube technologies.
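As a minimal sketch of rule 1 (all table and column names here are hypothetical), a first-level aggregate built from an atomic call fact might look like this, living in the same relational database as the atomic level:

  -- First-level aggregate kept in the relational layer (illustrative names).
  CREATE TABLE AGG_CALL_SUBSCRIBER_MONTH (
      month_key       INTEGER       NOT NULL,  -- end-of-month member of the Date dimension
      subscriber_key  INTEGER       NOT NULL,
      nb_calls        INTEGER       NOT NULL,  -- pre-calculated KPI
      total_duration  NUMERIC(12,2) NOT NULL,  -- pre-calculated KPI
      total_revenue   NUMERIC(12,2) NOT NULL,  -- pre-calculated KPI
      PRIMARY KEY (month_key, subscriber_key)
  );

  -- Periodic refresh from the atomic level.
  INSERT INTO AGG_CALL_SUBSCRIBER_MONTH
  SELECT d.month_key,
         f.subscriber_key,
         COUNT(*),
         SUM(f.duration_sec),
         SUM(f.amount)
  FROM   FACT_CALL f
         JOIN DIM_DATE d ON d.date_key = f.date_key
  GROUP  BY d.month_key, f.subscriber_key;

Queries that would otherwise scan the atomic fact can be redirected to the much smaller aggregate, and rows older than the atomic retention window can be kept here only.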


Dimension Entity

1. Dual-role dimensions

1.1. In the context of a migration-count aggregation, only the dimension of interest should play the dual roles (Origin and Destination); all other dimensions functionally dependent on the migration, i.e. those modified by the migration, must be defined and fixed as either Origin or Destination.
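A minimal sketch of 1.1, using hypothetical names for a rate-plan migration count: only the rate plan (the dimension of interest) plays both roles, while the other dimensions are fixed at their post-migration (Destination) state.

  -- Illustrative only: the rate-plan dimension appears twice (Origin/Destination);
  -- the remaining dimensions are taken once, as they stand after the migration.
  CREATE TABLE AGG_PLAN_MIGRATION_MONTH (
      month_key             INTEGER NOT NULL,
      origin_plan_key       INTEGER NOT NULL,  -- dual role: Origin
      destination_plan_key  INTEGER NOT NULL,  -- dual role: Destination
      region_key            INTEGER NOT NULL,  -- fixed at the Destination state
      segment_key           INTEGER NOT NULL,  -- fixed at the Destination state
      nb_migrations         INTEGER NOT NULL,
      PRIMARY KEY (month_key, origin_plan_key, destination_plan_key,
                   region_key, segment_key)
  );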



Fact Entity

1. Definition of the type of fact

1.1. In aggregation, the rule of not mixing two types of facts in the same table (instantaneous event-based vs. interval or end-of-period snapshot-based) can be relaxed. This is because an aggregate usually needs to provide a wealth of information, and because there is a need to limit the number of aggregates to build and maintain.

1.2. (1.1 continued) The mix is only possible on the condition that it does not affect the granularity. Granularity must remain fixed, i.e. a state-based EOP aggregation cannot accept event-based counts if these change the subscriber-EOP-by-axis grain.

1.3. (1.1 continued) If any event measure affects the grain (for example, it may require a BOP-by-axis representation, because the number of subscribers who cancelled under some condition may not exist in the original grain), then it cannot be integrated. In that case, an OLAP cube, which handles the BOP-EOP comparison easily, is much better suited.

1.4. (1.1 continued) Furthermore, great care must be taken when doing this, because the occurrence of the axes is inconsistent: event measures track the axes as they were at the time of the event, while EOP state counts track them at the EOP snapshot time. This implies that computation between the two kinds of measures is not valid by axis.
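To make 1.1-1.4 concrete (hypothetical names, a sketch only): an EOP subscriber aggregate at the month-by-plan-by-region grain can absorb event-based counts such as activations, as long as those counts can be expressed at that exact grain.

  -- Mixed aggregate: the grain stays month x plan x region.
  CREATE TABLE AGG_SUBS_PLAN_REGION_MONTH (
      month_key        INTEGER NOT NULL,
      plan_key         INTEGER NOT NULL,
      region_key       INTEGER NOT NULL,
      nb_subs_eop      INTEGER NOT NULL,  -- snapshot-based: state at end of period
      nb_activations   INTEGER NOT NULL,  -- event-based: counted over the period
      nb_cancellations INTEGER NOT NULL,  -- event-based: counted over the period
      PRIMARY KEY (month_key, plan_key, region_key)
  );
  -- Per rule 1.4: nb_subs_eop uses the plan/region of the subscriber at snapshot
  -- time, while the event counts use them at event time, so an expression like
  -- nb_subs_eop - nb_cancellations is not a valid figure by plan or by region.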

2. Definition of Measures

2.1. In aggregation, it is common practice to store explicit independent measures (e.g. # of active subs, # of ghost subs, or # of SMS, # of normal calls) that correspond to various dimension conditions. This helps decrease the granularity of the aggregate table and removes the dimensions that would only be needed to identify these independent measures.
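A sketch of 2.1 with hypothetical names: instead of carrying a subscriber-status or usage-type dimension (which would multiply the rows), the conditions are folded into independent measure columns.

  -- With a status dimension the grain would be month x region x status;
  -- folding the conditions into measures keeps the coarser month x region grain.
  CREATE TABLE AGG_SUBS_REGION_MONTH (
      month_key        INTEGER NOT NULL,
      region_key       INTEGER NOT NULL,
      nb_active_subs   INTEGER NOT NULL,  -- condition: status = active
      nb_ghost_subs    INTEGER NOT NULL,  -- condition: status = ghost
      nb_sms           INTEGER NOT NULL,  -- condition: usage type = SMS
      nb_normal_calls  INTEGER NOT NULL,  -- condition: usage type = voice
      PRIMARY KEY (month_key, region_key)
  );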

Martin

Thursday, March 06, 2003

Data Warehouse Modeling guideline Part 2: Fact

Here's the second part of my small ad-hoc modeling guideline applicable to data warehouses built on a relational database.

Fact Entity

1. Definition of the type of fact
1.1. Never mix two types of facts in the same fact table (instantaneous event-based vs. interval or end-of-period snapshot-based). Exceptions to this rule are frequent in the aggregation layer (see the rule in the Aggregation section); a sketch follows below.

1.2. A dimension's definition must restrict its use to facts of the same nature (e.g. a dimension defining a transaction type must not be tied to a snapshot-type fact).

1.3. (1.2 continued) The nature of the dimension must be clear enough; in other words, its definition should not depend on the fact it is tied to.
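To illustrate 1.1 (hypothetical names, a sketch only): the event-based and the snapshot-based facts of the same subject area stay in two separate tables, each with its own grain and its own dimensions.

  -- Event-based fact: one row per recharge transaction.
  CREATE TABLE FACT_RECHARGE (
      date_key         INTEGER       NOT NULL,  -- day of the event
      subscriber_key   INTEGER       NOT NULL,
      channel_key      INTEGER       NOT NULL,  -- transaction-type style dimension (rule 1.2)
      recharge_amount  NUMERIC(12,2) NOT NULL
  );

  -- Snapshot-based fact: one row per subscriber per end of month.
  CREATE TABLE FACT_SUBSCRIBER_EOM (
      month_key        INTEGER       NOT NULL,  -- end-of-period snapshot date
      subscriber_key   INTEGER       NOT NULL,
      balance          NUMERIC(12,2) NOT NULL,
      status_key       INTEGER       NOT NULL,
      PRIMARY KEY (month_key, subscriber_key)
  );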



2. Definition of Measures

2.1. Store an attribute as a measure in the fact table only when this detail is necessary (e.g. for legal reasons) but no end-user query will ever use it as a constraint.

2.2. Every attribute with a limited value space should be designed as a dimension, not a measure.

2.3. Avoid using intelligent measures that take on a contextual meaning linked to some business rule. All measures should be modeled denormalised, even if this implies some space wastage.

2.4. (2.3 continued) Because these are usually encapsulated in a "hidden flag", they are subject to misunderstanding and misuse in reports.
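A sketch of 2.2-2.4 under assumed names: the limited-value attribute (call type) becomes a dimension, and the measures remain plain, denormalised columns, with no hidden flag giving any of them a context-dependent meaning.

  -- Rule 2.2: a limited value space attribute is a dimension, not a measure.
  CREATE TABLE DIM_CALL_TYPE (
      call_type_key  INTEGER PRIMARY KEY,
      call_type      VARCHAR(20) NOT NULL   -- 'VOICE', 'SMS', 'DATA', ...
  );

  -- Rules 2.3/2.4: measures are stored explicitly (denormalised).
  CREATE TABLE FACT_USAGE (
      date_key         INTEGER       NOT NULL,
      subscriber_key   INTEGER       NOT NULL,
      call_type_key    INTEGER       NOT NULL REFERENCES DIM_CALL_TYPE,
      gross_amount     NUMERIC(12,2) NOT NULL,
      discount_amount  NUMERIC(12,2) NOT NULL,
      net_amount       NUMERIC(12,2) NOT NULL  -- stored even though derivable
  );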


3. Dimension choice to tie with fact

3.1. All dimensions that qualify the fact should be present, in order to avoid any ambiguity in a fact (the goal is a fully qualified fact). When a single amount is described only partly by one or more dimensions, this must be corrected by adding every dimension needed to fully qualify the amount.

3.2. Limit the number of dimensions in the first implementation of a star (especially for a large fact table) so that it can evolve over time.

3.3. Avoid tying in dimensions that are not directly a characteristic of the fact (e.g. a derived dimension that depends on a separate relationship, such as a subscriber having a has-a relationship with another dimension); these relationships should be modeled in the aggregation strategy.
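A sketch for 3.3, reusing the hypothetical FACT_RECHARGE above: the handset is a characteristic of the subscriber (a has-a relationship), not of the recharge event, so it is not tied to the atomic fact and is only brought in when an aggregate is built.

  -- The derived relationship (subscriber has-a handset) is resolved at
  -- aggregation time instead of being tied to the atomic fact
  -- (AGG_RECHARGE_HANDSET_MONTH is assumed created beforehand).
  INSERT INTO AGG_RECHARGE_HANDSET_MONTH
  SELECT d.month_key,
         s.handset_key,               -- comes from the subscriber dimension
         SUM(f.recharge_amount)
  FROM   FACT_RECHARGE f
         JOIN DIM_DATE       d ON d.date_key       = f.date_key
         JOIN DIM_SUBSCRIBER s ON s.subscriber_key = f.subscriber_key
  GROUP  BY d.month_key, s.handset_key;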



4. Snapshot-based fact

4.1. Although it may seem possible to mix two interval periods (e.g. weeks and months) in the same fact, this will always cause difficulty with future requirements, usually forcing the addition of new facts. Note that this applies only when the same Date hierarchy is used for all interval periods, meaning that periods ending on the same day share the same surrogate key.
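To see why (hypothetical names): if weekly and monthly snapshots share one fact table and one Date hierarchy, a week and a month ending on the same day carry the same date surrogate key, and an apparently harmless query silently adds the two rows together.

  -- Weekly and monthly snapshot rows mixed in one table: whenever a week and a
  -- month end on the same day (same surrogate key), both rows fall into the
  -- same group and the subscriber count is overstated.
  SELECT f.date_key,
         SUM(f.nb_subs_eop) AS nb_subs
  FROM   FACT_SUBSCRIBER_SNAPSHOT f
  GROUP  BY f.date_key;

The usual fix is a new fact table per interval period, which is exactly the future rework the rule warns about.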


5. Exceptions in fact

5.1. Exception handling for a fact is necessary whenever the fact is not applicable throughout the entire dimension space. This is handled with foreign keys (usually negative keys) pointing to an exception row in each of the irrelevant dimensions.

5.2. (5.1 continued) Exceptions that apply to the entire fact (fact-scope) require specific exception values stored across the whole dimension hierarchy structure.

5.3. (5.1 continued) These fact-scope exceptions are commonly of the following nature: 1- Error, 2- Unknown, 3- Not Applicable, 4- To be defined, 5- No history, 6- No reference, etc.
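A sketch of section 5 with hypothetical names: the dimension carries exception rows with negative surrogate keys, and a fact that is not applicable for that dimension references one of them instead of a NULL foreign key.

  -- Exception members stored in the dimension itself, with negative keys.
  INSERT INTO DIM_DEALER (dealer_key, dealer_name) VALUES (-1, 'Error');
  INSERT INTO DIM_DEALER (dealer_key, dealer_name) VALUES (-2, 'Unknown');
  INSERT INTO DIM_DEALER (dealer_key, dealer_name) VALUES (-3, 'Not Applicable');

  -- An activation sold through a channel with no dealer points to 'Not Applicable'.
  INSERT INTO FACT_ACTIVATION (date_key, subscriber_key, channel_key,
                               dealer_key, activation_fee)
  VALUES (1168, 1001, 3, -3, 10.00);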

Martin