Here's the third and final part of my guideline. This time I'm covering, very briefly, a complex subject: the aggregation strategy.
General
1. A first level of aggregation should be built on top of the atomic database level, mainly 1) to boost the performance of queries done on atomic data, 2) to store pre-calculated KPIs, which are usually aggregated data, and optionally 3) to archive historical data not kept at the atomic level.
2. The first aggregation level is best kept in the same infrastructure as the atomic level (i.e. the relational model). It can be accompanied by a second level built on independent OLAP cube technologies.
Dimension Entity
1. Dual-role dimension
1.1. In the context of a migration count aggregation, only the dimension of interest should play the dual roles (Origin and Destination); all other dimensions functionally dependent on the migration (i.e. modified by the migration) must be defined and fixed as either Origin or Destination.
Fact Entity
1. Definition of the type of fact
1.1. In aggregation, the rule of never mixing two types of facts in the same table (instantaneous event-based vs. interval or end-of-period snapshot-based) can be relaxed. This is because an aggregate usually needs to provide a wealth of information, and because there is a need to limit the number of aggregates to build and maintain.
1.2. (1.1 continued) The mix is only possible on the condition that it does not affect the granularity. Granularity must always remain fixed, i.e. a state-based EOP aggregation cannot accept event-based counts if these change the subscriber-EOP-by-axis grain.
1.3. (1.1 continued) If any event measure affects the grain (representing it may require a BOP-by-axis combination, because subscribers who cancelled under some condition may no longer exist in the original grain), then it cannot be integrated. Such cases are much better suited to an OLAP cube, which deals easily with BOP-EOP comparisons.
1.4. (1.1 continued) Furthermore, great care must be taken here, because the axis values are recorded at inconsistent moments: an event tracks the axis values at the time of the event, while an EOP state count tracks them at snapshot time. This implies that computations between the two kinds of measures are not valid per axis.
2. Definition of Measures
2.1. In the aggregation, it is common practice to store independent measures explicitly (e.g. number of active subs, number of ghost subs, or number of SMS and number of normal calls) corresponding to various dimension conditions. This reduces the granularity of the aggregation table and lets us drop the dimensions that were only needed to identify those measures.
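As a minimal sketch of rule 2.1 (all table and column names here are my own invention, not from any real schema), pivoting an event-type dimension into explicit measure columns reduces the number of rows at the aggregate level:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Atomic fact: one row per (day, subscriber, event_type).
cur.execute("CREATE TABLE fact_usage (day TEXT, sub_id INTEGER, event_type TEXT, cnt INTEGER)")
cur.executemany("INSERT INTO fact_usage VALUES (?,?,?,?)", [
    ("2003-03-01", 1, "SMS", 5),
    ("2003-03-01", 1, "CALL", 2),
    ("2003-03-01", 2, "SMS", 1),
])

# Aggregate: event_type is dropped from the grain, each type becomes a column.
cur.execute("""
    CREATE TABLE agg_usage AS
    SELECT day, sub_id,
           SUM(CASE WHEN event_type = 'SMS'  THEN cnt ELSE 0 END) AS nb_sms,
           SUM(CASE WHEN event_type = 'CALL' THEN cnt ELSE 0 END) AS nb_calls
    FROM fact_usage
    GROUP BY day, sub_id
""")

agg = cur.execute("SELECT day, sub_id, nb_sms, nb_calls FROM agg_usage ORDER BY sub_id").fetchall()
print(agg)  # two rows instead of three: the event_type dimension is gone
```

The same CASE-pivot pattern applies in Oracle; sqlite is used here only to keep the sketch self-contained.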
Martin
Tuesday, March 11, 2003
Thursday, March 06, 2003
Data Warehouse Modeling guideline Part 2: Fact
Here's the second part of my small ad-hoc modeling guideline applicable to data warehouses using relational databases.
Fact Entity
1. Definition of the type of fact
1.1. Never mix two types of facts in the same fact table (instantaneous event-based vs. interval or end-of-period snapshot-based). Exceptions to this rule are frequent in the aggregation layer (see the rule in the Aggregation section).
1.2. A dimension's definition must restrict its use to facts of the same nature (e.g. a dimension defining a transaction type must not be tied to a snapshot fact).
1.3. (1.2 continued) The nature of the dimension must be clear enough; in other words, its definition should not depend on the fact it is tied to.
2. Definition of Measures
2.1. Store an attribute as a measure in the fact table only when the detail is necessary (e.g. for legal reasons) but no end-user query will use it as a constraint.
2.2. Every attribute with a limited value space should be designed as a dimension, not a measure.
2.3. Avoid "intelligent" measures whose meaning depends on context through some business rule. All measures should be modeled denormalised, even if this implies wasted space.
2.4. (2.3 continued) Because such measures are usually encapsulated in a "hidden flag", they are subject to misunderstanding and misuse in reports.
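Here is a small sketch of the pitfall in rules 2.3-2.4, with an invented schema: a single amount whose meaning hides behind a flag vs. two explicitly denormalised measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Intelligent" design: one amount whose meaning depends on a hidden flag.
# A report that forgets the flag silently adds credits and debits together.
cur.execute("CREATE TABLE fact_flagged (amount REAL, is_credit INTEGER)")
cur.executemany("INSERT INTO fact_flagged VALUES (?,?)", [(100.0, 1), (40.0, 0)])
naive_total = cur.execute("SELECT SUM(amount) FROM fact_flagged").fetchone()[0]

# Denormalised design: one explicit, self-describing measure per meaning.
cur.execute("CREATE TABLE fact_explicit (credit_amount REAL, debit_amount REAL)")
cur.executemany("INSERT INTO fact_explicit VALUES (?,?)", [(100.0, 0.0), (0.0, 40.0)])
credits, debits = cur.execute(
    "SELECT SUM(credit_amount), SUM(debit_amount) FROM fact_explicit").fetchone()

print(naive_total)       # 140.0 -- misleading without the flag
print(credits - debits)  # 60.0  -- unambiguous net position
```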
3. Dimension choice to tie with fact
3.1. All dimensions that qualify a fact should be present, in order to avoid any uncertainty in the fact (the goal is a fully qualified fact). When an amount is only partly described by its dimensions, correct this by adding every dimension needed to fully qualify the amount.
3.2. Limit the number of dimensions in the first implementation of a star (especially for a large fact table), so it can evolve over time.
3.3. Try not to tie in dimensions that are not directly a characteristic of the fact (e.g. a derived dimension that depends on another relationship, such as a subscriber having a has-a relationship with another dimension); such relationships should be modeled in the aggregation strategy.
4. Snapshot-based fact
4.1. Although it seems possible to mix two interval periods (e.g. weeks and months) in the same fact, this always leads to difficulty with future requirements, usually implying the addition of new facts. Note that this applies only when the same Date hierarchy is used for all interval periods, meaning that periods ending on the same day use the same surrogate key.
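A minimal sketch of the pitfall in 4.1 (the schema and key values are mine, not from the guideline): when a week-end and a month-end snapshot share the same date surrogate key, any query that forgets to constrain the period type silently mixes the two populations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One date surrogate key (42) for 2003-03-31, which closes both a week and a month.
cur.execute("CREATE TABLE fact_snapshot (date_key INTEGER, period_type TEXT, active_subs INTEGER)")
cur.executemany("INSERT INTO fact_snapshot VALUES (?,?,?)",
                [(42, "WEEK", 1000), (42, "MONTH", 1000)])

# Naive EOP query by date key: the same population is counted twice.
naive = cur.execute(
    "SELECT SUM(active_subs) FROM fact_snapshot WHERE date_key = 42").fetchone()[0]

# A correct query must always constrain the period type as well.
monthly = cur.execute(
    "SELECT SUM(active_subs) FROM fact_snapshot "
    "WHERE date_key = 42 AND period_type = 'MONTH'").fetchone()[0]

print(naive, monthly)  # 2000 vs 1000
```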
5. Exceptions in fact
5.1. Exception handling for a fact is necessary whenever the fact is not applicable throughout the dimension space. It is handled with foreign keys (usually negative keys) pointing to an exception member in each irrelevant dimension.
5.2. (5.1 continued) Exceptions that apply to the entire fact (fact-scope) require specific exception values stored across the dimension's whole hierarchy structure.
5.3. (5.1 continued) These fact-scope exceptions are commonly of the following kinds: 1- Error, 2- Unknown, 3- Not Applicable, 4- To be defined, 5- No history, 6- No reference, etc.
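A toy sketch of rules 5.1-5.3 (dimension names and key values are hypothetical): seeding the dimension with negative-key exception members keeps every fact row reachable through plain inner joins, with no NULL keys and no outer joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension seeded with negative-key exception members (see 5.3).
cur.execute("CREATE TABLE dim_dealer (dealer_key INTEGER PRIMARY KEY, dealer_name TEXT)")
cur.executemany("INSERT INTO dim_dealer VALUES (?,?)",
                [(-3, "Not Applicable"), (-2, "Unknown"), (1, "Dealer A")])

# Facts always carry a non-null key, even when the dimension is irrelevant.
cur.execute("CREATE TABLE fact_sale (dealer_key INTEGER NOT NULL, amount REAL)")
cur.executemany("INSERT INTO fact_sale VALUES (?,?)", [(1, 50.0), (-3, 20.0)])

# Inner joins keep every fact row; no outer join or NULL handling required.
rows = cur.execute("""
    SELECT d.dealer_name, SUM(f.amount)
    FROM fact_sale f JOIN dim_dealer d ON f.dealer_key = d.dealer_key
    GROUP BY d.dealer_name ORDER BY d.dealer_name
""").fetchall()
print(rows)
```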
Martin
Tuesday, February 25, 2003
Data Warehouse Modeling guideline Part 1: Dimension
Here's the first part of my ad-hoc modeling guideline, listing general advice as well as more specific tips related to dimensions. Please bear with all this "techie jargon"; it is mostly stolen from R. Kimball's terminology for dimensional data models. This guideline assumes a good understanding of data warehouse concepts and data modeling.
Atomic Data Modeling (in the sense of most granular transactional data)
General
1. Limit the first implementation to a minimum number of star schemas and data sources. Usually a first implementation involves a single transaction table accompanied by its snapshot state.
2. Avoid generic terms when naming dimensions and columns, especially at the start of a datamart implementation.
3. Try to impose as many constraints as possible ("Not Null") on all attributes up front, not after the creation of the table. A fact measure or foreign key should never be Null, to facilitate indexing and ETL processing.
4. Avoid mixing transaction-based and snapshot-based fact records in the same fact table. This usually results in a variable-time-interval fact with two timestamp links, an Effective date and an End date: the Effective date is usually triggered by an event, while the period defines a state that ends at the End date. This brings a number of limitations: 1- difficulty evolving the model with new dimensions, since any change affects the record's previous history; 2- the time axis has no unique definition in a reporting tool such as BusinessObjects; 3- EOP-type queries, which count the number of records having a certain status, are usually slower and more cumbersome in SQL.
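As a hedged illustration of limitation 3 in the rule above (the schema is invented for the example), an EOP count against an Effective/End interval fact needs a range predicate over two dates, where a pure snapshot fact would need only an equality on a date key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Interval-state fact: one row per status period, open periods end at '9999-12-31'.
cur.execute("""CREATE TABLE fact_status (
    sub_id INTEGER, status TEXT, effective_date TEXT, end_date TEXT)""")
cur.executemany("INSERT INTO fact_status VALUES (?,?,?,?)", [
    (1, "ACTIVE",    "2003-01-10", "9999-12-31"),
    (2, "ACTIVE",    "2003-01-15", "2003-02-20"),
    (2, "SUSPENDED", "2003-02-20", "9999-12-31"),
])

# EOP count at 2003-02-28 must probe both dates of every period.
eop = cur.execute("""
    SELECT COUNT(*) FROM fact_status
    WHERE status = 'ACTIVE'
      AND effective_date <= '2003-02-28' AND end_date > '2003-02-28'
""").fetchone()[0]
print(eop)  # 1 -- only subscriber 1 is still active at end of February
```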
Dimension Entity
1. Junk dimension
1.1. Use a junk dimension to gather various kinds of unrelated, unstructured information that would otherwise generate too many mini-dimensions.
1.2. Do not use a junk dimension when the data is volatile and/or the domain of information it covers tends to grow.
1.3. A junk dimension can sometimes be used when multiple dimensions must be tied into a large-volume star schema that cannot easily accommodate new dimensions.
1.4. A junk dimension usually needs to be accompanied by a database view that makes more sense to the business user (this can also be done at the reporting-tool level).
1.5. A junk dimension can sometimes be used to support a many-to-many relation with the fact. This is only recommended when the many-to-many relationship is bounded by a relatively small number (i.e. the number of columns needed in the dimension is small and static).
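A minimal junk-dimension sketch for rules 1.1-1.5 (flag names are hypothetical): the cross product of a few small unrelated flags is materialised once, and the fact then carries a single key instead of several mini-dimension keys.

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Junk dimension: the cross product of small, unrelated flags, built once.
cur.execute("""CREATE TABLE dim_junk (
    junk_key INTEGER PRIMARY KEY, payment_type TEXT, roaming_flag TEXT)""")
combos = product(["PREPAID", "POSTPAID"], ["Y", "N"])
cur.executemany("INSERT INTO dim_junk (payment_type, roaming_flag) VALUES (?,?)", combos)

# The fact carries one junk key instead of two mini-dimension keys.
cur.execute("CREATE TABLE fact_call (junk_key INTEGER NOT NULL, duration INTEGER)")
key = cur.execute("""SELECT junk_key FROM dim_junk
    WHERE payment_type = 'PREPAID' AND roaming_flag = 'Y'""").fetchone()[0]
cur.execute("INSERT INTO fact_call VALUES (?, 120)", (key,))

n_members = cur.execute("SELECT COUNT(*) FROM dim_junk").fetchone()[0]
print(n_members)  # 4 rows cover every flag combination
```

This stays manageable only while the flag domains are small and static, which is exactly the bound stated in 1.5.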
2. Big dimension
2.1. Try to split a big dimension into smaller dimensions; otherwise SCD type II is unrealistic to implement.
2.2. (2.1 continued) For example, customer can often be split into entities related to Identity, Account, Demographic/Geographic, and value-based derivable attributes.
2.3. (2.1 continued) The relationships between these new dimensions can be treated as facts, i.e. plugged directly into a history fact table (e.g. customer relating to demographic/geographic), or linked directly as snowflakes (usually an SCD type I link).
2.4. A big dimension often hides multiple smaller dimensions.
3. Interrelated dimensions
3.1. Do not store attributes from an external dimension inside another dimension (e.g. customer). These usually represent a relationship (usually the current situation) that should fundamentally be evaluated through a fact.
3.2. (3.1 continued) Storing these relationships in a dimension only creates data redundancy and is prone to inconsistency.
3.3. (3.1 continued) If no history and no detail are needed from those relationships (i.e. there is no need to evaluate them as facts), then the dimensions should be merged into the most significant one.
3.4. (3.1 continued) A compromise must be struck between the size of a dimension and the number of smaller dimensions (see the Big dimension tips). This recommendation has exceptions, especially when the relationship between two dimensions does not change and the link between them is difficult to obtain through the fact (e.g. Customer relating to Dealer). In that case, duplicating the link into the most significant dimension (e.g. customer) may help and improve performance for aggregate building or query restitution.
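A tiny sketch of the exception in 3.4 (names invented): a stable Customer-to-Dealer link duplicated into the customer dimension lets query restitution join dimension to dimension directly, without a detour through a large fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stable link duplicated into the customer dimension (the exception in 3.4).
cur.execute("""CREATE TABLE dim_customer (
    cust_key INTEGER PRIMARY KEY, cust_name TEXT, dealer_key INTEGER)""")
cur.execute("""CREATE TABLE dim_dealer (
    dealer_key INTEGER PRIMARY KEY, dealer_name TEXT)""")
cur.execute("INSERT INTO dim_dealer VALUES (10, 'Dealer A')")
cur.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 10)")

# Query restitution joins dimension to dimension directly, no fact needed.
row = cur.execute("""
    SELECT c.cust_name, d.dealer_name
    FROM dim_customer c JOIN dim_dealer d ON c.dealer_key = d.dealer_key
""").fetchone()
print(row)  # ('Alice', 'Dealer A')
```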
4. Date dimension
4.1. A date-aggregated dimension (a grouping of days) usually encompasses inconsistent time periods. These periods should be modeled as independent hierarchy structures. A period without a fixed number of days (e.g. the week period as defined in Oracle) should carry a Period_size indicator for averaging.
4.2. (4.1 continued) This offers the advantage of a single entry point (one dimension to create) for storing all the various independent period structures, such as week-year, month-quarter-year, holiday-year, etc.
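To show how the Period_size indicator of 4.1 is used (the member names and counts are made up), daily averages computed with it stay comparable across periods of different lengths:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Period members of different lengths carry a Period_size for averaging.
cur.execute("""CREATE TABLE dim_period (
    period_key INTEGER PRIMARY KEY, period_name TEXT, period_size INTEGER)""")
cur.executemany("INSERT INTO dim_period VALUES (?,?,?)",
                [(1, "2003-W09", 7), (2, "2003-02", 28)])

cur.execute("CREATE TABLE agg_usage (period_key INTEGER, total_calls INTEGER)")
cur.executemany("INSERT INTO agg_usage VALUES (?,?)", [(1, 70), (2, 560)])

# Daily averages stay comparable between a week and a month.
rows = cur.execute("""
    SELECT p.period_name, a.total_calls * 1.0 / p.period_size
    FROM agg_usage a JOIN dim_period p ON a.period_key = p.period_key
    ORDER BY p.period_key
""").fetchall()
print(rows)  # [('2003-W09', 10.0), ('2003-02', 20.0)]
```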
5. Exceptions in dimension
5.1. Dimension-scope exceptions apply whenever a hierarchy-level attribute is not applicable for a specific dimension member. That attribute must then take the exception value.
Martin
Friday, February 21, 2003
Data Warehouse
My professional work and interests have long involved the creation and modeling of very large databases designed for reporting and analytical needs, called data warehouses. These systems, also referred to as OLAP or On-Line Analytical Processing (a term coined after its counterpart OLTP, On-Line Transaction Processing, where data is volatile and transaction-based), can be implemented in multiple ways:
- ROLAP, which uses a relational engine (exactly like the DBMS used in OLTP, e.g. Oracle)
- MOLAP, which uses a multi-dimensional engine (e.g. Cognos PowerPlay)
- HOLAP, a combination of the two approaches, often keeping the atomic fact data in the relational database and building the aggregations for each dimension inside a MOLAP cube (e.g. Microsoft Analysis Services)
The complexity of these systems arises mostly from their data source characteristics (high volume, highly heterogeneous, medium quality, etc.), from their performance expectations (they should report on millions of facts in a split second!), from multi-department initiatives with often misaligned requirements (in the case of a corporate warehouse), from a challenging selling point (supposedly able to report on any data produced within the company), and from the vast technology offering on the market.
For those interested in this subject, I recommend the books of Ralph Kimball, who has developed some interesting theory with a very pragmatic approach.
In the next few notes I'll share some guidelines I've written after experimenting with modeling real-world data warehouses using the relational paradigm (i.e. ROLAP). These notes give unstructured tips and hints for building dimensional models in a relational database. Although they are based on real experience on projects implemented in Oracle, they are far from exhaustive.
Martin
Web site journey
I've probably hesitated too long trying to decide what form this site should take... should it be a blog, a simple static personal page, or a fancy dynamic CMS (Content Management System)? I started looking for a blog system in open source (OK, I'm biased toward Java, since it is the language I have experience with, and probably too lazy to learn an easier one such as PHP ;-). So I installed and played a bit with tools such as Pebble and Roller Weblogger.
These tools worked perfectly fine and seem well designed, so I even posted my first few notes with them. Later I decided to look for a more organized site structure, with a clearer hierarchical content form than what a weblog, with its calendar-dependent organisation, could offer. I realized that the CMS concept offers a very convenient and powerful way to keep a dynamic personal site. I checked and found there are literally thousands of CMSs just among the open source ones, but luckily for me most are PHP-based, so they were quickly eliminated ;--))! I even installed Magnolia for a couple of months and created a great deal of content.
However, I had assumed from the beginning that hosting these applications would be a simple matter of installing a Tomcat server on my desktop and loading it up with the web app! Although this turned out to be true, the biggest obstacle was not software installation and configuration but hardware networking: my upload Internet link is simply way too slow to encourage even 14.4Kb modem users to check my site! (not to mention the trouble of running a 24/7 server in my basement...).
I could have looked for web hosting with Tomcat installed (there's plenty of it), but in the end I decided to simply go the easy way and registered with Blogger!
Martin
Tuesday, February 18, 2003
First Weblog
This blog contains some information concerning me (Martin Ouellet), my professional interests, a little of my personal interests, and any other subject I'm willing to share with the entire world (not literally, but the Internet makes that a potential reality...).
Why did I choose to publish this site? First and foremost because I'm very interested in new technology, more specifically the Internet, Java, data warehousing, data mining, telecom and software development in the open source community, and this site simply reiterates that interest.
Secondly, because I'm away from my family and friends from time to time (I've moved back and forth between Canada and Europe in past years), this lets me stay in contact with those who count most to me.
In the attached image you'll see me walking with my first child and enjoying my time in Lausanne (directly behind me was my former apartment in "la cité", in the heart of the old town).