Tuesday, February 25, 2003

Data Warehouse Modeling guideline Part 1: Dimension

Here's the first part of my ad-hoc modeling guideline listing general advice as well as more specific ones related to dimension. Please bare with me all this "techie jargon", it is mostly stolen from R. Kimball terminology in dimensional data model. This guideline assumes a good understanding and knowledge on datawarehouse concept and data modeling.


Atomic Data Modeling (in the sense of most granular transactional data)

General

1. Limit the first implementation to a minimum number of star schema and number of source data. Usually a first implementation would involve a single transaction table accompanied with its snapshot state.

2. Avoid the use of generic terms for naming dimension and columns, especially at the start of the implementation process of datamarts.

3. Try to impose as many constraints as possible ("Not Null") for all attributes upfront and not after the creation of the table. Fact measure or Foreign Key should never be equal to Null to facilitate indexing and ETL processing.

4. Avoid the mix of transaction based and snapshot based fact records inside the same fact table. This usually results in a variable time interval fact with two links with timestamp: an Effective date and an End date, the effective is usually triggered by an event while the period defines a state which finish at the End date. This generates a number of limitations such as: 1- difficulty to evolve the model with new dimension, as when we add any changes this affects the previous history of record, 2- time axis does not have a unique definition in reporting tool such as BusinessObjects, 3- query of type EOP which counts number of record having certain status are usually slower and more cumbersome in SQL.



Dimension Entity

1. Junk dimension

1.1. Use junk dimension to gather various types of unrelated and unstructured information that would generate too many mini-dimensions.

1.2. Do not use junk dimension when data is volatile and/or the domain of information covered by these tends to grow.

1.3. Junk dimension can sometimes be used when there is a need to tie multiple dimensions into a large volume star schema that cannot easily accommodate for new dimensions.

1.4. Junk dimension usually needs to be accompanied with database View that make more sense for the business user (can be done at reporting tool level as well)

1.5. Junk dimension can sometimes be used to support a many-to-many relation with the fact. This is only recommended for a dimension having the many-to-many relationship upper bounded with a relatively small number (i.e. the number of columns necessary to have in the dimension is small and static).


2. Big dimension

2.1. Try to split this dimension into smaller dimensions, otherwise the SCD type II is unrealistically implemented.

2.2. (2.1 continued) For example, customer can often be splitted into entities related with: Identity, Account, Demographic/Geographic and Value-based derivable attributes.

2.3. (2.1 continued) The relationship between these new dimensions could be evaluated as fact, i.e. plugged directly into a history fact table (e.g. customer relating to demographic/geographic) or directly linked with snowflakes (usually the link is SCD type I).

2.4. Big dimension often hides multiple smaller dimensions.


3. Interrelated dimensions

3.1. Do not store attributes from external dimension into another dimension (e.g. customer). These usually represent relationship (usually current situation) that should fundamentally be evaluated through a fact.

3.2. (3.1 continued) Storing these relationships into dimension only creates data redundancy and is prone to data inconsistency.

3.3. (3.1 continued) If no history is needed from those relationships and no detail is necessary (i.e. there is no need to evaluate them as fact), then all dimensions should be merged into the most significant one.

3.4. (3.1 continued) Compromises must be taken between the size of the dimension vs the number of smaller dimensions (see Big dimension tips). This recommendation can have exception, especially when relationship between the two dimensions does not change and that the link between the two is difficult to obtain through the fact (e.g. Customer relating to Dealer). For this exception, it may help and improve performance for aggregation building or query restitution to duplicate the link into the most significant dimension (e.g. customer).


4. Date dimension

4.1. Date aggregated dimension (grouping of days) usually encompasses inconsistent time period. These periods should be modeled as independent hierarchy structures. Period not having a fixed number of days (e.g. week period as defined in Oracle) should have a Period_size indicator for averaging.

4.2. (4.1 continued) This offers the advantage of having a single entry point (one dimension to create) for storing all various independent period structures such as week-year, month-quarter-year, holiday-year, etc.


5. Exceptions in dimension

5.1. Dimension-scope exceptions are applicable whenever there is a hierarchy level attribute not applicable for the specifics dimensions member. This attribute must take the exception value.

Martin

Friday, February 21, 2003

Data Warehouse

My professional work and interest have long been involving the creation and modeling of very large databases designed for reporting and analytical needs called, data warehouse. These systems also referred to as OLAP or On-line Analytical Processing (a term coined after its couterpart called OLTP, On-Line Transaction Processing in which the data is volatile and transaction-based) can be implemented in multiple ways:
  • ROLAP which uses a relational engine (exactly like DBMS used in OLTP, e.g. Oracle)
  • MOLAP using a multi-dimensional engine (e.g. Cognos PowerPlay)
  • HOLAP which is a combination of the two approach often keeping the atomic fact data in the relational database and build the aggregation for each dimension inside a MOLAP cube (e.g. Microsoft Analysis Services)

These systems complexity arise mostly from its data source characteristics (high volume, high heterogeneous, medium quality, etc.), from its performance expectation (should report millions of facts in split second! ), from its multi-department initiative and often misaligned requirement (in the case of corporate warehouse), from its challenging selling point (supposedly can report on any data produced within the company), and from its vast technology offer on the market.

For those interested more on this subject, I would recommend books from Ralph Kimball who has developed some interesting theory taking very pragmatic approach.

In the next few notes, I'll be sharing some Guidelines I've written after experimenting with modeling real-world datawarehouse using the relational paradigm (i.e. ROLAP). These notes will give unstructured tips and hints for building dimensional model into relational database. Although my notes are based on real experience done on projects implemented in Oracle, they are far from being exhaustive.

Martin

Web site journey

I've hesitated probably too much time in trying to decide which form should this site be in... (should it be a blog, a simple and static personal page, or a fancy dynamic CMS (Content Management System). I started looking for blog system in the open source (ok I'm biased toward using Java since it is the language I have experienced with and also probably too lazy in learning an easier one such as PHP ;-). So I've installed and played a bit with tools such as Pebble and Roller Weblogger.

These tools worked perfectly fine and seems to be well designed so I even posted my first few notes using these. Later I decided to look for a more organized site structure with clearer hierarchical content form than what a weblog, with its calendar-dependent organisation, could offer. I realized that the concept of CMS (content management system) offer very convenient and powerful way to keep a dynamic personal site. So I've checked and found out that there are literally thousands of CMS only looking at the open source ones, but lucky for me most of them are PHP-based, so they were quickly eliminated ;--))! I even installed Magnolia for a couple of months and created a great deal of content.

However, I'd assumed from the beginning that hosting these applications would be a simple matter of installing a Tomcat server on my desktop and loaded it up with the web app! Although this turns out to be true, the biggest obstacle was not related to software installation and configuration but rather to hardware networking: my upload Internet link is simply way toooo sloooow to encourage even the 14.4Kb modem users to check my site! (without mentioning the trouble with having a 24/7 server in my basement...).

I could have looked up for an web hosting with Tomcat installed (there's plenty of them), but then I decided to simply go the easy way and registered with Blogger!

Martin

Tuesday, February 18, 2003

First Weblog


This blog contains some information concerning me (Martin Ouellet), my professional interest, a little of my personal interest, and any other subject that I'm willing to share to the entire world (not literally, but this becomes a potential reality with Internet...).

Why did I choose to publish this site, first and foremost because I'm very interested in new technology and more specifically the Internet, Java, Data Warehouse, Data mining, Telecom, Software development in the open source community and this site simply re-iterates this interest.

Secondly, because being away from my family and friends from time to time (I've moved back and forth from/to Canada and Europe in past years) this allows me to stay in contact with those that count most to me.

You'll see in the attached image myself walking with my first child and enjoying my time in Lausanne (directly back from me was my former apartment in "la cité" in the heart of the old town).