Friday, June 12, 2009

Star Schema Limitation

The star schema is probably the most over-used data modeling paradigm used today in data warehousing. It is very attractive due to its simplicity (especially important for business) and its pervasiveness, and as such people has the tendency to overuse it at every layer of the BI data platform.

There is one use case where its use quickly become problematic and causes more harm than benefice. This is when we want to track-down all slowly changes occurring around the Customer of the enterprise (this term is used loosely here, as it can mean different thing such as subscriber, client, account, etc..). The slowly changes referred here must not be confused with simple attributes (e.g. customer age, name, ...) usually handled within the dimension itself using specific slowly changing dimension rules (SCD type 1,2,3,.., 6 and still counting). Here, we refer to important Indicator information usually spread out in separate Dimensions or other data structure.

More specifically, these slowly changes can be comprised of very general Indicator characterizing the Customer either directly obtained from operational system (such as account type, price plan, subscriber status, address, dealer and channel, etc) or features derived/calculated from data within the integration layer of the BI platform (such as customer value, churn likelihood, behavior, customer segmentation, etc).

The star schema can be used for this goal: we simply design the model around all Dimension representing the features being tracked down, along with the dimension Customer and two links to Date/time Dimensions in order to define the validity period of the fact row (the start and end date). In other words, each fact row represents a new state for the Customer in time and is created whenever one or more slowly changes (characterized by dimension) have occurred.

This approach offers a more or less standardized approach in BI (star schema model) allowing a very simple Query logic with multiple dimensions (indicators) to be reported simultaneously. However, this advantage is trade-off by a number of important negative issues :


  • Flexibility: Physical data model is not amenable to easily permit changes in terms of new dimension/indicator as this requires a new version of the model every time (people working in large organization will surely appreciate the implication cost of changing data model inside database in Production!!).
  • Extensibility: There is a limit on the number of dimension/indicator that can be added into the structure as this requires a significant cost in terms of disk storage for every new dimension/indicator. The star schema is simply not designed to track down a variable number of indicators dynamically.
  • Volumetry: Data structure leads to a great deal of disk storage waste as dimension/indicator are duplicated in all rows even when their values remain unchanged (ex. Dealer or Geography indicator are very unlikely to change throughout the Customer history but their values will be repeated for each and every new Customer fact-row state).
  • Clarity: This particular model mixes two concepts into a single fact table (1. event-based fact and 2. snapshot-based fact representing the state at fixed point in time) blurring its role and increasing query complexity for all EOP, BOP type of reports.
  • Loading: The ETL feed is likely to be more complex with this type of structure, which results that only minor changes are done over time and typically no new information element is rarely added.


There are better solution that we can model in order to provide such typical historized view of the Customer evolution in time. One such solution is presented next with a simple logical model:

This solution is sometimes dubbed 1-dimensional model (because everything evolves around the generic Indicator table), or I've seen alternative terminology like this one ... 
The more relational inclined will note that this model is flawed as the effective and expiration date are duplicated info (the expiration date is directly linked to effective date of the subsequent row for a specific customer.. ) and as such should be constraint to ensure integrity by the DBMS engine.  This is a good point and can simply be avoided by removing one of the field (see one good example), but to me it is a physical implementation detail:  should the data model enforce this rule or could we trust the SQL feeding codes to handle it (obviously more risky).  To keep things simple, I've decided to only show the logical data model, leaving this implementation choice as a side-note.    

The table 'Customer Indicator Association' is central to this data structure and is similar to the Fact table whose role is to capture all association customer-indicator in time. Within this context, the Indicator table corresponds to dimension records found in various dimension table, whereas the Indicator Type corresponds to the actual Dimension.

The pre-requisite of this data model is that all Customer must be only associated to only one Indicator value by Indicator Type at any point in time. Some other important aspects of this data structure is summarized here:
  • Each time state of Customer would change for one Indicator value, we would generate a new fact row with the corresponding Effective start date (and Expiration from date set to some infinite key date) and the previous state would be expired (update done on the expired to).
  • Only necessary changes are tracked down, .... meaning a Customer whose state never changed from Activation (for a specific Indicator) would only have a single fact row for this particular Indicator.
    Date (start and end) correspond to real date when the change of indicator occurred: usually based on operational system date (for Indicator obtained from these systems) or business rule for Indicator derived from integration layer (e.g. Customer Value re-calculation period).
  • No elapse time period is allowed between the previous Expired date and the current Effective Date. This ensures data integrity state to state, so that



Key implementation decisions that need to be chosen with this data structure :
  • How to align all Indicators once a new customer is activated. Two options are possible: concurrently aligned all Indicators with the first Start date even if some Indicator still has no value, or yet allow for Indicator in time.
  • How to close Indicators once a Customer cancels all activity (no more customer). Again, we can align all Indicators with a transition state such as 'no longer customer', or leave with the latest Indicator value before cancellation occurred.
  • How to treat technical changes of Indicator (i.e. it is not rare in operational system that Customer change status within a short period of time to return to original status... technical changes without real consequence). A grace period can be defined to account and ignore for technical changes.


The indicator association table will easily supports the full historization of all slowly changing Customer related indicators (existing ones, new ones, modified one and future ones) without any model change. This has the potential to become the central and only place in your BI data platform where to hold atomic indicators for Customer.


Obviously, this solution will present the flip-side disadvantage of the more traditional Star schema solution. The most important one aspect concerns the Reporting extraction logic which is more complex and less optimal. In essence, this structure is not aimed toward end-user Reporting since more complex and far less performant Query logic must be employed for data extraction (this is somewhat mitigated with the use of latest SQL analytical functions). This is especially true since most query involve multiple predicate conditions imposed on a number of dimension. So usually, a downstream data structure would exist more geared toward end-user query (a sort a flatten version where all Indicators value would be transposed back to columns...).

Martin