Friday, March 01, 2013

Data Vault model: Mobile Telecom example (part-3)

Here's my last note concerning "Data Vault model: Mobile telecom example" where some points and issues are discussed.


Open Points and Discussion

→Business Key inconsistencies
Among all OLTPs used as source, it is very unlikely that they will all share exact same Key(s) to represent same business entity !   If we are lucky, keys will vary in format only, but more frequently, keys can be totally different.

This challenge is referred in the DV community as the "Business Key Alignment", where raw keys coming from OLTPs must somehow be integrated (and thus aligned) toward the universal EWBK.
Next point discusses these points for the most important Key of all Telco:  Subscriber Business Key.

→Subscriber identification
In telco, any subscriber is uniquely identifiable through its phone number (MSISDN) but only at given point in time (subscribers can change phone number!).  This makes MSISDN only a valid temporal-based Key.  We are normally more interested in tracking people than number,  then we certainly need an alternative business key.   Billing systems already accounted for these changes of numbers, so these should be used as the Master Data Source for our Subscriber Hub’s Business key.   Other transactional systems in telco, will often make use of temporal-based keys, that is fine in for transaction-based context that always refer to a point in time.   Another example of this is the IMSI which is the unique subscriber number soft-wired at the SIM card level, and somewhat more stable than MSISDN (still an issue when subscriber loses his SIM).

These points suggest important properties needed for Hub Business Key:
  • Business Key should be immutable
  • Business Key should come from one Master Data Source
  • Business Key may not always be well recognized by business users, especially not at the sacrifice of immutability.
So here, we decide to choose as Business Key, the Subscriber-Id having these properties, and maintained in our Telco Billing platform (Master data source).

Now, how should we model the other mutable Keys?  They will likely be needed while we integrate other OLTPs platforms which rely on them.   To be able to map transactions, links coming from these others platforms, we need to record the state of the subscriber mutable keys (ex. MSISDN) according to the chosen Hub key.   As such, we propose to hook them up through the use of dedicated Satellite giving their key values at any point in time (given a Business Key).
Fig Data model around the Subscriber Hub


→Satellite Design Criteria
When it is time to model Satellites around each Hub we have quite a lot of freedom.   We can think of a spectrum modelling style where we have at one extreme One Satellite per Hub, and at the other extreme One Satellite per Attribute (from completely denormalized to highly normalized).

We can consider these criteria for modelling guidance :
  • Rate of Change (group attributes of similar life-cycle)
  • Types of Data (group attributes of similar nature)
  • Record source system (group attributes by their source) &nbsp

However, I would add another important criteria: Clarity and Importance.  It is well accepted that some attributes play a much more crucial role than others.  Take for examples, Subscriber which are commonly used to derive important downstream semantics, like business logic, segmentation, grouping, etc (compare these with Attributes such as Address-data like Street Door-no).

So for these crucial Attributes, we model them in their own separate Table so that we can clearly spot them while browsing our DB schema metadata (see ex. SAT_SUB_STATUS in figure 2).  And more importantly, mixing attributes will lead to additional records (due to historical change tracking).   Our satellite will time slice against attributes of interested, and thus optimized for faster query.


→Call or Network Event
How should we model CDR’s data?   Two issues arise when one wants to model these :

  1. Business keys that can possibly identify each record not straightforward (complex and also prone to changes in time)
  2. High data volumetry and burst rate may impose additional constraint on modelling

Can we classify a Call (Network event) as being a recognized business entity?  It has clearly some characteristics, however it cannot easily be identified through limited business-keys.   One could imagine that the following key combinations:  1) Calling MSISDN, 2) Called MSISDN and 3) Event_start_date.

These seem to satisfy the uniqueness criteria (at least for scenarios such as receiving voice mail at same time as making another phone call, etc...), but things get much more complicated now with the introduction of smart phone and their multitude protocol and access mode (HTTP, email, VoIP, VPN, ftp, Peer-to-Peer, TV streaming...).   So to me, the Network Event cannot be modelled as a Hub, but rather as a pure Link.

Now, concerning data volumetry & burst rate.  If we model these as Link and normalize all contextual data into numerous Satellites, we end up with many join operating on extremely large tables !  Furthermore, we can’t even take advantage of star join structure (with much smaller dimension tables having bitmap indexes surrounding the big fact table), as all satellites have the exact same size as the Links tables.  In CDR’s context data are merely static, so there should not be any need to design Satellite to keep track of change in time.

 SO here I'd argue we can loosen Link modeling principle and design what is called a Transaction Link, I think it makes perfect sense for our CDR’s scenario, as illustrated below:

Fig Network Events are modelled as a Transaction Link 
Also worth mentioning is that most relationships are optionals, this is typically true when integrating multiple source systems (rated calls from billing, unrated from mediation, etc.).  Depending on the event context, and there may or may not be a Rating_plan (unrated calls like incoming have none), Cell-id, Called/Callee are both required only for more traditional calls/sms, Cell-Id (caller may be roaming outside telco network), etc.    I’m not sure we should model these missing relationships as done with dimensional modelling (specific exception records exist in dimensions to avoid the missing Null FK’s).    Here, I’d recommend to simply leave the Null FK which reflect the way things are recorded at source system.


→Relationship with lifespan (temporal-variant)

DV models Link as (timeless) existence of a relationship, hence they record any association as seen from source, similarly as we do with Hub key.    However, some relationships are meant to have limited lifespan scope.   It then becomes critical to keep track of this lifetime as it validates the time period the relationship holds true.  Not much is said about these.... could however find the following (from this cheat sheet online resource):
"We can end-date links using an ‘exists’-attribute in a satellite"

Therefore, we propose to create a Validity lifespan Satellite data around the Link entity to keep track of time period.   This is similar to contextual data around Hub which are always “ time-based”, except now these time-based is no longer metadata but rather functional, i.e. correspond to contextual data of the Links (in fact for this particular Link, no info is conveyed except the relationship itself and its time period validity).   In the Validity Satellite, the attribute Valid-from and Valid-to must correspond to time period, however this rule is not enforced here in case the Billing source-system its business rule in future (subs only has one Rating Plan at any time):

Fig Link between Subs and RatingPlan with limited timespan

→New Hub or Satellite attribute 

What justify promoting a single Attribute into a recognized and separate Hub entity?  In theory, we learnt that Hub must be an entity recognized in the organization with clear business key(s).  However, there are quite a few attributes that have these characteristics: Retail plan, Account type, Promotion, Campaign, etc.... identifiable code associated to them).  So I believe we need clearer Design criteria providing guidance to determine what can remain simple Attribute vs what needs to be promoted as a new Hub.  I propose these simple criteria:

  1. Pervasiness: Does the Attribute appear in many different source systems/contexts. If yes, then consider the Hub so that we minimize the negative impact of having non-standardized and non-conformed codes spread throughout the EDW data model. 
  2. Relationship: Can the Attribute be used in other relationship than its unique associated Hub. If yes, then consider the Hub so that we can capture these relationship and association elsewhere through the standard way : Links. 
  3. Evolution: Maybe, a data source newly integrated includes an important business concept still only used, for now, in isolation. If yes, consider the hub as newly fashionable concepts tend to proliferate quite rapidly these days...


Regarding these new design criteria, it is easy to see why an attribute like Retail-Plan (the identifiable code), was modelled as a Hub, since it is frequently used in many contexts (relate to Subscriber, relate to Campaign, can be found in other system, etc.).  However, one could state that Address have same characteristics, so that it could as well be a Hub.... yes but unfortunately it rarely meets the 1st Hub design criteria: unique and easily identifiable Business key (unless your a shop specializing with addresses, do you have available in your system a clear and unique Business key for Addresses, most likely not?).

Martin