Wednesday, July 08, 2015

Anchor model + Data vault

I already discussed about data modelling techniques promoted by Data Vault (DV) and Anchor Modelling (AM) approaches.   Now what if I were to borrow ideas from both techniques in order to come up with a different model.  This post describes exactly this,  with the aim to improve on somewhat contradicting criteria:  flexibility and coherence!

Note that this proposal is subjective and based on my preference and experience in data warehousing architecture.  At the end, the resulting data model could go either way as it would yield very close solution.  The fact that both have reached similar modelling techniques while originating from independent and different background is a testimony of their sound principles.

This post uses terms from both DV and AM to name data model primitives according to the relevant context.  For anyone not familiar with both context, here is the name DV vs AM terms matching:

Hub       =  Anchor   
Satellite =  Attribute
Link      =  Tie      
Ref       =  Knot     


1. Naming convention

For naming, I'd adopt the easier and more friendly DV style.  Although AM proposed a completely formalised and rigid convention, I find expressiveness and clarity to be important virtues when it comes to naming things.   AM naming convention is good with objectives like full automation and avoiding collision, but it turns out that the use of mnemonic and descriptive text result in long and cryptic column and table names.   DV style is quite soft in this respect with a simple annotation (suffix or prefix) of table based on their primitive type.

Preference:   DV (favour simplicity over rigidity)


2. Audit / Metadata information

In this respect, I'd rather use AM approach that cleanly separate all ETL-related metadata from real data.  So each data element have a FK referring to its metadata stored in separate table (for things like data source, load timestamp, job details, ...).   Mixing metadata & data blurs the line between the two, and could mislead users to confuse temporal validity of attribute  (when a new value became valid vs when did we learn about this new value).

Preference:   AM (get the metadata out of the model)

3. Reference data 

Reference data in DV seems to be more of an after-thought than a valid 4th model type.   Ok this is a subjective opinion, but in AM Knots are official primitive type that even impact the name of the underlying primitive (ex. a knotted-attribute, a knotted-tie? ).   The way I see it, there is a good motivation to account for low/fix cardinality and immutable descriptive reference data.  

Preference:  AM (make reference data first-class citizen)

4. Temporal Relationship 

Links in DV are multi-way relationship identifiable through the combination of all referred Hubs. These are recorded as unique tuple of FK's with the addition of artificial surrogate-key acting as the PK.  In contrast, AM's relationship (Tie) is uniquely identified by some set of Anchors, but in addition, we can have optional non-identifiable Hub (or Knot) for which we need to keep historical changes.  As eluded in previous post, this adds the advantage of supporting temporal relationship directly in Tie, however, with the cost that Tie can no longer support over-hanging Attributes.

I'd argue for increased flexibility: let's support both styles and apply the most suited one for our use-case.   For ex, let's apply DV style when we need to keep track of attributes inherent of multi-way relationship.  In other words, when there is no single Hub that uniquely identify the relationship, avoid creating an artificial Anchor just to support Attributes as typically done in AM.   In contrast, let's leverage inherent temporal features of AM style when relationships go through various mutable state that we need to keep track of, or when the relationship has a unique Hub identifier.


Preference:  DV-AM (allow for more flexibility with Relationship)


5. Time Point or Time Segment 

When we need to keep track of Attributes changes, AM 6th NF dictates no data redundancy such that only a single time point is stored.   In DV, the full time segment is stored which requires an "update" operation during loading to close preceding value.   Advantage of AM is the more elegant and optimised "insert-only" strategy, while DV favours the cleaner and easier query logic and execution.   Here, I 'd say either approaches because it is mostly based on the physical DB engine capability.  I then make the DV time segment not mandatory but optional:  only due to constraint originating from your DB engine implementation.

(I like this idea that every batch load will only bring new values, but ...)

PreferenceAM (use a single time point if you can afford it)


6. Business Key  

In DV, Business key (aka natural key) are considered immutable and stored along the Hub data.  No such assumption is done with AM, and arguably this offers better longterm flexibility (see my previous post).  Here I definitely adopt the AM style and use a dedicated Natural-key stored in an external Attribute table, as it accounts for more situation down the road....  alternate nat-key replacement, nat-key temporal mutability, etc..

There are also cases for weak-entity (i.e. entities having PK coming from other Anchor) ... also discussed about nat-key View and .

PreferenceAM (most things are mutable, so avoid immutability assumption for business key)


7. Satellites or Attributes normalised form   

In AM, 6th NF imposes a very strict rule: 1 Attribute = 1 Table.  I clearly see the advantage of this for temporal-type attributes.  However, I'd rather use the more flexible DV approach where the rule is more relax here...

Aside from the table explosion issues, I see two arguments in favour of regrouping some attibutes à la DV style:

  • Some Attributes are not meant to be historised.  (for ex, most transactions are atomic and immutable, so that change in state can only happen in a subsequent or corrective transaction)
  • Some Attributes are purely functionally dependent on other attribute (ex. a code has often a descriptive tag attached to it) 
PreferenceDV (favour flexibility and allow for one or more attributes) 


Martin



1 comment:

Unknown said...

data modelling techniques promoted by Data Vault (DV) and Anchor Modelling (AM) approaches by Layman Learning - Data Modelling Training Online – LaymanLearning
MOB: +91-741-626-7887 - Website : www.laymanlearning.com || hr@laymanlearning.com ||