Saturday, March 08, 2014

Anchor modeling (part-2)


In previous post, I describe the Anchor modeling approach.  Here I will attempt to comment on its merit and providing a very small example of a data model .

My take on it:

Although, I never implemented AM on a large scale project myself, I can certainly see a few points that are very attractive at first glance.  Here are some that quickly come to mind :
  • The model prescription really goes up to the 6NF (except when we decide to include multiple non-identifying Anchor FK or knotted attribute into a single Tie).  This 6NF implies that no other attributes are stored along with the Anchor surrogate-key, not even the natural-key(s).  Potential benefits are :  
    • Improved maintenance when dealing with evolutive or unstable natural-key.  This follows the principle of model extension where we can accommodate new or change of natural keys in the future.  
  • The modelling prescription details specific and strict rules .
    • Advantage: This reduces the likelihood of our model to go “wrong”.   As a data modeller you are offered much smaller degrees of freedom, at least after you have defined Anchors and Ties.  Modelling Tie may be more flexible, as it seems, you are free to add more or less non-identifying keys in the Relationship. 
  • The model guideline recognises the importance of State often describing relationship (these sort of state are always mutually exclusive and exhaustive).  
    • Advantage: Tie can be Knotted for storing this data in a straightforward way with no need to come up with external structure besides the Knot.   In DV, you’d need external Satellites for storing the attributes and their time-segment.  This feature is convenient as in real-world project we often see this pattern where relationship goes through some form of lifecycle changing state.  
    • Disadvantage:  The flip side of this is that once your relationship is built like that, then it becomes less flexible and can hardly accommodate new stuff...In this respect, DV Links modelled as immutable intersection of Keys offer better longterm flexibility.
  • The model guideline also recognise the importance of having at most one key, called Role, in the relationship outside the Tie key identifiers.  
    • Advantage: For Historical Tie, this makes explicit what changes have triggered an update in the relationship.
  • The model guideline recommends keeping Tie’s width small (#of Roles).   In practice, we should try breaking down larger Tie into smaller constituents.  This helps in regard to : 
    • handling of late arrival facts (or asynchronous) which would delay the capture of the relationship as recording can only happen once all roles are known;
    • decreased data redundancy generated from historised relationships with frequent changing state/property.  In this scenario, lots of nearly identical rows arise since all roles, except the non-identifying state/property role, stay identical.
    • increased stability:  larger relationship are more likely to become deprecated rendering your AM model less evolutive with more frequent replacement of existing Tie with new ones.
  • The Knot is a primitive modeling construct holding important referential data:
    • Advantage: static set text values are easily recognised.  We can avoid disseminating these text values, typical of lesser normalised structures, and thus reducing the data redundancy throughout our data model.
  • The model guideline proposes a clear separation between metadata and data.  Auditing info is stored and maintained externally from data into their own metadata structures.  Metadata is referred to by FK pointing to a global Audit table in a standard and global way.   
    • Maintaining data separate from metadata leads to cleaner data structures and less chance of misinterpreting timestamp data fields.

However an AM implementation has its drawback, the most important one being that we probably need a complete toolset and a separate access layer to help us manage the explosion of data structures (table, views..).  No one would want to interact directly at the physical level any real size AM implementation.

Other important issues could come from DB vendor's limitation which may not support some functions required (ex. Table Elimination is almost mandatory for query optimisation).   Bottom line, you certainly need a lot of experiment before leveraging this technique, and see if the additional complexity can be managed and mitigated in a large scale implementation.


A small Example:

I conclude this post by presenting a small example.   Using the available modeling tool, you can quickly start creating your own model.  So let’s imagine a DB used to collect any info, statistics, indicators/ metrics that are produced about countries and cities in the world.   As these data would need to be refreshed periodically it is important to capture the history as well as sourcing, format, and other meta-info.

The diagram model (each entity type has dedicated appearance) presented next, gives the overview of the data model.  Red squares are used for representing Anchor, circle for Attribute, rounded square for Knot while grey diamond-shaped represents Tie.   Optionally, we may choose to keep Historisation for Tie or Attribute, in which case we have outline represented as double-line.

General view of a simple AM model



The naming and mnemonic convention is used to automatically provide physical tables names (as well as views and functions).    We can toggle between mnemonic or real name in the diagram.

We see below the two main Anchor: City and Country naturally linked through a Tie, as well as with a few other Anchors tied together.
Model zoom-in around Hubs Country and City


Let’s suppose that during the lifetime of our Data repository, we had decided to keep as natural-key the ISO ALPHA-2 (2-letter code).  This was used as the country look-up with the help of Natural-key View.  However later on, we realised that a  lot of data sources actually used the ISO ALPHA-3 (3-letter code) instead as country identifier.   No problem here, our AM model supports transparently the addition of any sort of Attributes.  The only impact is that we now have two Natural-key View to choose from for the look-up.

We later discovered with disbelief, that our natural-keys are NOT immutable… quite annoying for a database key.   As always, this seemed easy to know in hindsight, but at the time who could have guessed that our ISO standard key will betray us.   Our data model must then allow for natural-keys evolution in time!  Again no problem, our AM model has “Historisation by Design” built-in, and the only impact is the addition of a new field for temporal validity inside the natural Attribute tables.  We would also need a slight change to the "Natural-key View"  to accommodate for the time point entry necessary to manage the new look-up logic.

Important note: had we kept our natural-key(s) along the surrogate, these evolutions would have involved breaking existing elements of our model!  This illustrates the benefit of higher normalisation with AM.

Conversely, from our small business case, we clearly see where the 6th NF could cause us real harm… If we start adding new each indicator as one Attribute around the Country and City (and historized as each can be updated in future), there will quickly be an explosion of tables making the entire data model unsustainable!!

To avoid this pitfall we make use of data model generalization principles and create more abstract entities.  So in place of adding new tables to store values for each new indicator, we turned these tables into rows.   In our simple scenario, this becomes possible by defining new abstract entities :  Indicator and the Indicator-instance.      Each Indicator has a clear identity (Anchor: Indicator) and is described by its name, its definition, its source (i.e. its surrounding Attributes).   The Indicator-value is represented as an anchor (IndInstance), whose identity correspond to the Indicator-value given for a particular country (or city) and at a given year.   One  Attribute holds the exact value while a Tie is needed to tie everything together (the indicator, the country and the year-period).

Model entities for Indicator and their instance values


More meta-info could be used in order to locate the information sourcing on the web, and other info related to format, document type.


Martin


No comments: