Friday, January 10, 2014

What is Anchor Modeling? (part-1)



Anchor Modeling (AM) is another approach well suited to modeling the integrated/consolidated data layer within an enterprise data warehouse.  AM was originally created by Olle Regardt, with the formalisation done by Lars Rönnbäck.  It offers an open source database modeling technique accompanied by an online modeling tool available under an MIT license.  Most information is maintained at the AM home page.

The approach has some theoretical background, with ideas borrowed from the information modeling world.  More specifically, it is inspired by a method called Object Role Modeling (ORM) developed by Dr. Terry Halpin.  ORM, not to be confused with Object-Relational Mapping, is a modeling notation designed to help non-experts do conceptual modeling of database schemas.

The theoretical formalisation has also helped provide a very strict and prescriptive methodology, compared to some more ad-hoc approaches used in the data warehousing world (often heuristics developed over time and with experience).

It also offers some implementation examples, but at this time the only complete implementation targets the Microsoft SQL Server engine.


Comparison to Data Vault

I will not go into the details of what Anchor Modeling is here.  Instead, I will describe the similarities it shares with Data Vault (DV), since both aim to decompose any data model into smaller and more basic constituents.  For those interested in a detailed description, please refer to the excellent article Anchor modeling - agile information modeling in evolving data environments written by its creators.


1. Anchor ≈ Hub

  • Both Anchor and Hub store Entities having strong identity.  There is an important difference: in AM, an Anchor stands completely on its own, i.e. it stores only the surrogate-key (aka technical key) and keeps the natural-key (aka business-key) in a separate Attribute table.
IMO, this offers additional flexibility, especially when natural-key(s) are chosen based on operational systems.  These tend to have a shorter lifecycle than the EDW, whose lifetime should be, at least what the EDW team hopes for, infinite!  By keeping natural-key(s) outside the Hub, we avoid making any assumption as to what the natural-key(s) and their data type(s) will be in the future.  AM proposes using a Natural-key View to handle the loading and look-up logic, so any change in natural keys would only involve updating this View and adding a new Attribute table, as sketched below.
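
To make this concrete, here is a minimal sketch, in SQL Server syntax (the platform of the only complete implementation), of an Anchor with its natural key externalised into a static Attribute table, plus the look-up View.  Table and column names are my own illustrative choices, not the naming convention generated by the AM tool:

    -- Anchor: surrogate key only, no natural key
    CREATE TABLE CustomerAnchor (
        Customer_ID INT NOT NULL PRIMARY KEY
    );

    -- Natural key kept in its own static Attribute table
    CREATE TABLE CustomerNaturalKey (
        Customer_ID   INT NOT NULL PRIMARY KEY
                      REFERENCES CustomerAnchor (Customer_ID),
        Customer_Code VARCHAR(20) NOT NULL UNIQUE  -- business key from the source system
    );
    GO

    -- Look-up View used by the loading logic; if the business key ever changes
    -- shape, only this View and a new Attribute table need to be touched
    CREATE VIEW CustomerKeyLookup AS
    SELECT a.Customer_ID, k.Customer_Code
    FROM   CustomerAnchor a
    JOIN   CustomerNaturalKey k ON k.Customer_ID = a.Customer_ID;
    GO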

2. Attribute ≈ Satellite

  • Attribute is similar to the Satellite notion in DV.  Both add contextual and descriptive data which are inherently time-dependent.  However, to preserve 6NF temporal integrity, AM enforces that each Attribute lands in a dedicated table.  DV is more relaxed here and accepts any number of attributes in the same Satellite.
  AM supports either temporal Attributes (referred to as historised in the AM modeling front-end tool) or static, immutable Attributes where no temporal validity period is required.  DV modeling guidelines assume historisation by default and require storing the full time segment with two distinct time points: Valid-from and Valid-to.  Again, the higher-normalised nature of AM restricts us from storing the redundant Valid-to time point (which equals the subsequent record's Valid-from).

  The multiple attributes in DV imply that we must proceed with row comparison to know which one has changed between the active record and the previous one.  However, I'd argue that the AM rule of 1 Attribute = 1 table is too strict and could be relaxed in some exceptional cases.  One example is attributes having a strict functional dependency on a master attribute; think of a user-friendly text attribute providing descriptive info for a technical short code.  Other examples are immutable attributes guaranteed to remain so during the lifetime of the entity they describe.  A historised Attribute is sketched below.
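
For illustration, here is a hedged sketch of a historised Attribute table: one attribute per table, with only a Valid_From time point and no Valid_To, per 6NF (names again illustrative):

    -- Historised Attribute: a single attribute, Valid_From only (no Valid_To)
    CREATE TABLE CustomerAddress (
        Customer_ID INT NOT NULL
                    REFERENCES CustomerAnchor (Customer_ID),
        Address     VARCHAR(200) NOT NULL,
        Valid_From  DATETIME2 NOT NULL,
        PRIMARY KEY (Customer_ID, Valid_From)
    );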

3. Tie ≈ Link

  • Tie is similar to the Link notion in DV, whose role is to store the relationships between Anchor/Hub entities.  However, a Tie does not support over-hanging Attributes the way a Link has its own Satellites.  This limitation is due to the fact that historisation is built in for Ties just as for Attributes.  Switching on historisation adds the Valid-from time point to the Tie's Primary-key, and consequently any Attribute referring to the Tie would be left dangling (or else require duplicating the Attribute row, in violation of the 6NF rule).
  A drawback of AM follows: any natural many-to-many association having inherent attributes will involve the creation of a "tied anchor".  DV is more flexible here, as it allows attaching Satellites to a Link.  In DV the time period is not built into the Link structure, so we are left with constructing the history of the relationship ourselves by adding a time segment in an external Satellite.  This is required when we want to keep track of the time-period validity of any relationship.

  A Tie may also carry some form of attribute, as when we deal with a relationship holding state/role information.  Under this circumstance, one can add a Knot attribute to the Tie, making it a "Knotted Tie" that can now support historical change of its state/role (the Knot carries this state information).  With this scheme, you should have at most one Knot attribute outside the relationship identifier (or, alternatively, one extra non-identifying role inside the Tie) for keeping track of the state or role of the relationship.  A sketch follows.
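
As a hedged sketch, a historised knotted Tie might look like this (an AccountAnchor and a MembershipRole Knot are assumed to exist; Knots are covered in the next point):

    -- Historised, knotted Tie between Customer and Account; the Knot carries
    -- the role of the relationship, and Valid_From joins the primary key
    -- once historisation is switched on
    CREATE TABLE CustomerAccountTie (
        Customer_ID INT       NOT NULL REFERENCES CustomerAnchor (Customer_ID),
        Account_ID  INT       NOT NULL REFERENCES AccountAnchor (Account_ID),
        Role_ID     TINYINT   NOT NULL REFERENCES MembershipRole (Role_ID),
        Valid_From  DATETIME2 NOT NULL,
        PRIMARY KEY (Customer_ID, Account_ID, Valid_From)
    );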

4. Knot ≈ Reference

  • Knot is similar to the Reference notion in DV.  Both hold immutable data of lower cardinality.  They normally represent a set of fixed categories, codes or static attributes and are normalised into separate tables to avoid update anomalies and the duplication of string values throughout the database.  The main difference is that a Reference in DV can store multiple strings whereas, again, in AM only one string code per table is allowed.  It seems, though, that Knots are first-class citizens, whereas in DV a Reference looks more like an ad-hoc optional structure.  Knots also enhance the semantics of whatever references them; for example, an Attribute referring to a Knot becomes a knotted Attribute and a Tie becomes a knotted Tie, both being either static or historised.
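
A minimal sketch of a Knot and a static knotted Attribute referring to it, with illustrative names:

    -- Knot: immutable, low-cardinality code/value pairs
    CREATE TABLE GenderKnot (
        Gender_ID   TINYINT NOT NULL PRIMARY KEY,
        Gender_Name VARCHAR(20) NOT NULL UNIQUE
    );

    -- Knotted static Attribute: the value is a reference to the Knot
    -- rather than a string duplicated throughout the database
    CREATE TABLE CustomerGender (
        Customer_ID INT NOT NULL PRIMARY KEY
                    REFERENCES CustomerAnchor (Customer_ID),
        Gender_ID   TINYINT NOT NULL REFERENCES GenderKnot (Gender_ID)
    );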



A small digression on Time


Temporal aspects of AM:

The notion of Time in AM is well framed semantically, with clear definitions.  It provides support for three notions of Time:
1. Happening time.  Corresponds to when some events/transactions occurred: represented by specific Attribute(s) attached to an Anchor
2. Changing time.  Corresponds to when some Attribute/Tie (when historised) is valid: represented by the built-in "Valid-From" time point in the Attribute/Tie.
  1. Note that no redundant "Valid-To" is added to close the time interval, in order to comply with the 6NF requirement.  This avoids the nasty "update" during the data loading process and the potential risk of recording an invalid time period.  But it comes at the price of additional complexity when reporting the valid temporal attribute at a specific time (see the query sketch after this list).
3. Recording time.  Corresponds to when the data was recorded into our platform (i.e. loading metadata): represented by FKs referencing a particular log entry in metadata Audit tables.
Externalising the "Recording time" has the advantage of separating data from metadata: all timestamps found in the data structures correspond to functional dates, whereas metadata lives outside these data structures.
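
The query sketch below illustrates that extra complexity: with only Valid_From stored, returning the attribute value in effect at a given instant requires finding, per entity, the latest version at or before that instant (using the illustrative CustomerAddress table from earlier):

    -- Address in effect at @AsOf: for each customer, the row with the
    -- greatest Valid_From that is still <= @AsOf
    DECLARE @AsOf DATETIME2 = '2013-06-30';

    SELECT ca.Customer_ID, ca.Address, ca.Valid_From
    FROM   CustomerAddress ca
    WHERE  ca.Valid_From = (SELECT MAX(prev.Valid_From)
                            FROM   CustomerAddress prev
                            WHERE  prev.Customer_ID = ca.Customer_ID
                              AND  prev.Valid_From <= @AsOf);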

This contrasts with DV, where metadata info sits alongside the data.  Although the metadata timestamp "Load_Start" is not supposed to carry any functional meaning, it is confusing that most Query/Report examples given in DV documentation use this metadata timestamp to return the valid data at a given point in time!  The time when data is loaded is rarely correlated with the time of data validity.  This can happen on occasion, when we have no other alternative (see case 3 below).  The way I see things, when we need to determine the right functional "Valid-from", we can be faced with three different situations:
1. Dates exist explicitly at the source, so we load them as-is (typical of data having a specified lifespan with effective/expiration dates, etc.)
2. Dates do not exist explicitly at the source but can be deduced from some technical dates at the source, so we use these technical dates (a typical example is a source carrying LastStatusUpdate meta-info that lets us know when the data changed)
3. Dates neither exist explicitly at the source nor can be deduced from other technical dates, so one is left with the only option of using the loading time
Ideally, situation 3 is the exception rather than the rule in most EDW implementations; a loading sketch covering the three cases follows.
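
Here is one hedged sketch of how the three situations could translate into loading logic, assuming a hypothetical StagingCustomer table with an EffectiveDate column (case 1) and a LastStatusUpdate column (case 2), and reusing the CustomerKeyLookup View from earlier:

    -- Pick the best available functional date, falling back to load time
    INSERT INTO CustomerAddress (Customer_ID, Address, Valid_From)
    SELECT k.Customer_ID,
           s.Address,
           COALESCE(s.EffectiveDate,     -- case 1: explicit functional date
                    s.LastStatusUpdate,  -- case 2: technical change date
                    SYSDATETIME())       -- case 3: loading time, last resort
    FROM   StagingCustomer s
    JOIN   CustomerKeyLookup k ON k.Customer_Code = s.Customer_Code;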


     

Main advantages

Most advantages of the AM approach result from its highly normalised structure.  This offers benefits such as:
  • Simpler data loading, involving cheap "inserts" and no expensive "updates/deletes"
    • Fault recovery is more easily done, since all data rows have unique batch-load traceability and can be deleted in exceptional situations (when erroneous)
    • Strict "no NULL" policy: null values never appear, eliminating difficulties with interpretability, indexing, and other issues
    • Non-persistent elements are never removed but flagged with a Knot indicating their state of persistence
    • Maintenance is overall improved and simplified
  • Data model evolution only involves extension of the previous version
    • New content always involves the incremental addition of Anchor / Tie / Attribute / Knot structures
    • These new structures may sometimes render existing ones obsolete; we can then simply leave them as-is with no onward refresh
  • Potentially higher performance (here it is trickier, as the higher normalisation can also penalise performance, but the authors have highlighted a few valid points worth mentioning)
    • Narrow queries (hitting a small number of attributes) can perform very well with the aid of Table Elimination (see the sketch after this list)
    • Storage is highly efficient since practically no data redundancy exists, and only a small number of indexes is needed thanks to clustered tables (aka index-organised tables with other DB vendors) and the narrowness of the tables.  On this last point, it would be nice to see how column-store architecture could be leveraged to exploit the AM schema.
    • Data loading is practically free from locking and other concurrency issues, as only "inserts" are processed
  • Reduced translation logic between different representations: the same AM graphical representation is used for both conceptual and logical modeling
    • Furthermore, the physical representation maps model entities directly onto tables, simplifying the modeling abstraction levels or even eliminating the need for translation logic.
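
To give a feel for Table Elimination, consider a "latest version" View over the illustrative tables used throughout this post; when a query against it requests only the business key, the optimizer can skip the joins to Attribute tables it does not need (a sketch, not the View the AM tool actually generates):

    CREATE VIEW CustomerLatest AS
    SELECT a.Customer_ID,
           k.Customer_Code,
           ad.Address
    FROM   CustomerAnchor a
    LEFT JOIN CustomerNaturalKey k  ON k.Customer_ID  = a.Customer_ID
    LEFT JOIN CustomerAddress    ad ON ad.Customer_ID = a.Customer_ID
                                   AND ad.Valid_From  = (SELECT MAX(v.Valid_From)
                                                         FROM   CustomerAddress v
                                                         WHERE  v.Customer_ID = a.Customer_ID);
    GO

    -- Only Customer_Code is requested: the CustomerAddress join can be eliminated
    SELECT Customer_ID, Customer_Code
    FROM   CustomerLatest;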


Martin