Anchor Modeling (AM) is another approach well suited to modeling the integrated/consolidated data layer of an enterprise data warehouse. AM was originally created by Olle Regardt, with its formalisation done by Lars Rönnbäck. It is an open-source database modeling technique accompanied by an online modeling tool, both available under an MIT license. Most information is maintained on the Anchor Modeling home page.
The approach has some theoretical background, with ideas borrowed from the information modeling world. More specifically, it is inspired by Object Role Modeling (ORM), a method developed by Dr. Terry Halpin. ORM, not to be confused with Object-Relational Mapping, is a modeling notation designed to help non-experts do conceptual modeling of database schemas.
This theoretical formalisation has also produced a very strict and prescriptive methodology, compared with some of the more ad-hoc approaches common in data warehousing (often heuristics developed through time and experience).
It also offers some implementation examples, but at this time the only complete implementation targets the Microsoft SQL Server engine.
Comparison to Data Vault
I will not go into detail here about what Anchor Modeling is. Instead, I will describe the similarities it shares with Data Vault (DV), since both aim to decompose any data model into smaller and more basic constituents. For a detailed description, please refer to the excellent article Anchor modeling - agile information modeling in evolving data environments written by its creators.
1. Anchor ≈ Hub
- Both the Anchor and the Hub store Entities having a strong identity. There is an important difference, though: in AM, an Anchor stands completely on its own, i.e. it stores only the surrogate key (a.k.a. technical key) and keeps the natural key (a.k.a. business key) in a separate Attribute table.
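To make this difference concrete, here is a minimal sketch in Python/sqlite3. Table and column names are my own invention, loosely following AM naming conventions; neither standard mandates these exact layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# AM Anchor: surrogate (technical) key only, plus load metadata.
# The business key lives in its own Attribute table.
cur.execute("""
CREATE TABLE AC_Customer (
    AC_ID    INTEGER PRIMARY KEY,   -- surrogate key, nothing else
    Metadata INTEGER NOT NULL       -- batch/load reference
)""")
cur.execute("""
CREATE TABLE AC_NUM_Customer_Number (
    AC_ID    INTEGER NOT NULL REFERENCES AC_Customer (AC_ID),
    AC_NUM   TEXT    NOT NULL,      -- the natural (business) key
    Metadata INTEGER NOT NULL,
    PRIMARY KEY (AC_ID)
)""")

# DV Hub: surrogate key AND business key stored together in one table.
cur.execute("""
CREATE TABLE Hub_Customer (
    Customer_SQN INTEGER PRIMARY KEY,   -- surrogate key
    Customer_Num TEXT NOT NULL UNIQUE,  -- business key, inside the Hub
    Load_DTS     TEXT NOT NULL,
    Rec_Src      TEXT NOT NULL
)""")
```

The contrast is simply where the business key lives: beside the surrogate in the Hub, in a dedicated Attribute table in AM.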
2. Attribute ≈ Satellite
- The Attribute is similar to the Satellite notion in DV. Both add contextual and descriptive data, which are inherently time-dependent. However, to preserve 6NF temporal integrity, AM requires that each Attribute land in its own dedicated table. DV is more relaxed here and accepts any number of attributes in the same Satellite.
Multi-attribute Satellites in DV imply that we must do a row comparison between the active record and the previous one to know which attribute has changed. However, I'd argue that the AM rule of 1 Attribute = 1 table is too strict and could be relaxed in some exceptional cases. One example is attributes having a strict functional dependency on a master attribute; think of a user-friendly text attribute providing descriptive info for a technical short code. Another is immutable attributes guaranteed to remain so during the lifetime of the entity they describe.
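As a sketch of the single-attribute, historised pattern (again in sqlite3, with illustrative names of my own): a change is recorded as a plain insert, and the value at any point in time is recovered by query rather than by comparing wide rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# AM historised Attribute: one value per table, ValidFrom is part of
# the primary key, so every change is just a new row (insert-only).
cur.execute("""
CREATE TABLE AC_NAM_Customer_Name (
    AC_ID     INTEGER NOT NULL,
    AC_NAM    TEXT    NOT NULL,
    ValidFrom TEXT    NOT NULL,
    PRIMARY KEY (AC_ID, ValidFrom)
)""")
cur.executemany("INSERT INTO AC_NAM_Customer_Name VALUES (?, ?, ?)",
                [(1, "ACME Ltd", "2020-01-01"),
                 (1, "ACME Plc", "2023-06-01")])

# Point-in-time lookup: latest value whose ValidFrom <= the asked date.
cur.execute("""
SELECT AC_NAM FROM AC_NAM_Customer_Name
WHERE AC_ID = 1 AND ValidFrom <= '2022-01-01'
ORDER BY ValidFrom DESC LIMIT 1""")
name_2022 = cur.fetchone()[0]   # -> "ACME Ltd"
```

With a single value column there is nothing to diff: detecting a change means comparing one column, not replaying a whole Satellite row.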
3. Tie ≈ Link
- The Tie is similar to the Link notion in DV, whose role is to store relationships between Anchor/Hub entities. However, a Tie does not support attached Attributes the way a Link has its own Satellites. This limitation is due to the fact that historisation is built into the Tie, as it is for the Attribute. Switching on historisation adds a Valid-from time point to the Tie's primary key, and consequently any Attribute referring to the Tie would be left dangling (or would require duplicating the Attribute row, in violation of the 6NF rule).
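The key-shape issue can be seen in a small sketch (sqlite3, illustrative names): once ValidFrom joins the primary key, every state change of the relationship mints a new key value, so there is no stable row for an attached Attribute to point at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Historised Tie: the relationship between two Anchors, with ValidFrom
# as part of the composite primary key.
cur.execute("""
CREATE TABLE TIE_Customer_Address (
    AC_ID_Customer INTEGER NOT NULL,
    AC_ID_Address  INTEGER NOT NULL,
    ValidFrom      TEXT    NOT NULL,
    PRIMARY KEY (AC_ID_Customer, AC_ID_Address, ValidFrom)
)""")
cur.executemany("INSERT INTO TIE_Customer_Address VALUES (?, ?, ?)",
                [(1, 10, "2020-01-01"),    # customer 1 at address 10
                 (1, 20, "2022-05-01")])   # customer 1 moves to 20

# An Attribute hung off this Tie would need the full three-column key,
# including ValidFrom, as its foreign key -- so every historised change
# of the Tie would strand or duplicate the Attribute row.
```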
4. Knot ≈ Reference
- The Knot is similar to the Reference notion in DV. Both hold immutable data of low cardinality. They normally represent a set of fixed categories, codes or static attributes, and are normalised into separate tables to avoid update anomalies and duplicated string values throughout the database. The main difference is that a Reference in DV can store multiple strings, whereas in AM only one string code per Knot table is allowed. It seems, though, that Knots are first-class citizens, whereas in DV the Reference feels more like an ad-hoc, optional structure. Knots also enrich the semantics of whatever references them: for example, an Attribute referring to a Knot becomes a knotted Attribute, and a Tie becomes a knotted Tie, both being either static or historised.
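A minimal Knot plus knotted Attribute, sketched in sqlite3 (names are mine, following the one-code-one-string-per-Knot rule described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Knot: immutable, low-cardinality codes in their own table;
# exactly one identifier and one descriptive string per Knot.
cur.execute("""
CREATE TABLE KN_GEN_Gender (
    KN_GEN_ID INTEGER PRIMARY KEY,
    KN_GEN    TEXT NOT NULL UNIQUE   -- the single string value
)""")
cur.executemany("INSERT INTO KN_GEN_Gender VALUES (?, ?)",
                [(1, "Male"), (2, "Female")])

# Knotted Attribute: stores the Knot's id instead of repeating
# the string for every entity row.
cur.execute("""
CREATE TABLE AC_GEN_Customer_Gender (
    AC_ID     INTEGER NOT NULL,
    KN_GEN_ID INTEGER NOT NULL REFERENCES KN_GEN_Gender (KN_GEN_ID),
    PRIMARY KEY (AC_ID)
)""")
cur.execute("INSERT INTO AC_GEN_Customer_Gender VALUES (1, 2)")
```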
A small digression on Time
Temporal aspect of AM: the notion of Time in AM is well framed semantically, with clear definitions. It supports three notions of Time: changing time (when a value becomes true in the modeled domain), recording time (when the information enters the database), and happening time (when an event occurs).
This contrasts with DV, where metadata info sits alongside the data. Although the metadata timestamp “Load_Start” is not supposed to carry any functional meaning, it is confusing that most query/report examples given in the DV documentation use this metadata timestamp to return the data valid at a given point in time! The time when data is loaded is rarely correlated with the time of data validity. The two can coincide on occasion, e.g. when we have no other alternative (see case 3 below). The way I see it, when we need to determine the right functional “Valid-from”, we can face three different situations:
Main advantages
Most advantages of the AM approach result from its highly normalised structure. This offers benefits such as:
- Simpler data loading: only cheap “insert” operations, no expensive “update/delete”
- Fault recovery is easier, since every row carries a unique batch-load trace and can be deleted in exceptional situations (e.g. when erroneous)
- Strict "No Null" policy: null values never appear, eliminating difficulties with interpretability, indexing and other issues
- Non-persistent elements are never removed but are flagged with a Knot indicating their state of persistence
- Maintenance is overall improved and simplified
- Data model evolution only involves extension of previous version
- New content always involves the incremental addition of Anchor / Tie / Attribute / Knot structures
- These new structures may sometimes render some existing structures obsolete; those can simply be left as-is with no further refresh.
- Potentially higher performance (this one is trickier, as the higher normalisation can also penalise performance, but the authors highlight a few valid points worth mentioning)
- Narrow queries (hitting a small number of attributes) can perform very well with the aid of table elimination
- Storage is highly efficient, since practically no data redundancy exists; only a small number of indexes is actually needed, thanks to the use of clustered tables (a.k.a. index-organised tables with some DB vendors) and the narrowness of the tables. On this last point, it would be nice to see how a column-store architecture could be leveraged to exploit the unique AM schema.
- Data loading is practically free from locking and other concurrency issues, as only “insert” operations are processed.
- Reduced translation logic between representations: the same AM graphical notation is used for both conceptual and logical modeling
- Furthermore, the physical representation maps model entities directly onto tables, simplifying the abstraction levels or even eliminating the need for translation logic.
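To illustrate the narrow-query point above, here is a sketch (sqlite3, with made-up names) of a "latest" view that left-joins an Anchor to its Attribute tables. On an engine that performs table elimination, a query touching only one attribute lets the optimizer skip the untouched Attribute tables entirely, since a left join on a primary key cannot change the row count; the sketch only demonstrates the view shape, not any particular engine's optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One Anchor and two single-attribute tables, each keyed on AC_ID.
cur.execute("CREATE TABLE AC_Customer (AC_ID INTEGER PRIMARY KEY)")
cur.execute("""CREATE TABLE AC_NAM_Customer_Name (
    AC_ID INTEGER PRIMARY KEY, AC_NAM TEXT NOT NULL)""")
cur.execute("""CREATE TABLE AC_NUM_Customer_Number (
    AC_ID INTEGER PRIMARY KEY, AC_NUM TEXT NOT NULL)""")
cur.execute("INSERT INTO AC_Customer VALUES (1)")
cur.execute("INSERT INTO AC_NAM_Customer_Name VALUES (1, 'ACME Ltd')")
cur.execute("INSERT INTO AC_NUM_Customer_Number VALUES (1, 'C-001')")

# The "latest" view stitches the entity back together with left joins.
cur.execute("""
CREATE VIEW l_Customer AS
SELECT a.AC_ID, n.AC_NAM, m.AC_NUM
FROM AC_Customer a
LEFT JOIN AC_NAM_Customer_Name  n ON n.AC_ID = a.AC_ID
LEFT JOIN AC_NUM_Customer_Number m ON m.AC_ID = a.AC_ID""")

# A narrow query: only AC_NAM is selected, so an optimizer with table
# elimination never needs to read AC_NUM_Customer_Number at all.
cur.execute("SELECT AC_NAM FROM l_Customer")
latest = cur.fetchone()[0]   # -> "ACME Ltd"
```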
Martin
2 comments:
The comparison is not very accurate, because the classification schemes of Anchor Modeling and DV differ considerably. Comparing them directly is not that helpful.
An Anchor can be a Hub or a dependent hub ('Anchor' Link). A Tie is an end-dated link (not supported in vanilla DV). Knots and References serve a similar purpose, but implementations differ wildly.
ORM is rooted in Fact Based Modeling (NIAM, FCO-IM, etc.).
Anchor Modeling is a dumbed-down version of this. It would be better to use ORM/NIAM to compare DV and AM (but both would first need to be translated to an ORM/NIAM scheme). See http://dm-unseen.blogspot.nl/2013/11/one-data-modeling-approach-to-rule-them.html
Yes, a thorough and accurate comparison was not my goal. I simply presented Anchor Modeling using terms originating from the better-known Data Vault. There seem to be ideas either side could draw inspiration from: for example, the "Tie is an end-dated link" (actually not quite true, as a Tie can also be non-historised) offers a simple pattern for tracking state-based relationships. Thanks for sharing the link, I'll have a look.