Sunday, March 09, 2014

EAV vs AM


One important area of concern for database flexibility is the data model structure used for integrating new content.  This post will look at two completely different strategies for building model structures that can easily accommodate change (at least compared to the typical 3rd normal form, though not compared to the typical noSQL form).  These are:
  1. Use an Entity Attribute Value structure (EAV)
  2. Use an Anchor Modeling structure (AM)  
In some ways, these two structures are opposed in how they provide greater flexibility: EAV ingests new data content through its generic model, while AM uses a highly normalised model that ingests new content by adapting and growing incrementally without modifying existing structures.  Note that the term content here implies modifying the data structure and adding new entities, not simply adding new records, although this distinction is blurred in the case of EAV.

I'll discuss the pros/cons of these two here.


1) EAV structure


The Entity Attribute Value structure uses an open schema to store Parameter/Value pairs.  This provides extreme flexibility in dealing with an evolving scope of entities and parameters (here Parameter = Attribute).

This structure is good for: 1) highly sparse Attribute values; 2) Attributes that are highly variable and unknown in advance.
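To make this concrete, here is a minimal sketch of a generic EAV layout; the table and column names (entity, attribute, eav_value) and sample values are purely illustrative and not taken from any particular reference implementation.

-- Generic EAV layout (illustrative names): one row per entity/attribute/value triple
CREATE TABLE entity (
    entity_id   INTEGER PRIMARY KEY,
    entity_type VARCHAR(50) NOT NULL          -- e.g. 'Subject', 'Experiment'
);

CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,       -- e.g. 'species', 'weight'
    datatype     VARCHAR(20)  NOT NULL        -- validation is driven by metadata, not by the engine
);

CREATE TABLE eav_value (
    entity_id    INTEGER NOT NULL REFERENCES entity (entity_id),
    attribute_id INTEGER NOT NULL REFERENCES attribute (attribute_id),
    value        VARCHAR(4000),               -- everything stored as text (or one value table per datatype)
    PRIMARY KEY (entity_id, attribute_id)
);

-- Adding a brand new parameter requires no DDL change, only a metadata row:
INSERT INTO attribute (attribute_id, name, datatype) VALUES (42, 'sampling_rate', 'numeric');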


Pros:
  1. Fully evolutive: the data model and API layer can accommodate the addition of any new Parameter type with some clever use of metadata (only metadata must be changed)
  2. Efficient storage for highly sparse data
  3. Seems common in the neuroinformatics world (see the odML metadata standard)

Cons:
  1. EAV is more of an exception or anti-pattern in data modelling and is applicable only to a few limited use cases
  2. Requires a rich metadata infrastructure: data-related business rules are no longer enforced by the relational engine but by the metadata layer
  3. Higher complexity for data requests, ad-hoc queries (especially those cross-referencing multiple Attributes), bulk data extraction, etc., as sketched below
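To illustrate the last point, reconstructing a conventional row from EAV triples requires pivoting: one join or CASE expression per attribute of interest.  This sketch reuses the illustrative tables above and assumes values are stored as text.

-- Pivot EAV triples back into a row: one MAX(CASE ...) per attribute of interest
SELECT e.entity_id,
       MAX(CASE WHEN a.name = 'species' THEN v.value END) AS species,
       MAX(CASE WHEN a.name = 'weight'  THEN v.value END) AS weight
FROM entity e
JOIN eav_value v ON v.entity_id = e.entity_id
JOIN attribute a ON a.attribute_id = v.attribute_id
WHERE e.entity_type = 'Subject'
GROUP BY e.entity_id
-- filtering on an attribute means casting the text value and aggregating yet again
HAVING MAX(CASE WHEN a.name = 'weight' THEN CAST(v.value AS NUMERIC) END) > 100;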

Implementation Impact:

  1. Often needs a dedicated layer for managing data query complexity (e.g. ElasticSearch)
  2. A noSQL backend may be better suited (except for the metadata) unless there are clear requirements for concurrent transactional integrity and some form of isolation level
  3. When historisation is a requirement, the EAV must be enriched to support temporal attributes, increasing querying complexity further (as sketched below).  There does not seem to be any temporal field inside the EAV instance table of its typical reference implementation (see ycmi as an example)
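For point 3, a minimal way to add historisation (illustrative columns, not part of the ycmi reference model) is to make the validity period part of the value table's key; every read then needs a point-in-time predicate.

-- Historised EAV value table (illustrative): the validity start becomes part of the key
CREATE TABLE eav_value_hist (
    entity_id    INTEGER   NOT NULL,
    attribute_id INTEGER   NOT NULL,
    valid_from   TIMESTAMP NOT NULL,
    valid_to     TIMESTAMP,                   -- NULL = current version
    value        VARCHAR(4000),
    PRIMARY KEY (entity_id, attribute_id, valid_from)
);

-- "As of now" read for one entity/attribute pair
SELECT value
FROM eav_value_hist
WHERE entity_id = 1
  AND attribute_id = 42
  AND valid_from <= CURRENT_TIMESTAMP
  AND (valid_to IS NULL OR valid_to > CURRENT_TIMESTAMP);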



2)  AM structure

AM favors strong (and fine-grained) typing of the database schema while reducing a few symptoms common to rigid relational data models.  This is achieved by using a higher normal form (6NF).

The AM structure is good for: 1) integrating unforeseen and evolutive parts of the data model piece-wise; 2) offering a clear and semantically rich schema.
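A minimal sketch of what this looks like in DDL; the naming only loosely imitates the convention generated by the Anchor Modeling tool, and the Person entity and its attributes are purely illustrative.

-- Anchor: identity only (a surrogate key), nothing else stored with it
CREATE TABLE PS_Person (
    PS_ID INTEGER PRIMARY KEY
);

-- One table per attribute (6NF); a historised attribute carries its own validity column
CREATE TABLE PS_NAM_Person_Name (
    PS_ID     INTEGER      NOT NULL REFERENCES PS_Person (PS_ID),
    Name      VARCHAR(100) NOT NULL,
    ChangedAt DATE         NOT NULL,
    PRIMARY KEY (PS_ID, ChangedAt)
);

-- A static (non-historised) attribute: at most one row per anchor instance
CREATE TABLE PS_BTH_Person_BirthDate (
    PS_ID     INTEGER NOT NULL REFERENCES PS_Person (PS_ID),
    BirthDate DATE    NOT NULL,
    PRIMARY KEY (PS_ID)
);

-- Integrating new content later means adding more tables like these; existing tables are never altered.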

Pros:
  1. Allows for data model evolution through organic extension without revisiting existing model structure
  2. Historisation of all attributes is built-in without any compromise  
  3. Not penalised by highly sparse data (NULL never needed)  
  4. Produces models that are expressive and clear while carrying insight into the particular domain

Cons:
  1. Requires extending both the data model and the API layer with each new data model type
  2. May suffer from table explosion (mitigated with a virtual access layer)
  3. Needs specific query execution plan optimisation (table elimination)


Implementation Impact:
  1. Needs a DB engine with the necessary optimisations (PostgreSQL seems to meet some of the requirements, but this should be further validated; see the view sketch after this list)
  2. Assumes entities have a strong identity with available candidates for natural keys
  3. Could require applying some form of generalisation to be less sensitive to the addition of parameters (the equivalent of an ETL pattern when working in PULL mode)
  4. The data modeller needs enough insight and good data domain knowledge (the resulting data model actually looks similar to a domain ontology)
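To illustrate point 1 above and the virtual access layer mentioned in the Cons, here is a sketch of how a flattening view interacts with the optimiser, continuing the illustrative Person tables above; the join-removal behaviour (dropping a provably redundant LEFT JOIN on a unique key) is exactly what should be validated on the chosen DB engine.

-- A second static attribute, added later as a pure extension (illustrative)
CREATE TABLE PS_EML_Person_Email (
    PS_ID INTEGER      NOT NULL REFERENCES PS_Person (PS_ID),
    Email VARCHAR(200) NOT NULL,
    PRIMARY KEY (PS_ID)
);

-- Virtual access layer: one flat view over the anchor and its static attribute tables
CREATE VIEW v_Person AS
SELECT a.PS_ID, bth.BirthDate, eml.Email
FROM PS_Person a
LEFT JOIN PS_BTH_Person_BirthDate bth ON bth.PS_ID = a.PS_ID
LEFT JOIN PS_EML_Person_Email     eml ON eml.PS_ID = a.PS_ID;

-- With table (join) elimination, this query should never touch the e-mail table:
-- that join is on a unique key and no Email column is referenced.
SELECT PS_ID, BirthDate FROM v_Person;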



-----------

Data Model example

Here I try to model the scope that was presented to me at work.  However, I have skipped a few details and generalised the concepts further.

In AM we should correctly separate entities having a clear identity (the Hub, shown in blue) from their attributes/descriptions/measures (the Attribute, shown in red), and identify relationships/associations/transactions (the Tie, shown in yellow).   The important thing to retain is that AM offers an end-result model that is expressive and provides insight into the particular domain.  This is more easily done with one or more SMEs.
   

An example of an Anchor data model









Note: this does not follow a strict AM form: Ties have attributes, the naming convention is not followed, there is more than one Attribute per table, audit metadata sits inside the primitives, etc.


For the EAV, I used one reference implementation:
http://ycmi.med.yale.edu/nadkarni/EAV_CR_frame.htm.  I've added historisation capability at the attribute/value pair level (the EAV tables) as well as explicit common attributes for all entity types.  I've also added a natural key to look up and update existing entities (oddly enough there was none in the EAV reference; I guess these are buried inside one EAV attribute?).

This is inspired by the concepts of odML and its metadata relationship: Section 0..n Property 1..m Value.   A Section is used for grouping entities of the same type and is associated with the set of Properties relevant to these entities.  These Section/Property/Value entries will grow as we add new data types into the PARAM_SPACE, as sketched below:
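A minimal sketch of what this could look like (illustrative names, loosely following the Section/Property/Value grouping rather than the exact ycmi or odML schemas):

-- Metadata side: Section groups entities of the same type; Property lists the
-- attributes relevant to a Section
CREATE TABLE section (
    section_id  INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE property (
    property_id INTEGER PRIMARY KEY,
    section_id  INTEGER NOT NULL REFERENCES section (section_id),
    name        VARCHAR(100) NOT NULL,
    datatype    VARCHAR(20)  NOT NULL
);

-- Data side: each entity instance carries the explicit common attributes,
-- including a natural key used to look up and update existing entities
CREATE TABLE entity_instance (
    instance_id INTEGER PRIMARY KEY,
    section_id  INTEGER NOT NULL REFERENCES section (section_id),
    natural_key VARCHAR(100) NOT NULL UNIQUE
);

-- Historised attribute/value pairs (the EAV table proper)
CREATE TABLE instance_value (
    instance_id INTEGER   NOT NULL REFERENCES entity_instance (instance_id),
    property_id INTEGER   NOT NULL REFERENCES property (property_id),
    valid_from  TIMESTAMP NOT NULL,
    value       VARCHAR(4000),
    PRIMARY KEY (instance_id, property_id, valid_from)
);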


An example of an EAV data model












Martin

Saturday, March 08, 2014

Anchor modeling (part-2)


In the previous post, I described the Anchor Modeling approach.  Here I will comment on its merits and provide a very small example of a data model.

My take on it:

Although I have never implemented AM on a large-scale project myself, I can certainly see a few points that are very attractive at first glance.  Here are some that quickly come to mind:
  • The model prescription really goes up to 6NF (except when we decide to include multiple non-identifying Anchor FKs or a knotted attribute in a single Tie).  This 6NF implies that no other attributes are stored along with the Anchor surrogate key, not even the natural key(s).  Potential benefits are:
    • Improved maintenance when dealing with evolutive or unstable natural keys.  This follows the principle of model extension, where we can accommodate new or changed natural keys in the future.
  • The modelling prescription details specific and strict rules.
    • Advantage: this reduces the likelihood of our model going “wrong”.  As a data modeller you are offered far fewer degrees of freedom, at least after you have defined Anchors and Ties.  Modelling a Tie is somewhat more flexible, as you seem free to add more or fewer non-identifying keys to the relationship.
  • The model guideline recognises the importance of the State that often describes a relationship (these sorts of states are always mutually exclusive and exhaustive).
    • Advantage: the Tie can be knotted to store this data in a straightforward way, with no need to come up with an external structure besides the Knot (see the sketch after this list).  In DV, you’d need external Satellites for storing the attributes and their time segments.  This feature is convenient, as in real-world projects we often see this pattern where a relationship goes through some form of lifecycle, changing state.
    • Disadvantage:  the flip side is that once your relationship is built like that, it becomes less flexible and can hardly accommodate new things... In this respect, DV Links modelled as immutable intersections of keys offer better long-term flexibility.
  • The model guideline also recognises the importance of having at most one key, called a Role, in the relationship outside the Tie key identifiers.
    • Advantage: For Historical Tie, this makes explicit what changes have triggered an update in the relationship.
  • The model guideline recommends keeping the Tie’s width small (# of Roles).  In practice, we should try breaking larger Ties down into smaller constituents.  This helps with:
    • handling of late-arriving (or asynchronous) facts, which would otherwise delay the capture of the relationship since recording can only happen once all roles are known;
    • decreased data redundancy from historised relationships with a frequently changing state/property.  In this scenario, lots of nearly identical rows arise since all roles except the non-identifying state/property role stay identical;
    • increased stability: larger relationships are more likely to become deprecated, rendering your AM model less evolutive and forcing more frequent replacement of existing Ties with new ones.
  • The Knot is a primitive modeling construct holding important referential data:
    • Advantage: static sets of text values are easily recognised.  We avoid disseminating these text values, typical of less normalised structures, and thus reduce data redundancy throughout our data model.
  • The model guideline proposes a clear separation between metadata and data.  Auditing info is stored and maintained separately from the data in dedicated metadata structures.  Metadata is referred to by an FK pointing to a global Audit table in a standard way.
    • Maintaining data separate from metadata leads to cleaner data structures and less chance of misinterpreting timestamp data fields.
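As an illustration of the knotted-Tie point above, here is a minimal sketch; the Employee/Company anchors, the state Knot and the tie are generic illustrations (with loosely AM-style names) and are not part of any model discussed in this post.

-- Two illustrative anchors
CREATE TABLE EM_Employee (EM_ID INTEGER PRIMARY KEY);
CREATE TABLE CM_Company  (CM_ID INTEGER PRIMARY KEY);

-- Knot: the small, static set of states the relationship can be in
CREATE TABLE EST_EmploymentState (
    EST_ID    SMALLINT    PRIMARY KEY,
    EST_State VARCHAR(20) NOT NULL UNIQUE   -- e.g. 'active', 'suspended', 'terminated'
);

-- Historised, knotted Tie: the state is carried by the tie row itself,
-- so no external satellite-like structure is needed to track the lifecycle
CREATE TABLE EM_worksFor_CM_employer (
    EM_ID     INTEGER  NOT NULL REFERENCES EM_Employee (EM_ID),
    CM_ID     INTEGER  NOT NULL REFERENCES CM_Company (CM_ID),
    EST_ID    SMALLINT NOT NULL REFERENCES EST_EmploymentState (EST_ID),
    ChangedAt DATE     NOT NULL,
    PRIMARY KEY (EM_ID, CM_ID, ChangedAt)
);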

However, an AM implementation has its drawbacks, the most important one being that we probably need a complete toolset and a separate access layer to help us manage the explosion of data structures (tables, views, ...).  No one would want to interact directly at the physical level with any real-size AM implementation.

Other important issues could come from DB vendor limitations, as some required features may not be supported (e.g. table elimination is almost mandatory for query optimisation).  Bottom line: you certainly need a lot of experimentation before leveraging this technique, to see whether the additional complexity can be managed and mitigated in a large-scale implementation.


A small Example:

I conclude this post by presenting a small example.  Using the available modeling tool, you can quickly start creating your own model.  So let’s imagine a DB used to collect any info, statistics and indicators/metrics produced about countries and cities in the world.  As these data need to be refreshed periodically, it is important to capture their history as well as sourcing, format and other meta-info.

The diagram presented next (each entity type has a dedicated appearance) gives an overview of the data model.  Red squares represent Anchors, circles Attributes, and rounded squares Knots, while grey diamonds represent Ties.  Optionally, we may choose to keep historisation for a Tie or an Attribute, in which case its outline is drawn as a double line.

General view of a simple AM model



The naming and mnemonic convention is used to automatically derive physical table names (as well as views and functions).  We can toggle between mnemonic and real names in the diagram.

We see below the two main Anchors, City and Country, naturally linked through a Tie, as well as a few other Anchors tied together.
Model zoom-in around Hubs Country and City


Let’s suppose that during the lifetime of our data repository, we had decided to keep the ISO ALPHA-2 (2-letter) code as natural key.  This was used for country look-up with the help of a natural-key view.  However, later on we realised that a lot of data sources actually used the ISO ALPHA-3 (3-letter) code as country identifier.  No problem here: our AM model transparently supports the addition of any sort of Attribute.  The only impact is that we now have two natural-key views to choose from for the look-up.
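In sketch form, such natural-key look-up views could look like this (illustrative names, not the exact DDL generated by the AM tool):

-- Illustrative tables: the Country anchor and its ISO ALPHA-2 attribute
CREATE TABLE CO_Country (CO_ID INTEGER PRIMARY KEY);

CREATE TABLE CO_ISO2_Country_ISOAlpha2 (
    CO_ID      INTEGER NOT NULL REFERENCES CO_Country (CO_ID),
    ISO_Alpha2 CHAR(2) NOT NULL,
    PRIMARY KEY (CO_ID)
);

-- Natural-key look-up view based on the original ALPHA-2 code
CREATE VIEW nk_Country_ISO2 AS
SELECT iso2.ISO_Alpha2, iso2.CO_ID
FROM CO_ISO2_Country_ISOAlpha2 iso2;

-- Later: the ALPHA-3 code arrives as a brand new attribute table plus its own
-- look-up view; nothing that already exists is modified.
CREATE TABLE CO_ISO3_Country_ISOAlpha3 (
    CO_ID      INTEGER NOT NULL REFERENCES CO_Country (CO_ID),
    ISO_Alpha3 CHAR(3) NOT NULL,
    PRIMARY KEY (CO_ID)
);

CREATE VIEW nk_Country_ISO3 AS
SELECT iso3.ISO_Alpha3, iso3.CO_ID
FROM CO_ISO3_Country_ISOAlpha3 iso3;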

We later discovered, with disbelief, that our natural keys are NOT immutable… quite annoying for a database key.  As always, this seems easy to know in hindsight, but at the time who could have guessed that our ISO standard key would betray us.  Our data model must then allow natural keys to evolve in time!  Again no problem: our AM model has “Historisation by Design” built in, and the only impact is the addition of a new field for temporal validity inside the natural-key Attribute tables.  We also need a slight change to the natural-key view to accommodate the point-in-time entry needed by the new look-up logic.
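Continuing the sketch above, historising the natural-key attribute and adapting the look-up could look like this (again illustrative; a real AM script would regenerate the table and its views rather than create a new one):

-- The ISO ALPHA-2 attribute, now historised: the validity date becomes part of the key
CREATE TABLE CO_ISO2_Country_ISOAlpha2_hist (
    CO_ID      INTEGER NOT NULL REFERENCES CO_Country (CO_ID),
    ISO_Alpha2 CHAR(2) NOT NULL,
    ChangedAt  DATE    NOT NULL,
    PRIMARY KEY (CO_ID, ChangedAt)
);

-- The look-up now needs a point in time: "which country did this code identify on that date?"
SELECT iso2.CO_ID
FROM CO_ISO2_Country_ISOAlpha2_hist iso2
WHERE iso2.ISO_Alpha2 = 'BU'              -- e.g. Burma's code before it changed to 'MM'
  AND iso2.ChangedAt = (SELECT MAX(h.ChangedAt)
                        FROM CO_ISO2_Country_ISOAlpha2_hist h
                        WHERE h.CO_ID = iso2.CO_ID
                          AND h.ChangedAt <= DATE '1988-01-01');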

Important note: had we kept our natural key(s) alongside the surrogate key, these evolutions would have involved breaking existing elements of our model!  This illustrates the benefit of the higher normalisation used by AM.

Conversely, from our small business case we clearly see where 6NF could cause us real harm… If we start adding each new indicator as one Attribute around Country and City (historised, as each can be updated in the future), there will quickly be an explosion of tables, making the entire data model unsustainable!

To avoid this pitfall we make use of data model generalisation principles and create more abstract entities.  So instead of adding new tables to store the values of each new indicator, we turn those tables into rows.  In our simple scenario, this becomes possible by defining two new abstract entities: Indicator and Indicator-instance.  Each Indicator has a clear identity (Anchor: Indicator) and is described by its name, its definition and its source (i.e. its surrounding Attributes).  The indicator value is represented as an anchor (IndInstance), whose identity corresponds to the indicator value given for a particular country (or city) at a given year.  One Attribute holds the exact value, while a Tie is needed to tie everything together (the indicator, the country and the year period).
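Continuing the Country sketch above, the generalised part could look like this (simplified, illustrative naming; no knots or audit metadata, and a strict AM model would knot or anchor the year rather than keep it inline):

-- The Indicator anchor and one illustrative descriptive attribute (definition and source omitted)
CREATE TABLE IN_Indicator (IN_ID INTEGER PRIMARY KEY);

CREATE TABLE IN_NAM_Indicator_Name (
    IN_ID INTEGER      NOT NULL REFERENCES IN_Indicator (IN_ID),
    Name  VARCHAR(200) NOT NULL,
    PRIMARY KEY (IN_ID)
);

-- Each measured value is itself an anchor instance (IndInstance)...
CREATE TABLE II_IndInstance (II_ID INTEGER PRIMARY KEY);

-- ...with one historised attribute holding the actual value (values can be restated later)
CREATE TABLE II_VAL_IndInstance_Value (
    II_ID     INTEGER        NOT NULL REFERENCES II_IndInstance (II_ID),
    Value     NUMERIC(18, 4) NOT NULL,
    ChangedAt DATE           NOT NULL,
    PRIMARY KEY (II_ID, ChangedAt)
);

-- The tie binds everything together: which indicator, for which country, for which year
CREATE TABLE II_of_IN_for_CO_in_Year (
    II_ID INTEGER  NOT NULL REFERENCES II_IndInstance (II_ID),
    IN_ID INTEGER  NOT NULL REFERENCES IN_Indicator (IN_ID),
    CO_ID INTEGER  NOT NULL REFERENCES CO_Country (CO_ID),
    Year  SMALLINT NOT NULL,
    PRIMARY KEY (II_ID)
);

-- Adding a brand new indicator is now just new rows, not new tables.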

Model entities for Indicator and their instance values


More meta-info could be added to locate the information source on the web, along with other info related to format and document type.


Martin


Sunday, March 02, 2014

ETL plugins for Data Vault

I recently finished a project whose goal was to provide an easy way to load data into a database modelled using the Data Vault approach.  At the time, I had chosen an open source ETL tool called PDI, from Pentaho's suite.

Unfortunately, due to some issues there was a shift of strategy and this tool is no longer needed.  For the sake of sharing code, I decided to upload this work to GitHub, so anyone willing to use/share/improve it can do so freely.  Please note that the code works as specified; however, much more realistic performance testing and improvement is expected.  I know for a fact that some level of caching is necessary for the plugin to work in a realistic data warehousing environment.  Anyone feeling up to it, feel free to fork it!

Update note:
The whole project, called "DV-loaders", with its source code has now moved to GitHub ( https://github.com/mart2010/pdi-datavault-plugin ).


The rest of this post gives more detail on the tools.


DV-loaders provide custom Transformation steps for loading Data Vault compliant objects: Hub, Link and Satellite.  These are developed using the plugin extension capability offered by the Pentaho Data Integration tool (PDI, aka Kettle).

With these custom Steps, you can:
  1. Load Hubs, Links and Satellites quickly and easily
  2. Define as many Steps as needed inside the same Transformation
  3. Load mandatory attributes compliant with logical Data Vault rules
  4. Load non-mandatory fields as-is (ex. batch-Id for auditing)
  5. Adjust buffer size to fine-tune performance
As an example, consider this simple DV data model:
Then one can load it with this simple Transformation:


INTRODUCTION


Data Vault is a methodology for building the integration data layer of an Enterprise Data Warehouse that relies on a simple and repetitive pattern-based approach.
An ETL tool gives you access to any data input format and any backend output out of the box, without re-inventing the wheel.
PDI is an open source ETL tool that can be extended with custom plugins.
Data Vault + ETL + PDI = DV-Loader plugin
The DV-Loader plugin makes it even easier and more performant to load any Data Vault model from data streams processed by PDI.

Details


Features

The PDI DV-Loader plugin offers :
  1. Simplicity
    • easy configuration setting using a consistent ETL design-model
    • same process usable for full AND incremental loading
    • Load Hub and Load Link steps "expose" the technical key (PK) to downstream steps (usable for dependent Satellites)
    • ETL data flow structure closely follows the DV data model
  2. Robustness
    • fully re-entrant and re-startable ETL process
    • support duplicated records in the input stream
    • support unsorted satellite historical records
    • support "a posteriori" satellite historical catch-up
  3. Performance
    • leverage JDBC batch processing
    • query look-ups are done not on individual keys but on sets of keys, minimising the number of slow DB round-trips (see the look-up sketch after this list)
    • batch and key look-up sizes are defined by the Buffer size parameter
  4. Compliancy
    • load mandatory DV fields using DV logic:
      • technical keys
      • business keys
      • satellite fields
      • temporal fields setting satellite record validity
      • audit mandatory fields
    • load other non-mandatory fields as-is (pass-through attributes like batch-Id for meta-audit)
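To illustrate the set-based look-up mentioned under Performance, this is the kind of single query the step can issue for a whole buffer of rows instead of one query per key; the hub table and business-key column names are hypothetical, and the exact SQL generated by the plugin may differ.

-- One round-trip resolves a whole buffer of business keys
SELECT h.hub_customer_id, h.customer_bk
FROM hub_customer h
WHERE h.customer_bk IN ('C-1001', 'C-1002', 'C-1003', 'C-1004');   -- one entry per buffered input row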

Assumptions

Only a small number of assumptions are made:
  • Hubs and Links must have a primary key defined as a unique sequence integer (any DB data type used for storing integers is supported)
  • Temporal validity of Satellite rows is determined through a single temporal field (Date type) available in the incoming record stream (incoming hop)
  • Business keys must all be not nullable
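A minimal DDL sketch of target tables satisfying these assumptions (table and column names are illustrative; sequences could equally be identity columns or keys generated by the ETL):

-- Hub: sequence-generated integer PK and a not-nullable business key
CREATE TABLE hub_customer (
    hub_customer_id BIGINT      PRIMARY KEY,
    customer_bk     VARCHAR(50) NOT NULL UNIQUE,
    load_dts        TIMESTAMP   NOT NULL,
    record_source   VARCHAR(50)
);

-- Link: sequence-generated integer PK plus the related Hub keys
CREATE TABLE link_customer_product (
    link_cust_prod_id BIGINT    PRIMARY KEY,
    hub_customer_id   BIGINT    NOT NULL REFERENCES hub_customer (hub_customer_id),
    hub_product_id    BIGINT    NOT NULL,        -- FK to a hub_product table, omitted here
    load_dts          TIMESTAMP NOT NULL,
    record_source     VARCHAR(50)
);

-- Satellite: record validity driven by a single temporal field coming from the input stream
CREATE TABLE sat_customer_detail (
    hub_customer_id BIGINT    NOT NULL REFERENCES hub_customer (hub_customer_id),
    from_date       TIMESTAMP NOT NULL,
    customer_name   VARCHAR(100),
    load_dts        TIMESTAMP NOT NULL,
    record_source   VARCHAR(50),
    PRIMARY KEY (hub_customer_id, from_date)
);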

Rules

The ETL flow design must respect these logical rules:
  • A Hub can be loaded once its business key(s) are available in the input record stream
  • A Link can be loaded once all referenced Hub primary keys are available in the input record stream
  • A Satellite can be loaded once the referenced Hub (or Link) primary key is available in the input record stream (normally appended upstream by a Load Hub step)


USER GUIDE

ETL data flow

The data model is loaded by attaching Load steps through hops:
  1. Load Hub
    • Connect Step through a hop containing all business/natural key(s) in input stream
    • Step will look-up business/natural key(s) and append the associated tech-key (Hub's PK)
      • When found: append the tech-key returned by DB
      • When not found: generate new key (using sequence method defined) and append it
    • Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
  2. Load Link
    • Connect Step through hop containing all tech keys of the relationship
    • Step will look-up Hub tech keys and append the associated Link tech-key (its PK)
      • When found: append the tech-key returned by DB
      • When not found: generate new key (using sequence method defined) and append it
    • Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
  3. Load Satellite
    • Connect Step through a hop containing the Hub or Link's PK tech-key, the attributes and the "From-Date" temporal attribute controlling satellite record's lifecycle
    • The Step will load a new satellite record depending on the use case (see the sketch after this list):
      • Temporal Satellite with Idempotent=true: the sat record is loaded unless there is an identical consecutive record (default)
      • Temporal Satellite with Idempotent=false: the sat record is loaded irrespective of consecutive records
      • Static Satellite (no "From-Date" temporal attribute defined): the sat record is loaded unless one already exists for the Hub
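The idempotent case can be pictured as the following guarded insert, continuing the illustrative satellite table sketched under Assumptions; this is only a sketch of the logic, as the plugin performs the comparison in buffered batches rather than as a single SQL statement.

-- Insert the new sat row only when the most recent row at or before its From-Date differs
INSERT INTO sat_customer_detail (hub_customer_id, from_date, customer_name, load_dts, record_source)
SELECT 123, TIMESTAMP '2014-03-01 00:00:00', 'ACME Corp', CURRENT_TIMESTAMP, 'CRM'
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_customer_detail s
    WHERE s.hub_customer_id = 123
      AND s.from_date = (SELECT MAX(s2.from_date)
                         FROM sat_customer_detail s2
                         WHERE s2.hub_customer_id = 123
                           AND s2.from_date <= TIMESTAMP '2014-03-01 00:00:00')
      AND s.customer_name = 'ACME Corp'      -- identical consecutive state: skip the insert
);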

Settings common to all Steps

General Properties:
  • Step name: Name of the Step; must be unique within the Transformation
  • Connection: Database connection to use
  • Hub/Link/Sat table: Target table name to load
  • Buffer size: Determines the number of input rows processed at once.  Too large a value may generate query look-ups or batch inserts causing JDBC driver errors (typical values are 500 or more)
Audit-related Properties:
  • Sys-creation Date/Time: The column holding the timestamp indicating when the record was loaded into the DB (leave empty when not used)
  • Load Record Source: Column holding the audit record-source information (leave empty if not used)
  • Record source value: Value to store in the "Load Record Source" column (may be set by a variable substitution)

Settings specific to the "Load Hub" Step

  • Attribute Mapping: Mapping between input stream fields and database columns
  • Attribute of type Business/Natural Key: field(s) corresponding to the business key(s)
  • Other type: pass-through fields simply loaded as-is (useful for attributes like batch-id, etc.)
  • PK Sequence-Id setting: Defines the technical PK column and which sequence generation method to use

Settings specific to the "Load Link" Step

  • Attribute Mapping: Mapping between input stream fields and database columns
  • Attribute of type "Relationship Key": fields corresponding to the relationship keys (i.e. each Hub's primary key)
  • Other type: pass-through fields simply loaded as-is (useful for attributes like batch-id, etc.)
  • PK Sequence-Id setting: Defines the technical PK column of the Link and which sequence generation method to use

Settings specific to the "Load Sat" Step

  • Idempotent transformation: Idempotent ignores records having an identical state at two consecutive points in time (all attributes equal).  The Data Vault standard is idempotent, but you may have different requirements
  • Attribute Mapping: Mapping between input stream fields and database columns
  • Attribute of type "Foreign-Key to Hub": field representing the Hub's PK (typically provided by an upstream "Load Hub" step)
  • Attribute of type "From-Date Temporal": field representing the temporal attribute controlling the timeline of each Satellite record.  Using an input field instead of a fixed value adds more flexibility: if an incoming field is appropriate we simply use it, otherwise we append one upstream, and other attributes could also be used (e.g. when using file input, when using table input, etc.).  Use the temporal attribute best adapted to your use case
  • Attribute of type Normal: fields recorded inside the Satellite.  All of these control the sat record timeline (or lifecycle).  Pass-through fields may only be added if their changing values do not impact the Satellite lifecycle with regard to the Hub record (the ETL batch-id attribute is a valid example)


INSTALLATION

Pre-requisite

 PDI version 5.x

Download and Install

* Download latest archive/package
* Unzip it inside folder: ${PDI_HOME}/plugins/steps

Check installation

* new folder 'DV-loader' should now exist in: ${PDI_HOME}/plugins/steps
* Re-start PDI Spoon UI (${PDI_HOME}/spoon.sh or ${PDI_HOME}/spoon.bat)
* Create a PDI Transformation and add DV-loader steps found under category Experimental:
  • Load Hub
  • Load Link
  • Load Sat

Martin