Sunday, March 02, 2014

ETL plugins for Data Vault

I recently finished a project whose goal was to provide an easy way to load data into a database modelled using the Data Vault approach. At the time, I had chosen an open source ETL tool called PDI, from Pentaho's suite.

Unfortunately, due to some issues there was a shift of strategy and this tool would no longer be needed. For the sake of sharing code, I decided to upload this work to GitHub, so anyone willing to use, share or improve it can do so freely. Please note that the code works as specified; however, more realistic performance testing and improvement are still expected. I know for a fact that some level of caching is necessary for the plugin to work in a realistic data warehousing environment. Anyone feeling up to it, feel free to fork it!

Update note:
The whole project, called "DV-loaders", has now been moved to GitHub along with its source code ( https://github.com/mart2010/pdi-datavault-plugin ).


The rest of this post gives more detail on the tool.


DV-loaders provides custom Transformation steps for loading Data Vault compliant objects: Hubs, Links and Satellites. These are developed using the plugin extension capability offered by the Pentaho Data Integration tool (PDI, aka Kettle).

With these custom Steps, you can:
  1. Load Hubs, Links and Satellites quickly and easily
  2. Define as many Steps as needed inside the same Transformation
  3. Load mandatory attributes compliant with logical Data Vault rules
  4. Load non-mandatory fields as-is (e.g. batch-Id for auditing)
  5. Adjust buffer size to fine-tune performance
As an example, consider a simple DV data model made of Hubs, a Link and Satellites; one can then load it with an equally simple Transformation that chains the corresponding Load steps.
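To make the example concrete, here is a minimal sketch of what such a model could look like, expressed as plain Java value classes; all table, key and attribute names (HubCustomer, customerTk, etc.) are hypothetical.

import java.sql.Timestamp;

// Hub: one row per business key, identified by a surrogate technical key.
class HubCustomer {
    long customerTk;           // technical PK (unique sequence integer)
    String customerNumber;     // business/natural key
    Timestamp sysCreationDts;  // audit: when the record was loaded
    String recordSource;       // audit: record source
}

// Link: one row per distinct relationship between Hub technical keys.
class LinkCustomerProduct {
    long linkTk;               // technical PK of the Link
    long customerTk;           // relationship key (HubCustomer's PK)
    long productTk;            // relationship key (a second Hub's PK)
    Timestamp sysCreationDts;
    String recordSource;
}

// Satellite: descriptive attributes, historized by a "From-Date" field.
class SatCustomer {
    long customerTk;           // FK to HubCustomer
    Timestamp fromDate;        // temporal field setting record validity
    String name;               // normal (descriptive) attributes
    String address;
    Timestamp sysCreationDts;
    String recordSource;
}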


INTRODUCTION


Data Vault is a methodology suited for building the integration data layer of an Enterprise Data Warehouse; it relies on a simple and repetitive pattern-based approach.
An ETL tool gives you access to any data input format and any backend output out of the box, without re-inventing the wheel.
PDI is an open source ETL tool that can be extended by custom plugins.
Data Vault + ETL + PDI = DV-Loader plugin
The DV-Loader plugin makes it even easier and more performant to load any Data Vault model from the data stream processed by PDI.

Details


Features

The PDI DV-Loader plugin offers:
  1. Simplicity
    • easy configuration using a consistent ETL design model
    • same process usable for full AND incremental loading
    • Load Hub and Load Link steps "expose" their technical key (PK) to downstream steps (usable for dependent Satellites)
    • the ETL data flow structure closely follows the DV data model
  2. Robustness
    • fully re-entrant and re-startable ETL process
    • supports duplicated records in the input stream
    • supports unsorted satellite historical records
    • supports "a posteriori" satellite historical catch-up
  3. Performance
    • leverages JDBC batch processing
    • query lookups are done not on individual keys but on sets of keys, to minimize the number of sluggish DB round-trips (see the sketch after this list)
    • batch and key-lookup sizes are defined by the Buffer size parameter
  4. Compliance
    • load mandatory DV fields using DV logic:
      • technical keys
      • business keys
      • satellite fields
      • temporal fields setting satellite record validity
      • mandatory audit fields
    • load other non-mandatory fields as-is (pass-through attributes like batch-Id for meta-audit)
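To make the performance points above concrete, here is a minimal JDBC sketch (not the plugin's actual code) of a set-based key lookup combined with a batch insert; the table and column names (hub_customer, customer_tk, customer_number) are hypothetical.

import java.sql.*;
import java.util.*;

public class BufferedHubLoader {

    // Look up a whole buffer of business keys in a single round-trip,
    // instead of issuing one query per key.
    static Map<String, Long> lookupKeys(Connection con, List<String> bizKeys)
            throws SQLException {
        Map<String, Long> found = new HashMap<>();
        if (bizKeys.isEmpty()) return found;
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < bizKeys.size(); i++) in.append(i == 0 ? "?" : ",?");
        String sql = "SELECT customer_number, customer_tk FROM hub_customer"
                   + " WHERE customer_number IN (" + in + ")";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (int i = 0; i < bizKeys.size(); i++) ps.setString(i + 1, bizKeys.get(i));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) found.put(rs.getString(1), rs.getLong(2));
            }
        }
        return found;
    }

    // Insert all missing keys with a single JDBC batch.
    static void insertBatch(Connection con, Map<String, Long> newRows)
            throws SQLException {
        String sql = "INSERT INTO hub_customer (customer_tk, customer_number) VALUES (?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (Map.Entry<String, Long> e : newRows.entrySet()) {
                ps.setLong(1, e.getValue());
                ps.setString(2, e.getKey());
                ps.addBatch();  // buffered client-side, sent in one round-trip
            }
            ps.executeBatch();
        }
    }
}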

Assumptions

Only a small number of assumptions are made:
  • Hub and Link must have their primary key defined as a unique sequence integer (any DB data type used for storing integers is supported)
  • Temporal validity of Satellite rows is determined through a single temporal field (Date type) available in the incoming record stream (incoming hop)
  • Business keys must all be Not Nullable

Rules

The ETL flow design must respect these logical rules:
  • A Hub can be loaded once its business key(s) are available in the input record stream
  • A Link can be loaded once all referenced Hub primary keys are available in the input record stream
  • A Satellite can be loaded once the referenced Hub (or Link) primary key is available in the input record stream (normally appended upstream by a Load Hub step)


USER GUIDE

ETL data flow

The data model is loaded by attaching Load Steps through hops:
  1. Load Hub
    • Connect the Step through a hop whose input stream contains all business/natural key(s)
    • The Step looks up the business/natural key(s) and appends the associated tech-key (the Hub's PK); see the sketch after this list
      • When found: append the tech-key returned by the DB
      • When not found: generate a new key (using the sequence method defined) and append it
    • The Step "exposes" the tech-key in the output stream as <Table-Name.Techkey-Name>
  2. Load Link
    • Connect the Step through a hop containing all tech-keys of the relationship
    • The Step looks up the Hub tech-keys and appends the associated Link tech-key (its PK)
      • When found: append the tech-key returned by the DB
      • When not found: generate a new key (using the sequence method defined) and append it
    • The Step "exposes" the tech-key in the output stream as <Table-Name.Techkey-Name>
  3. Load Satellite
    • Connect the Step through a hop containing the Hub's or Link's PK tech-key, the attributes, and the "From-Date" temporal attribute controlling the satellite record's lifecycle
    • The Step loads new satellite records according to the use-case:
      • Temporal Satellite with Idempotent=true: the sat record is loaded unless there is an identical consecutive record (default)
      • Temporal Satellite with Idempotent=false: the sat record is loaded irrespective of consecutive records
      • Static Satellite (no "From-Date" temporal attribute defined): the sat record is loaded unless one already exists for the Hub
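For illustration, here is a minimal sketch, with hypothetical names, of the "look up, else generate" rule described in step 1 above; the actual plugin resolves whole buffers of keys per round-trip, and the sequence call shown assumes a PostgreSQL-style DB sequence.

import java.sql.*;
import java.util.Map;

public class HubKeyResolver {

    static long resolveHubKey(Connection con, String bizKey,
                              Map<String, Long> lookedUp) throws SQLException {
        Long tk = lookedUp.get(bizKey);       // result of the set-based lookup
        if (tk != null) {
            return tk;                        // found: append the DB tech-key
        }
        long newTk = nextSequenceValue(con);  // not found: generate a new key
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO hub_customer (customer_tk, customer_number) VALUES (?, ?)")) {
            ps.setLong(1, newTk);
            ps.setString(2, bizKey);
            ps.executeUpdate();
        }
        return newTk;  // exposed downstream as <Table-Name.Techkey-Name>
    }

    static long nextSequenceValue(Connection con) throws SQLException {
        // assumes a DB sequence; the plugin uses whichever sequence
        // generation method is configured in the Step
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT nextval('seq_customer')")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}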

Settings common to all Steps

General Properties:
  • Step name: name of the Step; must be unique within the Transformation
  • Connection: database connection to use
  • Hub/Link/Sat table: target table name to load
  • Buffer size: determines the number of input rows processed at once. Too large a value may generate lookup queries or insert batches that cause JDBC driver errors (typical values are in the range of 500 or more); see the loop sketch after these properties
Audit-related Properties:
  • Sys-creation Date/Time: the column holding the timestamp indicating when the record was loaded into the DB (leave empty when not used)
  • Load Record Source: the column holding the audit record-source information (leave empty if not used)
  • Record source value: the value to store in the "Load Record Source" column (may be set by variable substitution)
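For illustration only, the effect of Buffer size can be pictured as an accumulate-and-flush loop; this is a sketch assuming a hypothetical flush routine such as the BufferedHubLoader shown earlier.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BufferLoop {
    // Process the input stream in chunks of bufferSize rows: one lookup
    // query and one insert batch per chunk, instead of one per row.
    static void load(Connection con, Iterable<String> inputRows, int bufferSize)
            throws SQLException {
        List<String> buffer = new ArrayList<>(bufferSize);
        for (String bizKey : inputRows) {
            buffer.add(bizKey);
            if (buffer.size() == bufferSize) {
                flush(con, buffer);           // set-based lookup + batch insert
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) flush(con, buffer);  // last, partial buffer
    }

    static void flush(Connection con, List<String> keys) throws SQLException {
        // e.g. BufferedHubLoader.lookupKeys(...) then insertBatch(...)
    }
}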

Settings specific to Step "Load Hub"

  • Attribute Mapping: mapping between input stream fields and database columns
    • Attribute of type "Business/Natural Key": field(s) corresponding to the business key(s)
    • Other type: pass-through field simply loaded as-is (useful for attributes like batch-id, etc.)
  • PK Sequence-Id setting: defines the technical PK column and which sequence generation method to use

Settings specific to Step "Load Link"

  • Attribute Mapping: mapping between input stream fields and database columns
    • Attribute of type "Relationship Key": fields corresponding to the relationship keys (i.e. each Hub's primary key)
    • Other type: pass-through field simply loaded as-is (useful for attributes like batch-id, etc.)
  • PK Sequence-Id setting: defines the technical PK column of the Link and which sequence generation method to use

Settings specific to Step "Load Sat"

  • Idempotent transformation: when idempotent, records having an identical state at two consecutive times (all attributes equal) are ignored. The Data Vault standard is idempotent, but you may have different requirements (see the sketch after this list)
  • Attribute Mapping: mapping between input stream fields and database columns
    • Attribute of type "Foreign-Key to Hub": field representing the Hub's PK (typically provided by an upstream "Load Hub" step)
    • Attribute of type "From-Date Temporal": field representing the temporal attribute controlling the timeline of each Satellite record. Using an input field instead of a fixed value adds more flexibility: when an appropriate field exists, simply append it upstream, but other attributes could also be used (e.g. when using file input or table input). Use the temporal attribute most adapted to your use-case
    • Attribute of type "Normal": fields recorded inside the Satellite; these all control the sat record's timeline (or lifecycle). Pass-through fields may only be added if their changing values do not impact the Satellite lifecycle with regard to the Hub record (the ETL batch-id attribute is a valid example of this)
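Here is a minimal sketch of the Idempotent rule described above; SatRow and its fields are hypothetical, and only two attributes are compared for brevity.

import java.sql.Timestamp;
import java.util.Objects;

public class IdempotentCheck {

    static class SatRow {
        long hubTk;
        Timestamp fromDate;
        String name;     // "Normal" attributes controlling the timeline
        String address;
    }

    // Returns true when the candidate row must be inserted.
    static boolean shouldInsert(SatRow candidate, SatRow consecutive, boolean idempotent) {
        if (!idempotent || consecutive == null) {
            return true;  // Idempotent=false loads it irrespective of neighbours
        }
        boolean identical = Objects.equals(candidate.name, consecutive.name)
                         && Objects.equals(candidate.address, consecutive.address);
        return !identical;  // identical consecutive state: skip (idempotent)
    }
}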


INSTALLATION

Pre-requisite

* PDI version 5.x

Download and Install

* Download the latest archive/package
* Unzip it inside folder: ${PDI_HOME}/plugins/steps

Check installation

* A new folder 'DV-loader' should now exist in: ${PDI_HOME}/plugins/steps
* Restart the PDI Spoon UI (${PDI_HOME}/spoon.sh or ${PDI_HOME}/spoon.bat)
* Create a PDI Transformation and add the DV-loader steps, found under the Experimental category:
  • Load Hub
  • Load Link
  • Load Sat

Martin
