I recently finished a project whose goal was to provide an easy way to load data into a database modelled using the Data Vault approach. At the time, I had chosen an open source ETL tool called PDI, from Pentaho's suite.
Unfortunately, due to some issues there was a shift of strategy and this tool was no longer needed. For the sake of sharing code, I decided to upload this work to GitHub, so anyone willing to use, share or improve it can do so freely. Please note that the code works as specified; however, more realistic performance testing and improvement are still expected. I know for a fact that some level of caching is necessary for the plugin to work in a realistic data warehousing environment. Anyone feeling up to it, feel free to fork it!
Update note:
The whole project, called "DV-loaders", with its source code, has now moved to GitHub ( https://github.com/mart2010/pdi-datavault-plugin ).
The rest of this post gives more detail on the tool.
DV-loaders provides custom Transformation steps for loading Data Vault compliant objects: Hub, Link and Satellite. These are developed using the plugin extension capability offered by the Pentaho Data Integration tool (PDI, aka Kettle).
With these custom Steps, you can:
- Quickly and easily load Hub, Link and Satellite tables
- Define as many Steps as needed inside the same Transformation
- Load mandatory attributes compliant with logical Data Vault rules
- Load non-mandatory fields as-is (e.g. batch-Id for auditing)
- Adjust buffer size to fine-tune performance
As an example, consider this simple DV data model:
Then one can load it with this simple Transformation:
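To make the example concrete, here is a minimal sketch of what the underlying schema could look like, assuming PostgreSQL accessed over JDBC; every table, column and sequence name is an illustrative assumption, not something the plugin prescribes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Creates a hypothetical minimal Data Vault schema (one Hub, one Satellite,
 *  one Link). All names and the connection URL are illustrative assumptions. */
public class CreateSampleDvSchema {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/dv", "dv_user", "dv_pwd");
             Statement st = con.createStatement()) {

            // Hub: technical PK is a sequence-generated integer; the business
            // key is NOT NULL and unique (matching the plugin's assumptions)
            st.execute("CREATE SEQUENCE hub_customer_seq");
            st.execute("CREATE TABLE hub_customer ("
                     + "  customer_key  BIGINT PRIMARY KEY,"
                     + "  customer_nbr  VARCHAR(20) NOT NULL UNIQUE,"
                     + "  load_dts      TIMESTAMP,"
                     + "  rec_src       VARCHAR(50))");

            // Satellite: FK to the Hub plus a single 'From-Date' field
            // that drives each record's temporal validity
            st.execute("CREATE TABLE sat_customer ("
                     + "  customer_key  BIGINT REFERENCES hub_customer,"
                     + "  from_date     DATE,"
                     + "  name          VARCHAR(100),"
                     + "  city          VARCHAR(100),"
                     + "  load_dts      TIMESTAMP,"
                     + "  rec_src       VARCHAR(50),"
                     + "  PRIMARY KEY (customer_key, from_date))");

            // Link: relates Hubs; its PK is also a sequence-generated integer
            st.execute("CREATE TABLE link_customer_product ("
                     + "  cust_prod_key BIGINT PRIMARY KEY,"
                     + "  customer_key  BIGINT REFERENCES hub_customer,"
                     + "  product_key   BIGINT,"   // would reference a hub_product in a fuller model
                     + "  load_dts      TIMESTAMP,"
                     + "  rec_src       VARCHAR(50))");
        }
    }
}
```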
INTRODUCTION
- Data Vault is a methodology suited for building an integration data layer of the Enterprise Data Warehouse that relies on a simple and repetitive pattern-based approach.
- An ETL tool gives you access to any data input format and any backend output out of the box, without re-inventing the wheel.
- PDI is an open source ETL tool that can be extended by custom plugins
- Data Vault + ETL + PDI = DV-Loader plugin
- DV-Loader plugin makes it even easier and more performant to load any Data Vault model from data streams processed by PDI.
Details
Features
The PDI DV-Loader plugin offers:
- Simplicity
- easy configuration setting using a consistent ETL design-model
- same process usable for full AND incremental loading
- Load Hub and Load Link steps "expose" the technical key (PK) to downstream steps (usable for dependent Satellites)
- ETL data flow structure closely follows the DV data model
- Robustness
- fully re-entrant and re-startable ETL process
- support duplicated records in the input stream
- support unsorted satellite historical records
- support "a posteriori" satellite historical catch-up
- Performance
- leverage JDBC batch processing
- query lookups are done not on individual keys but on sets of keys, minimizing the number of costly DB round-trips (see the code sketch after this feature list)
- batch and key-lookup sizes are defined by the Buffer size parameter
- Compliance
- load mandatory DV fields using DV logic:
- technical keys
- business keys
- satellite fields
- temporal fields setting satellite record validity
- audit mandatory fields
- load other non-mandatory fields as-is (pass-through attributes like batch-Id for meta-audit)
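To illustrate the buffered, set-based pattern noted under Performance, here is a minimal sketch of how a "Load Hub"-style step could batch its lookups and inserts. It reuses the hypothetical hub_customer table from the schema sketch above and is not the plugin's actual code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the buffered, set-based pattern: one lookup query and one
 *  insert batch per buffer of rows, instead of one round-trip per row. */
class BufferedHubLoader {
    private final Connection con;
    private final int bufferSize;                  // corresponds to the "Buffer size" property
    private final List<String> buffer = new ArrayList<>();
    private long seq = 0;                          // stand-in for a real DB sequence

    BufferedHubLoader(Connection con, int bufferSize) {
        this.con = con;
        this.bufferSize = bufferSize;
    }

    void add(String businessKey) throws SQLException {
        buffer.add(businessKey);
        if (buffer.size() >= bufferSize) flush();  // one round-trip per buffer, not per row
    }

    void flush() throws SQLException {
        if (buffer.isEmpty()) return;
        // 1) ONE lookup for the whole buffer: WHERE business_key IN (?,?,...,?)
        String marks = String.join(",", Collections.nCopies(buffer.size(), "?"));
        Map<String, Long> keyByBk = new HashMap<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT customer_nbr, customer_key FROM hub_customer"
                + " WHERE customer_nbr IN (" + marks + ")")) {
            for (int i = 0; i < buffer.size(); i++) ps.setString(i + 1, buffer.get(i));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) keyByBk.put(rs.getString(1), rs.getLong(2));
            }
        }
        // 2) ONE JDBC batch insert for all business keys not found
        try (PreparedStatement ins = con.prepareStatement(
                "INSERT INTO hub_customer (customer_key, customer_nbr) VALUES (?, ?)")) {
            for (String bk : buffer) {
                if (!keyByBk.containsKey(bk)) {
                    long newKey = ++seq;           // real code: fetch from the configured sequence
                    keyByBk.put(bk, newKey);
                    ins.setLong(1, newKey);
                    ins.setString(2, bk);
                    ins.addBatch();
                }
            }
            ins.executeBatch();                    // leverages JDBC batch processing
        }
        // keyByBk now holds the tech-key to "expose" for every buffered row
        buffer.clear();
    }
}
```

Note how the same map that caches looked-up keys also absorbs duplicated business keys arriving within one buffer, which is one simple way to honour the "support duplicated records" feature.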
Assumptions
Only a small number of assumptions are made:
- Hub and Link must have their primary key defined as a unique sequence integer (any DB data type used to store integers is supported)
- Temporal validity of the Satellite rows is determined through a single temporal field (Date type) available in the incoming record stream (incoming hop)
- Business keys must all be Not Nullable
Rules
The ETL flow design must respect these logical rules:
- A Hub can be loaded once its business key(s) are available in the input record stream
- A Link can be loaded once all referred Hub primary keys are available in the input record stream
- A Satellite can be loaded once the referred Hub (or Link) primary key is available in the input record stream (normally appended upstream by a Load-Hub step)
USER GUIDE
ETL data flow
The data model is loaded by attaching Load Steps through hops:
- Load Hub
- Connect the Step through a hop containing all business/natural key(s) in the input stream
- Step will look-up business/natural key(s) and append the associated tech-key (Hub's PK)
- When found: append the tech-key returned by DB
- When not found: generate new key (using sequence method defined) and append it
- Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
- Load Link
- Connect the Step through a hop containing all tech-keys of the relationship
- Step will look-up Hub tech keys and append the associated Link tech-key (its PK)
- When found: append the tech-key returned by DB
- When not found: generate new key (using sequence method defined) and append it
- Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
- Load Satellite
- Connect the Step through a hop containing the Hub or Link's PK tech-key, the attributes and the "From-Date" temporal attribute controlling the satellite record's lifecycle
- Step will load the new satellite record based on different use-cases:
- Temporal Satellite with Idempotent=true: sat record is loaded unless there is an identical consecutive record (default)
- Temporal Satellite with Idempotent=false: sat record is loaded irrespective of consecutive records
- Static Satellite (no "From-Date" temporal attribute defined): sat record is loaded unless one already exists for the Hub
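These three use-cases can be condensed into a minimal in-memory sketch (hypothetical names and types; the actual step evaluates this against the database within its buffer):

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;

/** Sketch of the three satellite use-cases (hypothetical names and types). */
class SatUseCases {

    static class SatRecord {
        final LocalDate fromDate;                  // the "From-Date" temporal attribute
        final Map<String, Object> attributes;      // the "Normal" satellite attributes
        SatRecord(LocalDate fromDate, Map<String, Object> attributes) {
            this.fromDate = fromDate;
            this.attributes = attributes;
        }
    }

    /** history = existing sat records for the same Hub/Link key, sorted by fromDate;
     *  temporal = false models a Static Satellite (no From-Date defined). */
    static boolean shouldInsert(SatRecord incoming, List<SatRecord> history,
                                boolean temporal, boolean idempotent) {
        if (!temporal) {
            return history.isEmpty();              // Static: only one record per Hub
        }
        if (!idempotent) {
            return true;                           // load irrespective of consecutive records
        }
        // Idempotent (default): skip when the chronologically preceding record
        // carries an identical state (all attributes equal). The real step also
        // copes with unsorted input and historical catch-up.
        SatRecord previous = null;
        for (SatRecord r : history) {
            if (!r.fromDate.isAfter(incoming.fromDate)) previous = r;
        }
        return previous == null || !previous.attributes.equals(incoming.attributes);
    }
}
```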
Settings common to all Steps
General Properties:
Property | Definition |
Step name | Name of the Step (must be unique within the Transformation) |
Connection | Database connection to use |
Hub/Link/Sat table | Target table name to load |
Buffer size | Determines the number of input rows processed at once. Too large a value may produce lookup queries or insert batches that trigger JDBC driver errors (typical values are in the range of 500 or more) |
Audit-related Properties:
Property | Definition |
Sys-creation Date/Time | The column holding the timestamp indicating when the record was loaded into the DB (leave empty when not used) |
Load Record Source | Column holding the Audit record source information (leave empty if not used) |
Record source value | Value to store in the column "Load Record source" (may be set by a variable substitution) |
Settings specific to Step "Load Hub"
Property | Definition |
Attribute Mapping | Mapping between input stream field and database column |
Attribute of type Business/Natural Key | field(s) corresponding to the business key(s) |
Other type | pass-through fields simply loaded as-is (useful for attributes like batch-id, etc.) |
PK Sequence-Id setting | Define the technical PK column and which sequence generation method to use |
Settings specific to Step "Load Link"
Property | Definition |
Attribute Mapping | Mapping between input stream field and database column |
Attribute of type "Relationship Key" | field corresponding to the Relationship keys (i.e. each Hub's Primary key) |
Other type | pass-through field simply loaded as-is (useful for attribute like batch-id, etc.) |
PK Sequence-Id setting | Define the technical PK column of the Link and which sequence generation method to use |
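For "Load Link", the buffered lookup matches on the composite of Hub keys rather than on a single business key. A hypothetical sketch of that query, reusing the illustrative link_customer_product table from the schema example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

/** Sketch of a set-based Link lookup: the buffer is matched on the composite
 *  of Hub keys in a single query (hypothetical table/column names). */
class LinkLookupSketch {

    /** pairs = buffered (customer_key, product_key) tuples from the input stream. */
    static void lookup(Connection con, List<long[]> pairs) throws SQLException {
        if (pairs.isEmpty()) return;
        String onePair = "(customer_key = ? AND product_key = ?)";
        String where = String.join(" OR ", Collections.nCopies(pairs.size(), onePair));
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT cust_prod_key, customer_key, product_key"
                + " FROM link_customer_product WHERE " + where)) {
            int i = 1;
            for (long[] p : pairs) {       // bind both keys of every buffered pair
                ps.setLong(i++, p[0]);
                ps.setLong(i++, p[1]);
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // existing relationship: expose cust_prod_key downstream;
                    // pairs not returned here get a new key from the sequence,
                    // exactly as in the Hub sketch above
                }
            }
        }
    }
}
```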
Settings specific to Step "Load Sat"
Property | Definition |
Idempotent transformation | Idempotent ignores records having identical state at two consecutive points in time (all attributes are equal). The Data Vault standard is idempotent, but you may have different requirements |
Attribute Mapping | Mapping between input stream field and database column |
Attribute of type "Foreign-Key to Hub" | field representing the Hub's PK (typically provided by an upstream "Load Hub" step) |
Attribute of type "From-Date Temporal" | field representing the temporal attribute controlling the timeline of each Satellite record. Using an input field instead of a fixed |
Attribute of type Normal | fields recorded inside the Satellite; all of these determine the sat record timeline (or lifecycle). Pass-through fields may only be added if their changing values do not impact the Satellite lifecycle with regard to the Hub record (the ETL batch-id attribute is a valid example) |
INSTALLATION
Pre-requisite
PDI version 5.x
Download and Install
* Download latest archive/package
* Unzip it inside folder: ${PDI_HOME}/plugins/steps
Check installation
* new folder 'DV-loader' should now exist in: ${PDI_HOME}/plugins/steps
* Re-start PDI Spoon UI (${PDI_HOME}/spoon.sh or ${PDI_HOME}/spoon.bat)
* Create a PDI Transformation and add DV-loader steps found under category Experimental:
- Load Hub
- Load Link
- Load Sat
Martin