Showing posts with label ETL. Show all posts
Showing posts with label ETL. Show all posts

Sunday, March 02, 2014

ETL plugins for Data Vault

I recently finished a project whose goal was to provide an easy way to load data into a Database modelled using the Data Vault approaches.    At the time, I had chosen an open source ETL tools coming called PDI from Pentaho's suites.

Unfortunately, due to some issues there was a shift of strategy and this tool would no longer be needed.  In the sake of sharing code, I decided to upload this work to Git-Hub, so anyone willing to use/share/improve could do it freely.  Please note, that the code works as specified however a lot more realistic performance test and improvement is expected.  I know for a fact that some level of caching is necessary for the plugin to work in a realistic data warehousing environment.  Anyone feeling up to it, feel free to fork it!

Update note:
The whole project called "DV-loaders" with code source is now moved to github ( https://github.com/mart2010/pdi-datavault-plugin ).


The rest of this post gives more detail on the tools.


DV-loaders provide custom Transformation steps for loading Data Vault compliant objects: Hub, Link and Satellite. These are developed using plugin extension capability offered by Pentaho Data Integration tool (PDI, aka Kettle).

With these custom Steps, you can:
  1. Load quickly and easily HubLink and Satellite
  2. Define as many Steps as needed inside the same Transformation
  3. Load mandatory attributes compliant with logical Data Vault rules
  4. Load non-mandatory fields as-is (ex. batch-Id for auditing)
  5. Adjust buffer size to fine-tune performance
As an example, consider this simple DV data model:
Then one can load it with this simple Transformation:


INTRODUCTION


Data Vault is a methodology suited for building an integration data layer of the Enterprise Data Warehouse that relies on a simple and repetitive pattern-based approach.
ETL tool gives you access to any data input format and any backend output out of the box, without re-inventing the wheel.
PDI is an open source ETL tool that can be extended by custom plugin
Data Vault + ETL + PDI = DV-Loader plugin
DV-Loader plugin makes it even easier and more performant to load any Data Vault model from data stream processed by PDI.

Details


Features

The PDI DV-Loader plugin offers :
  1. Simplicity
    • easy configuration setting using a consistent ETL design-model
    • same process usable for full AND incremental loading
    • Load Hub and Link "expose" technical key (PK) to downstream steps (usable for dependent Satellites)
    • ETL data flow structure closely follows the DV data model
  2. Robustness
    • fully re-entrant and re-startable ETL process
    • support duplicated record in input stream
    • support unsorted satellite historical records
    • support "a posteriori" satellite historical catch-up
  3. Perfomance
    • leverage JDBC batch processing
    • query lookup done not on individual key but rather using a set of keys to minimize the number of sluggish DB round-trip
    • batch and key lookup size is defined by the parameter Buffer size
  4. Compliancy
    • load mandatory DV fields using DV logic:
      • technical keys
      • business keys
      • satellite fields
      • temporal fields setting satellite record validity
      • audit mandatory fields
    • load other none-mandatory field as-is (pass-through attribute like batch-Id for meta-audit)

Assumptions

A very small number of assumptions is assumed:
  • Hub and Link must have primary key defined as unique sequence integer (support any DB data type used for storing integer)
  • Temporal validity of the Satellite rows are determined through a single temporal field (Date type) available in the incoming record stream (entrant hop)
  • Business keys must all be Not Nullable

Rules

The ETL flow design must respect these logical rules:
  • Hub can be loaded once its business key(s) are available in input record stream
  • Link can be loaded once all referred Hub primary keys are available in input record stream
  • Satellite can be loaded once referred hub (or Link) primary keys are available in input record stream (normally appended upstream by Load-Hub step)


USER GUIDE

ETL data flow

Data model is loaded by attaching Load Steps through hops :
  1. Load Hub
    • Connect Step through a hop containing all business/natural key(s) in input stream
    • Step will look-up business/natural key(s) and append the associated tech-key (Hub's PK)
      • When found: append the tech-key returned by DB
      • When not found: generate new key (using sequence method defined) and append it
    • Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
  2. Load Link
    • Connect Step through hop containing all tech keys of the relationship
    • Step will look-up Hub tech keys and append the associated Link tech-key (its PK)
      • When found: append the tech-key returned by DB
      • When not found: generate new key (using sequence method defined) and append it
    • Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
  3. Load Satellite
    • Connect Step through a hop containing the Hub or Link's PK tech-key, the attributes and the "From-Date" temporal attribute controlling satellite record's lifecycle
    • Step will load new satellite record based on different use-case:
      • Temporal Satellite with Idempotent=true: sat record is loaded unless there is an identical consecutive records (default)
      • Temporal Satellite with Idempotent=false: sat record is loaded unless irrespective of consecutive records
      • Static Satellite (no "From-Date" temporal attribute defined): sat record is loaded unless one already exist for the Hub

Setting common to all Step

General Properties:
PropertyDefinition
Step nameName of the Step must be unique within the Transformation
ConnectionDatabase connection to use
Hub/Link/Sat tableTarget table name to load
Buffer sizeThis determines the number of input rows processed at once. Too large value may generate Query lookup or batch insert causing JDBC driver error (typical values are of range 500 or more)
Audit-related Properties:
PropertyDefinition
Sys-creation Date/TimeThe column holding the Timestamp indicating when record was loaded into DB (leave empty when not used
Load Record SourceColumn holding the Audit record source information (leave empty if not used)
Record source valueValue to store in the column "Load Record source" (may be set by a variable substitution)

Setting specific to Step "Load Hub"

PropertyDefinition
Attribute MappingMapping between input stream field and database column
Attribute of type Business/Natural Keyfield(s) corresponding to the business key(s)
Other typepass-through field simply loaded as-is (useful for attribute like batch-id, etc.
PK Sequence-Id settingDefine the technical PK column and which sequence generation method to use

Setting specific to Step "Load Link"

PropertyDefinition
Attribute MappingMapping between input stream field and database column
Attribute of type "Relationship Key"field corresponding to the Relationship keys (i.e. each Hub's Primary key)
Other typepass-through field simply loaded as-is (useful for attribute like batch-id, etc.)
PK Sequence-Id settingDefine the technical PK column of the Link and which sequence generation method to use

Setting specific to Step "Load Sat"

PropertyDefinition
Idempotent transformationIdempotent ignores records having identical state at two consecutive time (all attributes are equal). Data Vault standard is Idempotent, but you may have different requirements
Attribute MappingMapping between input stream field and database column
Attribute of type "Foreign-Key to Hub"field representing the Hub's PK (typically provided by an upstream "Load Hub" step)
Attribute of type "From-Date Temporal"field representing the temporal attribute controlling the timeline of each Satellite record. Using an input field instead of a fixed adds more flexibility: if is appropriate, then we simply append it upstream, but other attribute could also be used (ex. when using file input, when using table input, etc..). You use the temporal attribute most adapted for your use-case
Attribute of type Normalfields recorded inside the Satellite. All these control sat record timeline (or lifecycle). Pass-through fields may only be added if their changing values do not impact Satellite lifecycle with regard to the Hub record (the ETL batch-id attribute is a valid example of this)


INSTALLATION

Pre-requisite

 PDI version 5.x

Download and Install

* Download latest archive/package
* Unzip it inside folder: ${PDI_HOME}/plugins/steps

Check installation

* new folder 'DV-loader' should now exist in: ${PDI_HOME}/plugins/steps
* Re-start PDI Spoon UI (${PDI_HOME}/spoon.sh or ${PDI_HOME}/spoon.bat)
* Create a PDI Transformation and add DV-loader steps found under category Experimental:
  • Load Hub
  • Load Link
  • Load Sat

Martin

Tuesday, October 04, 2011

A Source-to-Target self-documentation system

 
In the BI world, the corporate Data warehouse platform gets often overly complex quite rapidly.   The number of source components increases rapidly to reach nearly thousand of entities: 
-: consider 10-15 different OLTP’s sources each having roughly 10-20 entities worth of interest (tables and views)
-: consider also the few File-based sources each accompanied with one or more feed entities
-: now consider that each one of these source entities will normally result in many components created and stored downstream in various form and for various reason
After these considerations in mind, one can see the explosion of entities any average BI team manage and continue maintaining to keep evolving.  This can result in maintenance nightmare when the team is asked to do deep impact analysis of changes and updates done on existing entities.   As such it is important to keep and maintain some Source-to-Target metadata and documentation.
 
To understand the dependencies from downstream elements to upstream ones, we can store all source-to-target relationships using a form of Directed Graph.   Each node in the graph represents an entity either found in the DWH or outside in the source layers.  These entities can be in different format or type, for ex. in the DBMS they would be table or view, in the source layer they can be of type file, etc…   The directed edge of the Graph represents the dependency relationship between the source and the target entities.   
 
Various relational modelling techniques can be used to store such Directed Graph.  One of the most flexible way is to use one table to store the From-To relation (edge) and another table to store each entity node.  This was done here with the table STK_ENTITY_LINK having two FK relationship pointing to STK_ENTITY_NODE.
 
The rest of this presents such a model that can automatically be maintained and populated by a feeder (actually this was done using a custom-built Java feeder capable of finding dependencies from database metadata, from reading Pl/SQL code, from reading ETL tool metadata, etc.)
 
 
 
DATA MODEL :
clip_image003
 
DATA AUTOMATIC FEED :
The Data feeder is the process responsible in populating/refresh data model.  It provides this service through a metadata table FEEDER_META. A number of use-case are supported for various ways to trigger a feed job. One entry in FEEDER_META will generate one self-contained graph of Entity nodes. However multiple entries may generate multiple over-lapping graphs. It is up to the Data feeder to avoid duplicating entity Node by recognizing identical node through its URI.
Furthermore, the meta-info is generic enough to support potential other use-cases with certain constraints.
 
Key Issues to watch for is How to we define Domain:
- Can link one domain per entry in FEEDER_META, so all graph nodes will be associated to same domain.
  • Drawback= nodes in overlapping graph will inherit multiple domains ?
- Can link one domain only for the starting node
  • Drawback= another process will need to complement domain for other nodes and make some decision with Domain clashing.

STK_FEEDER_META
Column Definition Optional
meta_key Primary key N
Meta_name N
Domain_name Domain name of the entity nodes. To link the entire graph generated (or alternatively only the starting node) to the domain
--see issues to resolve
starting_node_name Name of the starting enity node N
starting_node_type View, KView, Table, File, N
starting_node_uri The unique id of the starting node N
starting_node_key Surrogate key linking to the starting node once feed process is complete Y
Feed_usecase Usecase name supported
Feed_source Can be package/table, view name, table.., depending on the Feed_usecase
Feed_extract_sql For use-case Ad-Hoc, used to fecth code to be parsed
Feed_conn_key Points to connection to be used for
Links_direction
Last_run_date N
Y


User-Case 1-View: Graph Feed starting from DB View entity
    • § Starting entity-node point : VIEW identification (db,schema,view)
    • § Ending entity-node point : First physical tables encounter
    • § Rules: Find Ancestors based on USER_DEPENDENCIES metadata table (recursively)
    • § Notes:
    • § Metadata fields used: connection info, View name
 
User-Case 2-Cube: Graph Feed starting from OLAP Cube
    • § Starting entity-node point : Cube identification (db,schema,view + cube_id)
    • § Ending entity-node point : First physical tables encounter
    • § Rules: Find Ancestors based on USER_DEPENDENCIES metadata table (recursively)
    • § Notes: We could also use other OLAP specialized metadata table
    • § Metadata fields used: connection info, cube id and name
 
User-Case 3-TablePackage: Graph Feed starting from Physical Table loaded by one PL/SQL package
    • § Starting entity-node point : Table identified (db,schema,table)
    • § Ending entity-node point : First physical tables encounter
    • § Rules: Find Ancestors based on USER_DEPENDENCIES metadata table. To exclude as dependencies the Physical Table identified, but to include all other objects dependencies (recursively), i.e. the process must not stop at View objects but find all base tables.
    • § Notes: We must assume that all other REFERENCED objects are ancestors of the identified Table
    • § Metadata fields used: connection info, Table name, Package name, domain
 
User-Case 4-TableAd-Hoc: Graph Feed starting from Physical Table loaded by some ad-hoc code (SQL or PL/SQL)
    • § Starting entity-node point : Table identified (db,schema,table)
    • § Ending entity-node point : All Parsing DB objects encountered
    • § Rules: Find Ancestors based on Parsing ad-hoc code fetched by process, and to other objects dependencies (recursively), i.e. the process must not stop at View objects but find all base tables.
    • § Notes:
    • § Metadata fields used: connection info, K-View name, connection string, code source location


Martin

p.s. For completeness, I include below all DDL needed for the physical data model
 
 
-------------------------- DDL for Data Model ----------------------
create table STK_ENTITY_DOMAIN(
domain_key NUMBER(9),
domain_name VARCHAR2(50), --
domain_desc VARCHAR2(100),
parent_domain_key NUMBER(9),
created_date DATE,
update_date DATE,
constraint pk_domain primary key (domain_key)
);
comment on table STK_ENTITY_DOMAIN is 'Business Domain for classfying entities into logical grouping';
comment on column STK_ENTITY_DOMAIN.domain_name is 'SIP, OC Efficiency, PMO, Prod Operational Source, etc.';
alter table STK_ENTITY_DOMAIN
add constraint fk_parent_domain foreign key (parent_domain_key) references STK_ENTITY_DOMAIN(domain_key);
create table STK_ENTITY_TYPE(
type_key NUMBER(9),
type_name VARCHAR2(50),
type_desc VARCHAR2(100),
type_cat VARCHAR2(50),
constraint pk_entity_type primary key (type_key)
);
comment on column STK_ENTITY_TYPE.type_name is 'View, Table, K-view, File, External table, Portal/Page, Chart (component), KPI, MOLAP cube...';
create table STK_ENTITY_NODE(
node_key NUMBER(9),
entity_name VARCHAR2(50),
entity_uri VARCHAR2(150),
entity_host VARCHAR2(50),
entity_schema VARCHAR2(50),
metadata_key NUMBER(9),
source_id NUMBER(9),
domain_key NUMBER(9) NOT NULL,
entity_type_key NUMBER(9) NOT NULL,
created_date DATE,
update_date DATE,
constraint pk_entity_node primary key (node_key)
);
comment on column STK_ENTITY_NODE.entity_host is 'host name (for file), database instance (for db objects)';
comment on column STK_ENTITY_NODE.entity_schema is 'schema name for db objects';
comment on column STK_ENTITY_NODE.source_id is 'Id used in source for entity (e.g. dashboard_id for KView, dimension_id,..)';
alter table STK_ENTITY_NODE
add constraint fk_node_domain foreign key (domain_key) references STK_ENTITY_DOMAIN(domain_key);
alter table STK_ENTITY_NODE
add constraint fk_entity_type foreign key (entity_type_key) references STK_ENTITY_TYPE(type_key);
create table STK_ENTITY_LINK(
from_node_key NUMBER(9) NOT NULL,
to_node_key NUMBER(9) NOT NULL,
cross_domain CHAR(1), --
created_date DATE,
update_date DATE,
constraint pk_link primary key (from_node_key,to_node_key)
);
comment on column STK_ENTITY_LINK.cross_domain is 'To flag link crossing two diff domain';
alter table STK_ENTITY_LINK
add constraint fk_from_link_node foreign key (from_node_key) references STK_ENTITY_NODE(node_key);
alter table STK_ENTITY_LINK
add constraint fk_to_link_node foreign key (to_node_key) references STK_ENTITY_NODE(node_key);
create table STK_CONNECTION (
connection_key NUMBER(9),
connection_type VARCHAR2(50),
driver_class VARCHAR2(150),
conn_string VARCHAR2(150),
login VARCHAR2(50),
password VARCHAR2(50),
constraint pk_connection primary key (connection_key)
);
comment on column STK_CONNECTION.conn_string is 'Depends on the type of connection, for DB this corresponds to jdbc_url';
-- job trigger shoud point to the db objects that can be used as a source
--should give the detail on how to construct the graph. (code in sql, pl/sql, )
--
create table STK_FEEDER_META(
meta_key NUMBER(9),
meta_name VARCHAR2(50),
domain_name VARCHAR2(50),
starting_node_name VARCHAR2(50),
starting_node_type VARCHAR2(50),
starting_node_uri VARCHAR2(150),
starting_node_key NUMBER(9),
feed_usecase VARCHAR2(50),
feed_source VARCHAR2(50),
feed_extract_sql VARCHAR2(2000),
feed_conn_key NUMBER(9),
links_direction VARCHAR2(10),
last_run_date DATE,
constraint pk_feeder_meta primary key (meta_key)
);
comment on column STK_FEEDER_META.feed_usecase is 'Values: ''1-View'';''2-KView'';''3-TablePackage;''4-TableAd-Hoc'';';
comment on column STK_FEEDER_META.feed_source is 'Can be package/table, view name, table..';
comment on column STK_FEEDER_META.feed_extract_sql is 'For use-case Ad-Hoc, used to fecth code to be parsed';
/* Additional constraint */
--1- to avoid node with self-link
alter table STK_ENTITY_LINK add constraint no_self_link check (from_node_key <> to_node_key);
--2- Should we have a constraint to avoid two-links between the same entities ?
--------------------------------------------------------------------------
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (1,'Table','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (2,'View','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (3,'KView','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (4,'Synonym','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (5,'Folder','OS');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (6,'File','OS');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (7,'Page','Dashboard');
commit;

Wednesday, May 07, 2008

EAI

Most BI architecture found within enterprise today does not exploit the integration capacity offered by more recent technology. These are more typically leveraged by other application systems in the operational area inside the enterprise IT infrastructure. But before giving some thoughts on the reason why (more on this in a coming blog), first I’d like to give more detail on the concept of application integration. Traditionally, this subject has been referred to as the TLA EAI.

Instead of relying on a more prose-like form, I thought to make use of the mind map diagram for describing the concept. I really like Mind map as a way to illustrate, navigate, and represent knowledge on specific subject.

So here is my mind map of EIA (using FreeMind ) .

1. static view:








2. Dynamic View at Mind42



The mind map is hosted by mindshare.com which is still in beta, so may not be available at all times!


Here is the same dynamic view embedded as iframe :





From my personal observation when it comes to EAI, BI applications typically limit its integration to:
  • For the upstream spectrum (i.e. fetching data in) : Bacth-file or database loading in otherwords ETL processing.
  • For the downsteam spectrum (i.e. disseminating the info out) : things usually are limited to sending flat files to external system at scheduled time. Here I only refer to integration between different system applications without including desktop user-type application, where things are different.


Martin