Tuesday, October 04, 2011

A Source-to-Target self-documentation system

 
In the BI world, the corporate data warehouse platform often becomes overly complex quite rapidly. The number of source components increases quickly and can reach close to a thousand entities:
- consider 10-15 different OLTP sources, each having roughly 10-20 entities of interest (tables and views)
- consider also the few file-based sources, each accompanied by one or more feed entities
- now consider that each of these source entities will normally result in many components created and stored downstream, in various forms and for various reasons
With these considerations in mind, one can see the explosion of entities the average BI team must manage and keep maintaining as the platform evolves. This can turn into a maintenance nightmare when the team is asked to do a deep impact analysis of changes and updates to existing entities. It is therefore important to keep and maintain some source-to-target metadata and documentation.
 
To understand the dependencies from downstream elements to upstream ones, we can store all source-to-target relationships using a form of directed graph. Each node in the graph represents an entity, found either in the DWH or outside in the source layers. These entities can be of different formats or types: for example, in the DBMS they would be tables or views, while in the source layer they can be of type file, etc. Each directed edge of the graph represents the dependency relationship between a source and a target entity.
 
Various relational modelling techniques can be used to store such a directed graph. One of the most flexible ways is to use one table to store the From-To relations (the edges) and another table to store the entity nodes. This is done here with the table STK_ENTITY_LINK, which has two FK relationships pointing to STK_ENTITY_NODE.
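This edge-plus-node design can be sketched and queried as follows. This is a minimal SQLite stand-in for the Oracle model (table names follow the model, but the columns are trimmed and all entity names and URIs are illustrative); the recursive query performs the downstream impact analysis mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified versions of STK_ENTITY_NODE and STK_ENTITY_LINK
cur.executescript("""
CREATE TABLE stk_entity_node (
    node_key    INTEGER PRIMARY KEY,
    entity_name TEXT,
    entity_uri  TEXT UNIQUE
);
CREATE TABLE stk_entity_link (
    from_node_key INTEGER REFERENCES stk_entity_node,
    to_node_key   INTEGER REFERENCES stk_entity_node,
    PRIMARY KEY (from_node_key, to_node_key),
    CHECK (from_node_key <> to_node_key)
);
""")
# A tiny graph: source table -> staging view -> reporting cube
cur.executemany("INSERT INTO stk_entity_node VALUES (?,?,?)", [
    (1, "SRC_ORDERS",   "db://oltp/sales/src_orders"),
    (2, "V_STG_ORDERS", "db://dwh/stg/v_stg_orders"),
    (3, "CUBE_SALES",   "olap://dwh/cubes/sales"),
])
cur.executemany("INSERT INTO stk_entity_link VALUES (?,?)", [(1, 2), (2, 3)])

# Impact analysis: walk the edges downstream from SRC_ORDERS
cur.execute("""
WITH RECURSIVE downstream(node_key) AS (
    SELECT node_key FROM stk_entity_node WHERE entity_name = 'SRC_ORDERS'
    UNION
    SELECT l.to_node_key
    FROM stk_entity_link l JOIN downstream d ON l.from_node_key = d.node_key
)
SELECT entity_name FROM stk_entity_node
WHERE node_key IN (SELECT node_key FROM downstream)
  AND entity_name <> 'SRC_ORDERS'
ORDER BY node_key
""")
impacted = [row[0] for row in cur.fetchall()]
print(impacted)  # -> ['V_STG_ORDERS', 'CUBE_SALES']
```

On Oracle the same traversal can be written with CONNECT BY or a recursive subquery factoring clause; the point is that one edge table keeps impact analysis to a single recursive query.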
 
The rest of this post presents such a model, which can be maintained and populated automatically by a feeder (here this was done with a custom-built Java feeder capable of discovering dependencies from database metadata, from reading PL/SQL code, from reading ETL tool metadata, etc.).
 
 
 
DATA MODEL :
(data model diagram)
 
DATA AUTOMATIC FEED :
The data feeder is the process responsible for populating and refreshing the data model. It is driven by the metadata table STK_FEEDER_META. A number of use cases are supported as different ways to trigger a feed job. One entry in STK_FEEDER_META generates one self-contained graph of entity nodes; multiple entries, however, may generate multiple overlapping graphs. It is up to the data feeder to avoid duplicating entity nodes by recognizing identical nodes through their URI.
Furthermore, the meta-info is generic enough to support other potential use cases, with certain constraints.
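The dedup-by-URI rule can be sketched as follows; `get_or_create_node` and the dict are illustrative stand-ins for a keyed lookup against STK_ENTITY_NODE.entity_uri, so that overlapping graphs end up sharing the same node rows.

```python
def get_or_create_node(nodes_by_uri, name, uri, node_type):
    """Return the existing node for this URI, or register a new one.

    `nodes_by_uri` plays the role of a lookup on STK_ENTITY_NODE keyed
    on entity_uri: two feed jobs touching the same entity reuse one node.
    """
    node = nodes_by_uri.get(uri)
    if node is None:
        node = {"node_key": len(nodes_by_uri) + 1,
                "entity_name": name, "entity_uri": uri, "type": node_type}
        nodes_by_uri[uri] = node
    return node

nodes = {}
a = get_or_create_node(nodes, "V_STG_ORDERS", "db://dwh/stg/v_stg_orders", "View")
b = get_or_create_node(nodes, "V_STG_ORDERS", "db://dwh/stg/v_stg_orders", "View")
print(a is b, len(nodes))  # -> True 1
```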
 
A key issue to resolve is how we assign the Domain:
- Link one domain per entry in STK_FEEDER_META, so all graph nodes are associated with the same domain.
  • Drawback: nodes in overlapping graphs will inherit multiple domains.
- Link one domain only to the starting node.
  • Drawback: another process will need to fill in the domain for the other nodes and make decisions when domains clash.

STK_FEEDER_META
Column             | Definition                                                                 | Optional
meta_key           | Primary key                                                                | N
meta_name          |                                                                            | N
domain_name        | Domain name of the entity nodes; links the entire generated graph (or alternatively only the starting node) to the domain -- see issues to resolve |
starting_node_name | Name of the starting entity node                                           | N
starting_node_type | View, KView, Table, File, ...                                              | N
starting_node_uri  | The unique id of the starting node                                         | N
starting_node_key  | Surrogate key linking to the starting node once the feed process is done   | Y
feed_usecase       | Name of the supported use case                                             |
feed_source        | Can be package/table, view name, table, etc., depending on feed_usecase    |
feed_extract_sql   | For the Ad-Hoc use case, used to fetch the code to be parsed               |
feed_conn_key      | Points to the connection to be used for                                    |
links_direction    |                                                                            |
last_run_date      |                                                                            | N


Use-Case 1-View: Graph Feed starting from a DB View entity
    • Starting entity-node point: VIEW identification (db, schema, view)
    • Ending entity-node point: first physical tables encountered
    • Rules: find ancestors based on the USER_DEPENDENCIES metadata table (recursively)
    • Notes:
    • Metadata fields used: connection info, view name
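The recursive ancestor walk of this use case can be sketched like so. Here the dictionary stands in for the NAME → (REFERENCED_NAME, REFERENCED_TYPE) rows the feeder would actually fetch from USER_DEPENDENCIES, and all object names are made up.

```python
# Mock of USER_DEPENDENCIES rows: object -> [(referenced_name, referenced_type)]
USER_DEPENDENCIES = {
    "V_SALES_RPT":  [("V_SALES_BASE", "VIEW")],
    "V_SALES_BASE": [("ORDERS", "TABLE"), ("CUSTOMERS", "TABLE")],
}

def find_ancestors(obj, deps, seen=None):
    """Recursively collect (from, to) edges, stopping at physical tables
    (use-case 1: the ending nodes are the first tables encountered)."""
    if seen is None:
        seen = set()
    edges = []
    for ref_name, ref_type in deps.get(obj, []):
        if (obj, ref_name) in seen:   # guard against revisiting an edge
            continue
        seen.add((obj, ref_name))
        edges.append((ref_name, obj))  # directed: source feeds target
        if ref_type != "TABLE":        # recurse until a base table
            edges += find_ancestors(ref_name, deps, seen)
    return edges

print(find_ancestors("V_SALES_RPT", USER_DEPENDENCIES))
# -> [('V_SALES_BASE', 'V_SALES_RPT'), ('ORDERS', 'V_SALES_BASE'),
#     ('CUSTOMERS', 'V_SALES_BASE')]
```

The `seen` set is what keeps overlapping or cyclic dependency metadata from looping forever; the real feeder would insert each returned edge into STK_ENTITY_LINK.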
 
Use-Case 2-Cube: Graph Feed starting from an OLAP Cube
    • Starting entity-node point: cube identification (db, schema, view + cube_id)
    • Ending entity-node point: first physical tables encountered
    • Rules: find ancestors based on the USER_DEPENDENCIES metadata table (recursively)
    • Notes: we could also use other OLAP-specialized metadata tables
    • Metadata fields used: connection info, cube id and name
 
Use-Case 3-TablePackage: Graph Feed starting from a Physical Table loaded by one PL/SQL package
    • Starting entity-node point: table identification (db, schema, table)
    • Ending entity-node point: first physical tables encountered
    • Rules: find ancestors based on the USER_DEPENDENCIES metadata table. Exclude the identified physical table itself as a dependency, but include all other object dependencies (recursively), i.e. the process must not stop at view objects but find all base tables.
    • Notes: we must assume that all other REFERENCED objects are ancestors of the identified table
    • Metadata fields used: connection info, table name, package name, domain
 
Use-Case 4-TableAd-Hoc: Graph Feed starting from a Physical Table loaded by some ad-hoc code (SQL or PL/SQL)
    • Starting entity-node point: table identification (db, schema, table)
    • Ending entity-node point: all DB objects encountered while parsing
    • Rules: find ancestors by parsing the ad-hoc code fetched by the process, plus other object dependencies (recursively), i.e. the process must not stop at view objects but find all base tables.
    • Notes:
    • Metadata fields used: connection info, K-View name, connection string, code source location
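For this use case the feeder must pull table names out of fetched code. The post's feeder used a custom Java parser for this; below is only a naive regex sketch of the idea, which a real parser would have to extend to handle subqueries, comments, quoted identifiers, aliases and DML targets.

```python
import re

def referenced_tables(sql):
    """Naively extract the names that follow FROM or JOIN keywords.
    Deliberately simplistic: real ad-hoc SQL/PL-SQL needs a proper parser."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.$#]*)", re.IGNORECASE)
    return sorted({m.group(1).upper() for m in pattern.finditer(sql)})

adhoc = """
INSERT INTO f_sales
SELECT o.order_id, c.name
FROM stg.orders o
JOIN stg.customers c ON c.cust_id = o.cust_id
"""
print(referenced_tables(adhoc))  # -> ['STG.CUSTOMERS', 'STG.ORDERS']
```

Each extracted name then becomes a candidate ancestor node, to be resolved recursively against the dictionary until only base tables remain.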


Martin

p.s. For completeness, I include below all DDL needed for the physical data model
 
 
-------------------------- DDL for Data Model ----------------------
create table STK_ENTITY_DOMAIN(
domain_key NUMBER(9),
domain_name VARCHAR2(50), --
domain_desc VARCHAR2(100),
parent_domain_key NUMBER(9),
created_date DATE,
update_date DATE,
constraint pk_domain primary key (domain_key)
);
comment on table STK_ENTITY_DOMAIN is 'Business Domain for classifying entities into logical grouping';
comment on column STK_ENTITY_DOMAIN.domain_name is 'SIP, OC Efficiency, PMO, Prod Operational Source, etc.';
alter table STK_ENTITY_DOMAIN
add constraint fk_parent_domain foreign key (parent_domain_key) references STK_ENTITY_DOMAIN(domain_key);
create table STK_ENTITY_TYPE(
type_key NUMBER(9),
type_name VARCHAR2(50),
type_desc VARCHAR2(100),
type_cat VARCHAR2(50),
constraint pk_entity_type primary key (type_key)
);
comment on column STK_ENTITY_TYPE.type_name is 'View, Table, K-view, File, External table, Portal/Page, Chart (component), KPI, MOLAP cube...';
create table STK_ENTITY_NODE(
node_key NUMBER(9),
entity_name VARCHAR2(50),
entity_uri VARCHAR2(150),
entity_host VARCHAR2(50),
entity_schema VARCHAR2(50),
metadata_key NUMBER(9),
source_id NUMBER(9),
domain_key NUMBER(9) NOT NULL,
entity_type_key NUMBER(9) NOT NULL,
created_date DATE,
update_date DATE,
constraint pk_entity_node primary key (node_key)
);
comment on column STK_ENTITY_NODE.entity_host is 'host name (for file), database instance (for db objects)';
comment on column STK_ENTITY_NODE.entity_schema is 'schema name for db objects';
comment on column STK_ENTITY_NODE.source_id is 'Id used in source for entity (e.g. dashboard_id for KView, dimension_id,..)';
alter table STK_ENTITY_NODE
add constraint fk_node_domain foreign key (domain_key) references STK_ENTITY_DOMAIN(domain_key);
alter table STK_ENTITY_NODE
add constraint fk_entity_type foreign key (entity_type_key) references STK_ENTITY_TYPE(type_key);
create table STK_ENTITY_LINK(
from_node_key NUMBER(9) NOT NULL,
to_node_key NUMBER(9) NOT NULL,
cross_domain CHAR(1), --
created_date DATE,
update_date DATE,
constraint pk_link primary key (from_node_key,to_node_key)
);
comment on column STK_ENTITY_LINK.cross_domain is 'To flag link crossing two diff domain';
alter table STK_ENTITY_LINK
add constraint fk_from_link_node foreign key (from_node_key) references STK_ENTITY_NODE(node_key);
alter table STK_ENTITY_LINK
add constraint fk_to_link_node foreign key (to_node_key) references STK_ENTITY_NODE(node_key);
create table STK_CONNECTION (
connection_key NUMBER(9),
connection_type VARCHAR2(50),
driver_class VARCHAR2(150),
conn_string VARCHAR2(150),
login VARCHAR2(50),
password VARCHAR2(50),
constraint pk_connection primary key (connection_key)
);
comment on column STK_CONNECTION.conn_string is 'Depends on the type of connection, for DB this corresponds to jdbc_url';
-- job trigger should point to the db objects that can be used as a source
--should give the detail on how to construct the graph. (code in sql, pl/sql, )
--
create table STK_FEEDER_META(
meta_key NUMBER(9),
meta_name VARCHAR2(50),
domain_name VARCHAR2(50),
starting_node_name VARCHAR2(50),
starting_node_type VARCHAR2(50),
starting_node_uri VARCHAR2(150),
starting_node_key NUMBER(9),
feed_usecase VARCHAR2(50),
feed_source VARCHAR2(50),
feed_extract_sql VARCHAR2(2000),
feed_conn_key NUMBER(9),
links_direction VARCHAR2(10),
last_run_date DATE,
constraint pk_feeder_meta primary key (meta_key)
);
comment on column STK_FEEDER_META.feed_usecase is 'Values: ''1-View'';''2-KView'';''3-TablePackage'';''4-TableAd-Hoc'';';
comment on column STK_FEEDER_META.feed_source is 'Can be package/table, view name, table..';
comment on column STK_FEEDER_META.feed_extract_sql is 'For use-case Ad-Hoc, used to fetch code to be parsed';
/* Additional constraint */
--1- to avoid node with self-link
alter table STK_ENTITY_LINK add constraint no_self_link check (from_node_key <> to_node_key);
--2- Should we have a constraint to avoid two-links between the same entities ?
--------------------------------------------------------------------------
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (1,'Table','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (2,'View','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (3,'KView','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (4,'Synonym','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (5,'Folder','OS');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (6,'File','OS');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (7,'Page','Dashboard');
commit;
