Tuesday, October 04, 2011

A Source-to-Target self-documentation system

 
In the BI world, the corporate Data warehouse platform often becomes overly complex quite rapidly.  The number of source components quickly grows to nearly a thousand entities:
- consider 10-15 different OLTP sources, each having roughly 10-20 entities of interest (tables and views)
- consider also the few file-based sources, each accompanied by one or more feed entities
- now consider that each of these source entities will normally result in many components created and stored downstream, in various forms and for various reasons
With these considerations in mind, one can see the explosion of entities the average BI team must manage and keep maintaining as the platform evolves.  This can turn into a maintenance nightmare when the team is asked to do a deep impact analysis of changes and updates made to existing entities.  It is therefore important to keep and maintain some Source-to-Target metadata and documentation.
 
To understand the dependencies between downstream elements and upstream ones, we can store all source-to-target relationships as a form of Directed Graph.  Each node in the graph represents an entity found either in the DWH or outside in the source layers.  These entities can be of different formats or types: in the DBMS they would be tables or views, in the source layer they can be files, etc.  Each directed edge of the graph represents the dependency relationship between a source entity and a target entity.
 
Various relational modelling techniques can be used to store such a Directed Graph.  One of the most flexible ways is to use one table to store the From-To relation (the edges) and another table to store the entity nodes.  This is done here with the table STK_ENTITY_LINK having two FK relationships pointing to STK_ENTITY_NODE.
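As an illustration of the kind of impact analysis this model enables, here is a minimal Oracle query sketch that walks the link table upstream from a given target entity.  The bind variable :target_uri is mine, and it assumes edges are stored as from_node_key = source and to_node_key = target:

-- Sketch: list every upstream (source) entity that feeds a given target node.
select level as depth,
       n.entity_name,
       n.entity_schema,
       n.entity_uri
  from STK_ENTITY_LINK l,
       STK_ENTITY_NODE n
 where n.node_key = l.from_node_key
 start with l.to_node_key = (select node_key
                               from STK_ENTITY_NODE
                              where entity_uri = :target_uri)
connect by nocycle prior l.from_node_key = l.to_node_key;

Reversing the PRIOR clause gives the downstream view, i.e. all targets impacted by a change on a source entity.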
 
The rest of this post presents such a model, which can be automatically maintained and populated by a feeder (in practice this was done using a custom-built Java feeder capable of finding dependencies from database metadata, from reading PL/SQL code, from reading ETL tool metadata, etc.).
 
 
 
DATA MODEL :
[Data model diagram: see the DDL at the end of this post]
 
AUTOMATIC DATA FEED :
The Data feeder is the process responsible for populating and refreshing the data model.  It provides this service through a metadata table, STK_FEEDER_META.  A number of use-cases are supported, corresponding to the various ways a feed job can be triggered.  One entry in STK_FEEDER_META generates one self-contained graph of entity nodes; however, multiple entries may generate multiple overlapping graphs.  It is up to the Data feeder to avoid duplicating entity nodes by recognizing identical nodes through their URI.
Furthermore, the meta-info is generic enough to support other potential use-cases, with certain constraints.
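To make the de-duplication rule concrete, here is a minimal sketch of how the feeder could upsert a node keyed on its URI (the sequence STK_NODE_SEQ and the bind variables are my own assumptions, not part of the model below):

-- Sketch: insert the node only if its URI is not already known, otherwise just refresh it.
merge into STK_ENTITY_NODE n
using (select :entity_uri    as entity_uri,
              :entity_name   as entity_name,
              :entity_host   as entity_host,
              :entity_schema as entity_schema,
              :domain_key    as domain_key,
              :type_key      as entity_type_key
         from dual) src
   on (n.entity_uri = src.entity_uri)
when matched then
  update set n.update_date = sysdate
when not matched then
  insert (node_key, entity_name, entity_uri, entity_host, entity_schema,
          domain_key, entity_type_key, created_date, update_date)
  values (STK_NODE_SEQ.nextval, src.entity_name, src.entity_uri, src.entity_host,
          src.entity_schema, src.domain_key, src.entity_type_key, sysdate, sysdate);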
 
A key issue to watch for is how we define the Domain:
- We can link one domain per entry in STK_FEEDER_META, so all nodes of the graph are associated with the same domain.
  • Drawback: nodes belonging to overlapping graphs will inherit multiple domains.
- Or we can link one domain to the starting node only.
  • Drawback: another process will need to complete the domain of the other nodes and make some decisions when domains clash (see the flagging sketch below).
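Whichever option is retained, links whose two end nodes ended up in different domains can at least be flagged after a feed using the cross_domain column of STK_ENTITY_LINK; a minimal sketch, assuming each node carries a single domain_key as in the model below:

-- Sketch: flag every link whose two end nodes belong to different domains.
update STK_ENTITY_LINK l
   set l.cross_domain = 'Y'
 where exists (select 1
                 from STK_ENTITY_NODE f, STK_ENTITY_NODE t
                where f.node_key = l.from_node_key
                  and t.node_key = l.to_node_key
                  and f.domain_key <> t.domain_key);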

STK_FEEDER_META
Column             | Definition                                                                                    | Optional
meta_key           | Primary key                                                                                   | N
meta_name          |                                                                                               | N
domain_name        | Domain name of the entity nodes, used to link the entire generated graph (or alternatively only the starting node) to the domain -- see issues to resolve above |
starting_node_name | Name of the starting entity node                                                              | N
starting_node_type | View, KView, Table, File, ...                                                                 | N
starting_node_uri  | The unique id of the starting node                                                            | N
starting_node_key  | Surrogate key linking to the starting node once the feed process is complete                  | Y
feed_usecase       | Name of the supported use-case                                                                |
feed_source        | Can be a package/table, view name, table, etc., depending on the feed_usecase                 |
feed_extract_sql   | For the Ad-Hoc use-case, used to fetch the code to be parsed                                  |
feed_conn_key      | Points to the connection to be used                                                           |
links_direction    |                                                                                               |
last_run_date      |                                                                                               | N


Use-Case 1-View: Graph feed starting from a DB View entity
    • Starting entity-node point: VIEW identification (db, schema, view)
    • Ending entity-node point: first physical tables encountered
    • Rules: find ancestors based on the USER_DEPENDENCIES metadata table (recursively) -- see the query sketch after this list
    • Notes:
    • Metadata fields used: connection info, View name
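A minimal sketch of the ancestor lookup behind this use-case, assuming the feeder connects as the view owner and reads USER_DEPENDENCIES (for objects owned by other schemas, ALL_DEPENDENCIES or DBA_DEPENDENCIES would be needed instead; the bind variable is mine):

-- Sketch: walk USER_DEPENDENCIES upward from the starting view;
-- physical tables appear as rows with referenced_type = 'TABLE'.
select level as depth,
       d.referenced_owner,
       d.referenced_name,
       d.referenced_type
  from user_dependencies d
 start with d.name = :starting_view_name
        and d.type = 'VIEW'
connect by nocycle prior d.referenced_name = d.name
       and prior d.referenced_type = d.type;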
 
Use-Case 2-Cube: Graph feed starting from an OLAP Cube
    • Starting entity-node point: Cube identification (db, schema, view + cube_id)
    • Ending entity-node point: first physical tables encountered
    • Rules: find ancestors based on the USER_DEPENDENCIES metadata table (recursively)
    • Notes: we could also use other OLAP-specialized metadata tables
    • Metadata fields used: connection info, cube id and name
 
Use-Case 3-TablePackage: Graph feed starting from a Physical Table loaded by one PL/SQL package
    • Starting entity-node point: Table identification (db, schema, table)
    • Ending entity-node point: first physical tables encountered
    • Rules: find ancestors based on the USER_DEPENDENCIES metadata table.  The identified physical table is excluded as a dependency, but all other object dependencies are included (recursively), i.e. the process must not stop at View objects but find all base tables.
    • Notes: we must assume that all other REFERENCED objects are ancestors of the identified Table
    • Metadata fields used: connection info, Table name, Package name, domain
 
Use-Case 4-TableAd-Hoc: Graph feed starting from a Physical Table loaded by some ad-hoc code (SQL or PL/SQL)
    • Starting entity-node point: Table identification (db, schema, table)
    • Ending entity-node point: all DB objects encountered while parsing
    • Rules: find ancestors by parsing the ad-hoc code fetched by the process, and follow other object dependencies (recursively), i.e. the process must not stop at View objects but find all base tables.
    • Notes:
    • Metadata fields used: connection info, K-View name, connection string, code source location
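As an illustration only, the Feed_extract_sql entry for this use-case could hold a query that pulls the stored code to be parsed, for example from USER_SOURCE (the package name below is hypothetical):

-- Sketch: possible content of feed_extract_sql, returning the PL/SQL text to parse.
select s.text
  from user_source s
 where s.name = 'LOAD_SALES_FACT_PKG'   -- hypothetical package name
   and s.type = 'PACKAGE BODY'
 order by s.line;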


Martin

p.s. For completeness, I include below all DDL needed for the physical data model
 
 
-------------------------- DDL for Data Model ----------------------
create table STK_ENTITY_DOMAIN(
domain_key NUMBER(9),
domain_name VARCHAR2(50), --
domain_desc VARCHAR2(100),
parent_domain_key NUMBER(9),
created_date DATE,
update_date DATE,
constraint pk_domain primary key (domain_key)
);
comment on table STK_ENTITY_DOMAIN is 'Business Domain for classifying entities into logical grouping';
comment on column STK_ENTITY_DOMAIN.domain_name is 'SIP, OC Efficiency, PMO, Prod Operational Source, etc.';
alter table STK_ENTITY_DOMAIN
add constraint fk_parent_domain foreign key (parent_domain_key) references STK_ENTITY_DOMAIN(domain_key);
create table STK_ENTITY_TYPE(
type_key NUMBER(9),
type_name VARCHAR2(50),
type_desc VARCHAR2(100),
type_cat VARCHAR2(50),
constraint pk_entity_type primary key (type_key)
);
comment on column STK_ENTITY_TYPE.type_name is 'View, Table, K-view, File, External table, Portal/Page, Chart (component), KPI, MOLAP cube...';
create table STK_ENTITY_NODE(
node_key NUMBER(9),
entity_name VARCHAR2(50),
entity_uri VARCHAR2(150),
entity_host VARCHAR2(50),
entity_schema VARCHAR2(50),
metadata_key NUMBER(9),
source_id NUMBER(9),
domain_key NUMBER(9) NOT NULL,
entity_type_key NUMBER(9) NOT NULL,
created_date DATE,
update_date DATE,
constraint pk_entity_node primary key (node_key)
);
comment on column STK_ENTITY_NODE.entity_host is 'host name (for file), database instance (for db objects)';
comment on column STK_ENTITY_NODE.entity_schema is 'schema name for db objects';
comment on column STK_ENTITY_NODE.source_id is 'Id used in source for entity (e.g. dashboard_id for KView, dimension_id,..)';
alter table STK_ENTITY_NODE
add constraint fk_node_domain foreign key (domain_key) references STK_ENTITY_DOMAIN(domain_key);
alter table STK_ENTITY_NODE
add constraint fk_entity_type foreign key (entity_type_key) references STK_ENTITY_TYPE(type_key);
create table STK_ENTITY_LINK(
from_node_key NUMBER(9) NOT NULL,
to_node_key NUMBER(9) NOT NULL,
cross_domain CHAR(1), --
created_date DATE,
update_date DATE,
constraint pk_link primary key (from_node_key,to_node_key)
);
comment on column STK_ENTITY_LINK.cross_domain is 'To flag a link crossing two different domains';
alter table STK_ENTITY_LINK
add constraint fk_from_link_node foreign key (from_node_key) references STK_ENTITY_NODE(node_key);
alter table STK_ENTITY_LINK
add constraint fk_to_link_node foreign key (to_node_key) references STK_ENTITY_NODE(node_key);
create table STK_CONNECTION (
connection_key NUMBER(9),
connection_type VARCHAR2(50),
driver_class VARCHAR2(150),
conn_string VARCHAR2(150),
login VARCHAR2(50),
password VARCHAR2(50),
constraint pk_connection primary key (connection_key)
);
comment on column STK_CONNECTION.conn_string is 'Depends on the type of connection, for DB this corresponds to jdbc_url';
-- job trigger should point to the db objects that can be used as a source
-- should give the details on how to construct the graph (code in SQL, PL/SQL, ...)
--
create table STK_FEEDER_META(
meta_key NUMBER(9),
meta_name VARCHAR2(50),
domain_name VARCHAR2(50),
starting_node_name VARCHAR2(50),
starting_node_type VARCHAR2(50),
starting_node_uri VARCHAR2(150),
starting_node_key NUMBER(9),
feed_usecase VARCHAR2(50),
feed_source VARCHAR2(50),
feed_extract_sql VARCHAR2(2000),
feed_conn_key NUMBER(9),
links_direction VARCHAR2(10),
last_run_date DATE,
constraint pk_feeder_meta primary key (meta_key)
);
comment on column STK_FEEDER_META.feed_usecase is 'Values: ''1-View'';''2-Cube'';''3-TablePackage'';''4-TableAd-Hoc'';';
comment on column STK_FEEDER_META.feed_source is 'Can be package/table, view name, table..';
comment on column STK_FEEDER_META.feed_extract_sql is 'For use-case Ad-Hoc, used to fetch code to be parsed';
/* Additional constraint */
--1- to avoid node with self-link
alter table STK_ENTITY_LINK add constraint no_self_link check (from_node_key <> to_node_key);
--2- Should we have a constraint to avoid two-links between the same entities ?
--------------------------------------------------------------------------
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (1,'Table','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (2,'View','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (3,'KView','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (4,'Synonym','Database');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (5,'Folder','OS');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (6,'File','OS');
insert into STK_ENTITY_TYPE(type_key,type_name,type_cat) values (7,'Page','Dashboard');
commit;

Monday, October 03, 2011

Cultural gap: Machine Learning vs Traditional stats


About a decade ago, a somewhat controversial paper was published by L. Breiman criticizing the research trend followed by the classical/traditional stats community.  In a provoking tone, he compared the research done by the traditional stats community (what he called the “Data Modeling Culture”) with that of the newer machine learning community (referred to as the “Algorithmic Modeling Culture”).  Pretty much everyone now recognizes the contribution of the latter, especially in the realm of predictive analytical tasks, with very successful algorithms based on some form of Support Vector Machines (SVM) or the Random Forest (incidentally created by Breiman himself).  Unfortunately, Breiman died not long after his paper, so we will never have the chance to hear his reflections on how things have turned out nowadays.

I cannot pretend to follow closely the progression of each culture* and know whether the gap is shrinking or reshaping, or whether the “Data Modeling Culture” is catching up and keeps updating its mathematical grounding to better leverage the ever-increasing computing power and storage resources available today.
*as a side note: although I don’t dedicate enough time to it, I always enjoy going back to the rigor and mathematical grounding of the research/academic world, especially after having dealt with too much of the imprecision, fuzziness, and lack of clarity that often characterize so-called “real-world” projects ;-)
However, I guess that Breiman’s paper helped each community in its own way and contributed to widening each other’s point of view.  This impression is supported by the growing number of researchers and books treating the newer algorithmic techniques from a more theoretical standpoint and with a more inferential perspective (examples include work from researchers involved in “statistical learning”, such as Trevor Hastie, Robert Tibshirani, Jerome Friedman, Vladimir Cherkassky and Filip Mulier).

Given this importance, I decided to highlight some of the elements brought up at the time in Breiman’s article.

Difference in Theoretical Approach

In essence, most problems typically involve one response variable y, which is generated through some form of data-generation process attributed to nature's functions.  In order to either 1) predict future or unseen y, or 2) better understand how nature operates in generating y, we use a set of input variables contained in a vector x that we believe play some role in the generation of y.  Obviously, there are other, more complex considerations here, such as: 1) the data-generation mechanism may originate from more than one underlying process, 2) the independence assumption on the predictors x, 3) co-dependencies or inter-dependencies between the x’s, 4) unobserved input variables z affecting the process, 5) causality vs inference, etc.  But these aspects were not the focus of the original paper.

The “Data modelling” culture relies on a so-called stochastic data model: a simplified model used in place of the nature functions involved in generating y.  These simplified data models are characterized by a chosen functional form along with its set of associated unknown parameters (which, by the way, can never be observed) and some form of random noise (e) added to explain the difference between the predicted value f(x|m,e) and the real outcome y.  The function used can be as simple as linear regression, logistic regression or the Cox model, or more complex like Bayesian methods combined with Markov Chain Monte Carlo.

The “Algorithmic modelling” culture considers the nature functions far too complex to be usefully and accurately transposed into a simplified model.  As such, its approach is to treat nature as a black box, and its goal is to find an algorithmic function af(x) that operates on x to predict y.  The set of algorithms is growing at a fast pace and includes things like decision trees, neural nets, SVMs, random forests, etc.
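A minimal formalization of the two framings (the notation and the loss function L are mine, not Breiman's): the data modelling camp fits the parameters of an assumed functional form, while the algorithmic camp searches a rich function class for the best predictor.

% Data modelling: assume a stochastic model and estimate its parameters.
y = f(x;\theta) + \varepsilon, \qquad
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} L\big(y_i, f(x_i;\theta)\big)

% Algorithmic modelling: treat nature as a black box and search a function class F.
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big)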

The main argument of each camp against the other could be summarized in these terms:
    • “Data modeling” camp: there is no way one can interpret a black box with the level of complexity seen in “Algorithmic modelling”, whereas the ease of interpretability of a data model allows useful information to be revealed.
    • “Algorithmic modeling” camp: the goal is not interpretability but accurate information, and what is the level of confidence in the information revealed when the data model performs less accurately than “Algorithmic modelling” under cross-validation?


Flaws in Data Modelling Culture


A great deal of effort and focus in statistical research revolves around modelling concerns:
- What model should be applied to the data?  What assumptions can be made about the model that supposedly generated the data?  What inference can be done on the model parameters?  What hypothesis tests should be applied?  What confidence intervals follow for the inferred parameters?  What distribution of the residual sum-of-squares and what asymptotics can be derived?  What is the significance level of the coefficients on the basis of the model?  etc.
This focus on data modelling, as attractive as it is in terms of the mathematics involved, has the downside of shifting the emphasis of the analysis from Nature’s mechanism to the Model’s mechanism!  This becomes apparent when one later, unintentionally, confuses or replaces the real mechanism of nature that generated the data with the theoretical model.  Bad or even dangerous conclusions can then be drawn, especially when the chosen “model is a poor emulation of nature”.

But the attractiveness and intuitiveness of the Model are too strong: what more can you ask for than having a “simple and understandable picture of the relationships between input variables and responses” right in front of you?  This has resulted in many habits or errors driven by the enthusiasm of some model-fitting data modellers, leading them to regard their models as infallible and as representing the truth!  Here are some examples of the mistakes or errors involved (with different levels of severity):
      • At worst, confidence in the Model reaches the point where even basic tests such as residual analysis and goodness-of-fit tests are simply ignored!
      • The validity of the form of the model is assumed to be right or not questionable (e.g. simple linear regression applied without much consideration of whether linearity exists between the variables in the underlying data-generating process).
      • Goodness-of-fit tests based on the multiple correlation coefficient only measure how well the model fits the sample of data, which has nothing to do with the problem at hand.
      • Goodness-of-fit tests suffer from a number of issues that are, most of the time, simply ignored:
        1. they tend not to reject linearity unless nonlinearity becomes extreme
        2. they test in every direction, leading the test to be rarely rejected unless the lack of fit is extreme
        3. they are often simply not applicable when model tinkering is done, such as removing attributes or adding nonlinearity by combining attributes/variables
      • Residual analysis also has its own limitations:
        1. it cannot uncover lack of fit in more than four to five dimensions!
        2. interactions between variables produce good-looking residual plots for a number of failed models

Next comes the issue of choice: what are we supposed to expect from interpreting a single model out of potentially many hundreds or thousands!  A direct effect of using a simple yes/no answer to check model goodness is that, implicitly, a number of different models will be equally good!  “Data will point with almost equal emphasis on several possible models”.
This is even more relevant nowadays with the sheer number of variables we capture, track and make available for analysis (in the era of Big Data, datasets are much more complex and highly dimensional).  Think of sub-selecting a small number of variables from the many available and see how many competing models we could have: from a pool of 50 variables, how many different variable-subset models would compete against each other?  Assuming we are just looking for the best 5 variables, it becomes a 2.1-million problem --> C(50,5)!
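The arithmetic behind that 2.1 million figure:

\binom{50}{5} = \frac{50!}{5!\,45!} = \frac{50 \cdot 49 \cdot 48 \cdot 47 \cdot 46}{120} = 2\,118\,760 \approx 2.1 \times 10^{6}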

The real alternative to goodness-of-fit is to consider the model's predictive accuracy.  This can only be done by applying the function to data from a test sample (unused and/or unseen) and comparing the model's predicted values with the actual measured values, i.e. applying cross-validation!  The closer the two are, the better the model emulates the real process.  This unbiased cross-validation approach (i.e. an independent test sample) is the only way to avoid the model being made to overfit the data at hand and producing overly optimistic predictive accuracy.
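Put as a formula (notation mine): with a test sample T kept completely apart from the data used to fit the model, the honest estimate of accuracy is just the average loss over T.

\widehat{\mathrm{Err}} \;=\; \frac{1}{|T|} \sum_{(x_i,\,y_i)\,\in\,T} L\big(y_i, \hat{f}(x_i)\big),
\qquad T \cap \text{training sample} = \varnothing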



Algorithmic Modeling Lessons

In this approach, the focus is shifted from aspects of data modelling to the properties of the algorithms.  These properties range from their “strength” as predictors, to their convergence behaviour (for iterative types of algorithms), to the features or capabilities that make them good predictors.
It is interesting to note that in machine learning theory, usually only one assumption is imposed on the data process: that the data are drawn i.i.d. from an unknown multivariate distribution!

The author goes on to describe in more detail two of these particularly good predictors (SVM and Random Forest) and compares them with traditional procedures on real and well-known research data sets.  Not surprisingly, the results all favour the “Algorithmic modelling” techniques, with sometimes large improvements in predictive accuracy.

I’m not going to present the details here, but let me just re-iterate the incisive highlights of his findings:
    • “Higher predictive accuracy is associated with more reliable information about the underlying data mechanisms”
    • “Weak predictive accuracy can lead to questionable conclusions”
    • “Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism”

Also important are the lessons learnt from the various research developments made by the machine learning community.  These lessons are re-iterated here as their importance also resonates with the more traditional community.
  1. Rashomon “the multiplicity of good models” :
    • Def: There are usually a multitude of different functions to pick from, all of which would yield a similar error rate (as expressed in our example above with 5 variables taken out of 50).
    • Effect: this implies much instability in model selection, as a very minor perturbation of the sample input data may yield a new model of completely different form.
    • Context in traditional methods: these are particularly sensitive to this issue, as it is often considered good practice to reduce the dimensionality of the data set by removing some of the less important covariates.
    • Context in machine learning methods: some are also sensitive (e.g. decision trees and neural nets); however most have techniques to circumvent this, such as model aggregation (random forest), bagging or boosting.  Note that similar devices have also been applied to traditional methods, like additive logistic regression.
  2. Occam “the conflict between simplicity and accuracy” :
    • Def: In terms of predictive capacity, accuracy is generally in conflict with ease of interpretability (simplicity).  Intuitively, it is common sense that a simple model can hardly capture the intricacies of a complex natural process.
    • Effect: this favours more complex, but less interpretable, models.
    • Context in traditional methods: one should be wary of extrapolating or conveying a whole set of information and interpretation simply by looking at the functional form of a single model.
    • Context in machine learning methods: it is impractical to try to understand the intricacies and limitations of the high-dimensional models common with SVM, or to delve into the “tangled web that generated a plurality vote from over 100 trees” common with Random Forest.
  3. Bellman “the curse of dimensionality” :
    • Def: It has long been recognized as desirable to work with smaller dimensionality (by limiting the number of attributes while trying to preserve as much information as possible) in order to avoid some of the adverse effects of very high dimensionality.  Now, most machine learning methods thrive on high dimensionality and consider it a blessing.
    • Effect: reducing dimensionality reduces the total amount of information at hand.
    • Context in traditional methods: many traditional model tests, like residual analysis, are not suited to many dimensions.
    • Context in machine learning methods: SVM, for example, thrives on more dimensions, as this increases the likelihood of complete hyperplane separability (applicable in a two-class outcome scenario); Random Forest also thrives on more variables without over-fitting.

Martin