Monday, October 03, 2011

Cultural gap: Machine Learning vs Traditional stats


About a decade ago, L. Breiman published a somewhat controversial paper criticizing the research trends of the classical/traditional statistics community.  In a provocative tone, he compared the research done by the traditional stats community (what he called the "Data Modeling Culture") with that of the newer machine learning community (referred to as the "Algorithmic Modeling Culture").  Pretty much everyone now recognizes the contribution of the latter, especially for predictive analytical tasks, with very successful algorithms based on some form of Support Vector Machine (SVM) or on Random Forests (incidentally created by Breiman himself).  Unfortunately, Breiman died not long after publishing his paper, so we will never have the chance to hear his reflections on how things have turned out.

I cannot pretend to follow closely the progression of each culture* and to know whether the gap is shrinking or merely reshaping, or whether the "Data Modeling Culture" is catching up and keeps updating its mathematical grounding to better leverage the ever-increasing computing power and storage resources available today.
*as a side note: although I don't dedicate enough time to it, I always enjoy going back to the rigor and mathematical grounding of the research/academic world, especially after having dealt with too much of the imprecision, fuzziness and lack of clarity that often characterize so-called "real-world" projects ;-)
However, I guess that Breiman's paper helped each community in its own way and contributed to widening each other's point of view.  This view is supported by the growing number of researchers and books treating the newer algorithmic techniques from a more theoretical standpoint and with a more inferential perspective (examples include work from researchers involved in "statistical learning" such as Trevor Hastie, Robert Tibshirani, Jerome Friedman, Vladimir Cherkassky and Filip Mulier).

So, because of this importance, I decided to highlight some of the elements brought up in Breiman's article at the time.

Difference in Theoretical Approach

In essence, most problems typically involve one response variable y, generated through some form of data-generating process attributed to nature's functions.  In order to either 1) predict future or unseen y, or 2) better understand how nature operates in generating y, we use a set of input variables contained in the vector x, which we believe play some role in the generation of y.  Obviously, there are other, more complex considerations here, such as 1) the data-generating mechanism may originate from more than one underlying process, 2) the independence assumption on the predictors x, 3) co-dependency or inter-dependencies among the x's, 4) unobserved input variables z affecting the process, 5) causality vs. inference, etc.  But these aspects were not the focus of the original paper.

The "Data Modeling" culture relies on a so-called stochastic data model: a simplified model used to stand in for the nature functions involved in generating y.  These simplified data models are characterized by a chosen functional form along with a set of associated unknown parameters m (which, incidentally, can never be observed) and some form of random noise e added to account for the difference between the predicted value f(x|m,e) and the real outcome y.  The functions used can be as simple as linear regression, logistic regression or the Cox model, or more complex, like Bayesian methods combined with Markov Chain Monte Carlo.
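To make the vocabulary concrete, here is a minimal sketch of that workflow, assuming the numpy and statsmodels libraries: the "nature function" is simulated (so, unlike in practice, it is actually known), a linear form is chosen as the data model, and inference is done on its unknown parameters.

```python
# Minimal sketch of the "data modeling" workflow (assumes numpy + statsmodels).
# Nature's function is simulated here; in a real problem it is unknown.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=(n, 2))                                  # observed inputs x
y = 1.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(scale=1.0, size=n)    # y = nature(x) + noise

X = sm.add_constant(x)            # chosen functional form: linear in x
results = sm.OLS(y, X).fit()      # infer the unknown parameters m
print(results.params)             # estimated coefficients
print(results.conf_int())         # confidence intervals on the inferred parameters
```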

The "Algorithmic Modeling" culture considers the nature functions far too complex to be usefully and accurately transposed into a simplified model.  Its approach is therefore to treat nature as a black box, and its goal is to find an algorithmic function af(x) that operates on x to predict y.  The set of such algorithms is growing at a fast pace and includes decision trees, neural nets, SVMs, random forests, etc.
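By contrast, here is a minimal sketch of the algorithmic view, assuming scikit-learn and reusing the simulated x and y from the sketch above: no functional form for nature is postulated, only a black-box predictor af(x) is fit.

```python
# Minimal sketch of the "algorithmic modeling" view (assumes scikit-learn).
# Nature stays a black box; we only ask for a function af(x) that predicts y well.
from sklearn.ensemble import RandomForestRegressor

af = RandomForestRegressor(n_estimators=100, random_state=0)
af.fit(x, y)             # x, y from the simulation above
y_hat = af.predict(x)    # predictions, with no explicit model of nature's mechanism
```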

The main argument of each camp against the other could be summarized as follows:
    • "Data modeling" camp:  there is no way one can interpret a black box with the level of complexity seen in "algorithmic modeling", whereas the ease of interpretability of a data model allows useful information to be revealed.
    • "Algorithmic modeling" camp:  the goal is not interpretability but accurate information, and what level of confidence can we place in the information revealed when the data model performs less accurately than the "algorithmic model" under cross-validation?


Flaws in Data Modelling Culture


A great deal of the effort and focus in statistical research revolves around modeling concerns:
    • What model should be applied to the data?
    • What assumptions can be made about the model that supposedly generated the data?
    • What inference can be done on the model parameters, and what hypothesis tests apply?
    • What confidence intervals follow for the inferred parameters?
    • What distribution of the residual sum-of-squares, and what asymptotics, can be derived?
    • What is the significance level of the coefficients on the basis of the model? etc.
This focus on data modeling, as attractive as it is in terms of the mathematics involved, has the downside of shifting the emphasis of the analysis from nature's mechanism to the model's mechanism!  This becomes apparent when one later, perhaps unintentionally, confuses or replaces nature's real data-generating mechanism with the theoretical model.  Bad or even dangerous conclusions can then be drawn, especially when the chosen "model is a poor emulation of nature".

But the attractiveness and intuitiveness of models are too strong: what better can you ask for than a "simple and understandable picture of the relationships between input variables and responses" right in front of you?  This has bred habits and errors among some enthusiastic data modelers, leading them to treat their models as infallible representations of the truth!  Here are some examples of the mistakes involved, with varying levels of severity (a small sketch follows the list):
      • At worst, confidence in the model reaches the point where even basic checks such as residual analysis and goodness-of-fit tests are simply ignored!
      • The validity of the model's functional form is assumed correct or beyond question (e.g. simple linear regression applied without much consideration of whether linearity actually holds in the underlying data-generating process).
      • Goodness-of-fit assessed via the multiple correlation coefficient only measures how well the model fits the sample of data, and says nothing about the problem at hand.
      • Goodness-of-fit tests suffer from a number of issues that are, most of the time, simply ignored:
        1. they tend not to reject linearity unless the nonlinearity becomes extreme
        2. they test in every direction, so the test is rarely rejected unless the lack of fit is extreme
        3. they are often simply not applicable once the model has been tinkered with, e.g. by removing attributes or adding nonlinearity by combining attributes/variables
      • Residual analysis also has its own limitations:
        1. it cannot uncover lack of fit in more than four or five dimensions!
        2. interactions between variables can produce good-looking residual plots for a number of failed models
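To illustrate the goodness-of-fit point above, here is a small sketch, assuming numpy and scikit-learn: nature is made quadratic, yet a strictly linear model still reports a high in-sample R² (the squared multiple correlation), which says little about how well it emulates the real process.

```python
# Small sketch of the goodness-of-fit pitfall (assumes numpy + scikit-learn).
# Nature is quadratic here, yet the misspecified linear model still scores a
# high in-sample R^2 (typically well above 0.9 for this setup).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=(500, 1))
y = 2.0 * x[:, 0] + 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=500)   # nonlinear "nature"

lin = LinearRegression().fit(x, y)           # misspecified: strictly linear in x
print("in-sample R^2:", lin.score(x, y))     # only says how well the line fits this sample
residuals = y - lin.predict(x)               # a residual plot would reveal the curvature here,
                                             # but such checks break down past a few dimensions
```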

Next comes the issue of choice: what are we supposed to conclude from interpreting a single model picked out of potentially many hundreds or thousands?  A direct effect of using a simple yes/no answer to check model goodness is that, implicitly, a number of different models will be judged equally good!  "Data will point with almost equal emphasis on several possible models".
This is even more relevant nowadays given the sheer number of variables we capture, track and make available for analysis (in the era of big data, datasets are far more complex and high-dimensional).  Think of selecting a small subset of variables from the many available and count how many competing models you could have: from a pool of 50 variables, how many different variable-subset models would compete against each other?  Just looking for the best subset of 5 variables already makes it a 2.1-million-model problem: C(50,5) = 2,118,760!
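As a quick sanity check of that count, using only the Python standard library:

```python
# Number of distinct 5-variable subsets that can be drawn from 50 variables.
from math import comb

print(comb(50, 5))   # 2118760, i.e. roughly 2.1 million competing models
```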

The real alternative to goodness-of-fit is to consider the model's predictive accuracy.  This can only be done by applying the function to data from a test sample (unused and/or unseen) and comparing the model's predicted values with the actual measured values, i.e. by applying cross-validation!  The closer the two are, the better the model emulates the real process.  This unbiased cross-validation approach (i.e. an independent test sample) is the only way to prevent the model from being made to overfit the data at hand and from reporting an overly optimistic predictive accuracy.
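A minimal sketch of this idea, assuming scikit-learn and reusing x and y from the quadratic example above: the model is repeatedly fit on one part of the data and scored only on the part it never saw.

```python
# Minimal sketch of cross-validation (assumes scikit-learn): each fold's score
# comes from data the model was not fit on, giving an honest accuracy estimate.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

cv_r2 = cross_val_score(LinearRegression(), x, y, cv=5, scoring="r2")
print("held-out R^2 per fold:", cv_r2)
print("mean held-out R^2:", cv_r2.mean())
```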



Algorithmic Modeling Lessons

In this approach, the focus shifts from properties of the data model to properties of the algorithm.  These properties range from its "strength" as a predictor, to its convergence behaviour (for iterative algorithms), to the features or capabilities that make it a good predictor.
It is interesting to note that in machine learning theory, usually only one assumption is imposed on the data process: that the data is drawn i.i.d. from an unknown multivariate distribution!

The author goes on to describe in more detail two of these particularly good predictors (SVM and Random Forest) and compares them with more traditional procedures on real, well-known research data sets.  Not surprisingly, the results all favour the "algorithmic modeling" techniques, with sometimes large improvements in predictive accuracy.
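Breiman's benchmarks are not reproduced here; the following is merely an illustrative comparison in the same spirit, assuming scikit-learn and one of its bundled, well-known data sets, with every model scored by the same 5-fold cross-validation.

```python
# Illustrative comparison (not Breiman's original study): a traditional procedure
# vs two algorithmic ones, all evaluated with the same 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM":                 make_pipeline(StandardScaler(), SVC()),
    "random forest":       RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")
```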

I'm not going to present the details here, but let me just reiterate the incisive highlights of his findings:
    • “Higher predictive accuracy is associated with more reliable information about the underlying data mechanisms”
    • “Weak predictive accuracy can lead to questionable conclusions”
    • “Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism”

Also important are the lessons learned from the various research developments of the machine learning community.  They are reiterated here because their importance also resonates with the more traditional community.
  1. Rashomon ("the multiplicity of good models"):
    • Def: there is usually a multitude of different functions to pick from, all of which yield a similar error rate (as in our example above of picking 5 variables out of 50).
    • Effect: this implies a lot of instability in model selection, as a very minor perturbation of the input sample may yield a new model of completely different form (a small sketch of this instability follows this list).
    • Context in traditional methods: these are particularly sensitive to this issue, as it is often considered good practice to reduce the dimensionality of the data set by removing the less important covariates.
    • Context in machine learning methods: some are also sensitive (e.g. decision trees and neural nets); however, most have techniques to circumvent this, such as model aggregation (random forests), bagging or boosting.  Note that similar devices have also been applied to traditional methods, as in additive logistic regression.
  2. Occam ("the conflict between simplicity and accuracy"):
    • Def: in terms of predictive capacity, accuracy is generally in conflict with ease of interpretability (simplicity).  Intuitively, it is common sense that a simple model can hardly capture the intricacies of a complex natural process.
    • Effect: this favours more complex but less interpretable algorithmic models.
    • Context in traditional methods: one should be wary of extrapolating or conveying a whole set of information and interpretations simply by looking at a single model's functional form.
    • Context in machine learning methods: it is hopeless to try to understand the intricacies and limitations of the high-dimensional models common to SVMs, or to delve into the "tangled web that generated a plurality vote from over 100 trees" typical of random forests.
  3. Bellman ("the curse of dimensionality"):
    • Def: it has long been recognized as desirable to work in lower dimensionality (by limiting the number of attributes while trying to preserve as much information as possible) in order to avoid the adverse effects of very high dimensionality.  Now, most machine learning methods strive for high dimensionality and consider it a blessing.
    • Effect: Reducing dimensionality reduces the total information at hand.
    • Context in traditional methods: many traditional model checks, such as residual analysis, hit their limits and are not suited to settings with many dimensions.
    • Context in machine learning methods: SVMs, for example, welcome more dimensions, as this increases the likelihood of complete separability by a hyperplane (in a two-class outcome scenario); random forests also thrive on more variables without over-fitting.
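As referenced in the Rashomon item above, here is a rough sketch of that instability, assuming numpy and scikit-learn: a single decision tree refit on bootstrap-perturbed samples can elect a different "most important" variable from one resample to the next, whereas an aggregated model (a random forest) smooths this instability out.

```python
# Rough sketch of the Rashomon instability (assumes numpy + scikit-learn):
# refit a single tree on bootstrap resamples and watch its top variable move,
# then aggregate many trees (a random forest) to average the instability away.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

for trial in range(5):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap resample (minor perturbation)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
    print("trial", trial, "-> top variable:", int(np.argmax(tree.feature_importances_)))

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("forest's top variable:", int(np.argmax(forest.feature_importances_)))
```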

Martin
