Sunday, September 03, 2006

Data mining background

Data mining has its roots, among other things, in statistics and machine learning. To generalize greatly, people tend to view data mining performance very differently depending on their background: those with a statistical background rate performance on statistical significance, inference power and score, whereas computer scientists tend to measure performance on algorithmic efficiency and scalability. I have come to realize, however, that the two approaches really are two sides of the same coin, and this is reflected in the most recent scientific literature.

Various definitions of data mining can be found in the literature, but I personally prefer the more academic point of view to the one commonly marketed by vendors. Here are a few explanatory excerpts taken from the excellent book « Principles of Data Mining » by David Hand, Heikki Mannila and Padhraic Smyth (MIT Press), which seems to be one of the few data mining books respected inside the statistical community.


It (data mining) is recognized as a set of well-defined procedures that take data as input and produce output in the form of models (a summarized, descriptive form of the data as a whole) or patterns (a descriptive form of some local phenomenon occurring in a fraction of the data). A well-defined procedure contrasts with a computational method, which is not guaranteed to terminate after a finite number of steps.

Data mining is concerned with building empirical models that are not based on some underlying theory about the mechanism through which the data arose, but rather consist of a description of the observed data. In this sense, a good model is qualified as « generative » : data generated according to the model will share the same characteristics as the real data from which the model was built.
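
As a minimal illustration of this « generative » idea (my own sketch with made-up numbers, not an example from the book): summarize a sample by its mean and standard deviation, then generate new data from that fitted Gaussian model.

    import random
    import statistics

    # Hypothetical observed measurements.
    observed = [4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2, 5.0, 4.9]

    # The model is a global summary of the observed data.
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)

    # Data generated according to the model should share the
    # characteristics of the real data the model was built from.
    generated = [random.gauss(mu, sigma) for _ in range(1000)]

    print("observed:  mean=%.2f stdev=%.2f" % (mu, sigma))
    print("generated: mean=%.2f stdev=%.2f"
          % (statistics.mean(generated), statistics.stdev(generated)))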


They also offer an interesting decomposition of data mining algorithms into orthogonal components, which contrasts with the magical and reductionist view marketed by tool vendors (always built around the idea that simply applying a specific algorithm will magically accomplish the task at hand). In essence, a data mining algorithm intends to perform a specific task on a sample or on a complete multivariate dataset. But the task (the first component) is only one of several components that a mining algorithm must address (a toy sketch after the list makes the decomposition concrete):

  1. Obviously, the data mining task in question: whether it be visualization, prediction (classification or regression), clustering, rule pattern discovery, summarization through a descriptive model, pattern recognition, etc.

  2. The functional form of the model or the structure of the pattern. Examples include linear regression forms, non-linear functions such as those produced by a neural network, a decision tree, a hierarchical clustering model, an association rule, etc. These forms delimit the boundary of what we can expect to approximate or learn.

  3. The score function used to judge the quality of the fitted model used to summarize the observed data, or of the pattern used to characterize a local structure of the data. This score function is what we try to maximize (or minimize) when we fit the parameters of our model. It can be based on goodness-of-fit (how well the model describes the observed data) or on generalization performance, i.e. how well it describes data not yet observed (for prediction purposes).

  4. The search or optimization method used: the computational procedures and algorithms used for maximizing the score function for a particular model. When the structure is fixed, the search can limit itself to selecting the best parameter values within the k-dimensional parameter space (as in the case of a k-th order polynomial functional form); otherwise we may first have to select a structure from a set or family of different structures.

  5. The data management technique to be used for storing, indexing and retrieving the data. This aspect becomes paramount when it is time to process massive data sets that rule out the use of main memory alone.
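
To make this decomposition concrete, here is a toy sketch (my own illustration, with made-up numbers) for a regression task (component 1), written so that the functional form, the score function and the search method are separate, swappable pieces:

    # Component 5: data management -- here simply a small in-memory dataset.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.1, 9.8]

    # Component 2: functional form -- a linear model y = a*x + b.
    def model(a, b, x):
        return a * x + b

    # Component 3: score function -- goodness-of-fit as summed squared error.
    def score(a, b):
        return sum((model(a, b, x) - y) ** 2 for x, y in zip(xs, ys))

    # Component 4: search method -- a crude gradient descent on the score.
    a, b, lr = 0.0, 0.0, 0.01
    for _ in range(2000):
        grad_a = sum(2 * (model(a, b, x) - y) * x for x, y in zip(xs, ys))
        grad_b = sum(2 * (model(a, b, x) - y) for x, y in zip(xs, ys))
        a, b = a - lr * grad_a, b - lr * grad_b

    print("fitted: y = %.2f*x + %.2f, score = %.3f" % (a, b, score(a, b)))

Swapping the score function for absolute error, or the search method for the closed-form least-squares solution, would leave the other components untouched; that independence is precisely the flexibility the decomposition is meant to expose.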

The ideal tool would let you vary each component independently of the others during a data mining activity. This level of agility and flexibility is, however, not available in today's tools... which may be justified and reasonable for the optimization and data management components, but much less so for the functional form and score function components.

In practice, tools usually offer pre-packaged algorithms, and with them you can easily fall into the algorithm trap, where you are only expected to apply some well-established mining algorithm to magically accomplish the specific task at hand. This is the typical black-box paradigm that I've learned to despise in data mining (note that black-box abstraction is otherwise beneficial, especially in the object-oriented programming model).

My curiosity simply forces me to step back and understand the different mining algorithms applied to my data. After reviewing some of the literature, I've realized that I came across a lot of the theory and practice of statistics during my curriculum, but I cannot say the same for machine learning (although I did some cluster analysis for my master's thesis). So in the spirit of increasing your theoretical grasp of the underlying principles, let me give you a list of the books I highly recommend (in ascending order of statistical prerequisites):

Martin

p.s. I will try to summarize some of the principles found in these books that I consider more than useful for any data mining practitioner to have. Although blogs are not the ideal platform for such knowledge sharing, they are the most convenient one I have at hand (at least currently).
