Saturday, April 15, 2006

JDMAPI: some info

Here's some information I've gathered by looking at the API (JDMAPI spec version 1.0) and at the material found here and there:

1- Mining Function

The features that are probably central to the API are called mining functions. A mining function relates to the objective one wishes to achieve and operates on individual records, or cases (a case is a collection of related attributes belonging to one entity or transaction, e.g. a customer or a purchase, used as input for building the data mining model).

These can also be categorized by whether they are supervised (the model is trained and tested using a known target attribute value) or unsupervised (no target attribute is used).

Looking at the Enum type called MiningFunction, we find these supervised functions:

  • Classification (predict a target attribute of type categorical)
  • Regression (predict a target attribute of type numerical, i.e. continuous)
  • *Supervised Multi-Target (for models predicting multiple target attributes at once)


For unsupervised functions, we have:
  • Clustering (assign each record to the natural cluster it is "closest" to)
  • Association (discover hidden interrelationships or correlations among the variables)
  • *Anomaly Detection (identify rare and unexpected cases)
  • *Time Series (understand the pattern of a time-ordered series of cases and forecast it)
  • *Feature Extraction (project the set of all attributes onto a much smaller set of features that capture important characteristics of the data and are useful for visualization)



There is also a function applicable to both supervised and unsupervised problems, called Attribute Importance, which helps reduce the number of attributes and the complexity of the model to build. It identifies the most relevant attributes and reduces noise when building a mining model.
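To make the grouping above easier to scan, here is a minimal sketch in plain Java. The `Kind` and `FunctionSketch` types are hypothetical names of my own, not the `MiningFunction` class from the spec; they only restate the supervised/unsupervised split and the release 2.0 markers listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types: these are NOT the javax.datamining classes.
enum Kind { SUPERVISED, UNSUPERVISED, BOTH }

enum FunctionSketch {
    CLASSIFICATION(Kind.SUPERVISED, false),
    REGRESSION(Kind.SUPERVISED, false),
    SUPERVISED_MULTI_TARGET(Kind.SUPERVISED, true),
    CLUSTERING(Kind.UNSUPERVISED, false),
    ASSOCIATION(Kind.UNSUPERVISED, false),
    ANOMALY_DETECTION(Kind.UNSUPERVISED, true),
    TIME_SERIES(Kind.UNSUPERVISED, true),
    FEATURE_EXTRACTION(Kind.UNSUPERVISED, true),
    ATTRIBUTE_IMPORTANCE(Kind.BOTH, false);

    final Kind kind;
    final boolean release2; // true for the entries marked with * above

    FunctionSketch(Kind kind, boolean release2) {
        this.kind = kind;
        this.release2 = release2;
    }

    // Functions usable for the given kind of problem; entries tagged BOTH
    // (Attribute Importance) are returned for either kind.
    static List<FunctionSketch> ofKind(Kind wanted) {
        List<FunctionSketch> result = new ArrayList<>();
        for (FunctionSketch f : values()) {
            if (f.kind == wanted || f.kind == Kind.BOTH) {
                result.add(f);
            }
        }
        return result;
    }
}

class FunctionSketchDemo {
    public static void main(String[] args) {
        System.out.println("Supervised:   " + FunctionSketch.ofKind(Kind.SUPERVISED));
        System.out.println("Unsupervised: " + FunctionSketch.ofKind(Kind.UNSUPERVISED));
    }
}
```

Attribute Importance shows up in both lists, which matches its description as applicable to either kind of problem.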


2- Mining Algorithm

To build a mining model for each of these functions, we need to apply specific algorithms. Checking the Enum class called MiningAlgorithm, we find:

  • Decision Tree
  • Feed Forward Neural Net
  • kMeans (a k-means clustering algo)
  • Naive Bayes
  • SVM Classification (for classification)
  • SVM Regression (for regression)
  • *Arima (for time series)
  • *Arma (for time series)
  • *Auto Regression (for time series)
  • *NMF (Non-negative Matrix Factorization algorithm for feature extraction)
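As a rough illustration, the algorithm-to-function pairings that the list spells out can be written as a lookup table. This is only a sketch: the class name, the string keys and the "not stated" fallback are all made up for this example, and the algorithms the list does not tie to a function (Decision Tree, Feed Forward Neural Net, Naive Bayes) are deliberately left unmapped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table; the keys are illustrative labels, not identifiers
// taken from the MiningAlgorithm class.
class AlgorithmSketch {
    static final Map<String, String> FUNCTION_OF = new LinkedHashMap<>();
    static {
        // Only the pairings stated in the list above.
        FUNCTION_OF.put("kMeans", "clustering");
        FUNCTION_OF.put("svmClassification", "classification");
        FUNCTION_OF.put("svmRegression", "regression");
        FUNCTION_OF.put("arima", "time series");
        FUNCTION_OF.put("arma", "time series");
        FUNCTION_OF.put("autoRegression", "time series");
        FUNCTION_OF.put("nmf", "feature extraction");
    }

    // Returns the function the list associates with an algorithm,
    // or "not stated" when the list gives no pairing.
    static String functionFor(String algorithm) {
        return FUNCTION_OF.getOrDefault(algorithm, "not stated");
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : FUNCTION_OF.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        System.out.println("decisionTree -> " + functionFor("decisionTree"));
    }
}
```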



3- Mining Task

The API includes a definite set of tasks used to construct the various mining objects (see section 4, Mining Objects). These tasks, defined by the Enum class 'MiningTask', are:


  • buildTask (to construct a mining model, see mining function)
  • testTask (for validating the mining model on an independent test dataset, supervised functions only)
  • applyTask (for applying the mining model to a new data source)
  • computeStatisticsTask (to get basic statistics of attributes from the source physical data)
  • importTask/exportTask (for interacting with external applications/frameworks)
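The one restriction in this list (testing only makes sense when there is a known target) can be sketched as follows. `TaskSketch` is a hypothetical enum written for this post, not the `MiningTask` class itself, and the `applicableTo` helper is my own shorthand for the rule stated above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the task kinds listed above.
enum TaskSketch {
    BUILD, TEST, APPLY, COMPUTE_STATISTICS, IMPORT, EXPORT;

    // Tasks that make sense for a function: everything, except that TEST
    // is dropped for unsupervised functions (no target value to compare with).
    static Set<TaskSketch> applicableTo(boolean supervised) {
        Set<TaskSketch> tasks = EnumSet.allOf(TaskSketch.class);
        if (!supervised) {
            tasks.remove(TEST);
        }
        return tasks;
    }
}

class TaskSketchDemo {
    public static void main(String[] args) {
        System.out.println("Supervised:   " + TaskSketch.applicableTo(true));
        System.out.println("Unsupervised: " + TaskSketch.applicableTo(false));
    }
}
```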



4- Mining Objects

Typically, when one submits a mining task to the DME (data mining engine), it generates persistent objects called NamedObjects. These objects are normally stored in the mining repository and can be saved and restored with the API:

  • buildSetting (used to specify the model to be built, i.e. the mining function, the source data/attribute, the algorithm to be used with its settings, etc.)
  • testMetrics (used to produce a test result)
  • applySetting (used to define the data on which a model needs to be applied)
  • physicalDataSet (a pointer to the original data used for build, test and apply, e.g. a database table or a file)
  • logicalData (optionally describes the physical data set, to change attribute names and their types)
  • model (used to store the resulting mining model)
  • task (used to refer to existing tasks and their status)
  • taxonomy (used to define a hierarchical grouping of categorical attribute values)
  • costMatrix (matrix used with classification to associate a cost with each actual versus predicted value pair)
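Of the objects above, the cost matrix is the easiest to show with numbers. The sketch below is plain Java, not the JDM costMatrix object; the row/column orientation (rows = actual class, columns = predicted class) and all the figures are assumptions made up for illustration. It weights each cell of a confusion matrix by the matching cost and sums the result.

```java
// Hypothetical helper illustrating what a cost matrix is used for.
class CostMatrixSketch {

    // counts[actual][predicted]: how many test cases fell in each cell.
    // cost[actual][predicted]:   cost charged for each case in that cell.
    static double totalCost(int[][] counts, double[][] cost) {
        double total = 0.0;
        for (int actual = 0; actual < counts.length; actual++) {
            for (int predicted = 0; predicted < counts[actual].length; predicted++) {
                total += counts[actual][predicted] * cost[actual][predicted];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up two-class example: class 0 = "no", class 1 = "yes".
        int[][] counts = {
            {50, 10},   // actual "no":  50 predicted "no", 10 predicted "yes"
            { 5, 35}    // actual "yes":  5 predicted "no", 35 predicted "yes"
        };
        double[][] cost = {
            {0.0, 1.0}, // a false "yes" costs 1
            {5.0, 0.0}  // a missed "yes" costs 5
        };
        // 10 * 1.0 + 5 * 5.0 = 35.0
        System.out.println("Total cost: " + totalCost(counts, cost));
    }
}
```

With these numbers the 5 missed "yes" cases cost more than the 10 false "yes" cases, which is the kind of asymmetry a cost matrix is meant to express.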


Note: * refers to release 2.0 functions.

Martin
