Friday, November 10, 2006

Application performance with ORM tools

Obviously, some aspects matter more than others when it comes to the performance of applications connecting to relational databases. If, on top of that, you use an ORM mapping tool, then misuse of the tool can muddy the waters even further.

In general, here are, in order of importance, the elements to consider when optimizing applications that interact with a database through an ORM tool:

  1. The ER (Entity-Relationship) model of the database. No matter how well the application and the ORM modules are optimized, if the database's ER model is rotten it will be hard to produce performant applications. When using ORM tools this problem normally does not arise, since the model is created more or less from the UML model. It usually shows up when you depend on a legacy database that cannot be modified.
  2. The ORM configuration (fetching strategies, caching, report queries). This aspect is critical: a bad configuration will inevitably produce very poor performance regardless of the other aspects, with a degradation that grows linearly with the data volume (a report-query sketch follows this list).
  3. The physical model of the database (tables, constraints, indexes). Assuming the previous aspect is respected, this is what guarantees good performance as the data volume grows.
  4. General database tuning (vendor-specific). This aspect is closely tied to the database product; each vendor provides different mechanisms to improve the performance of OLTP or OLAP applications. It is hard to generalize and is the DBA's responsibility.
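
As an example of the report-query point in item 2, here is a minimal Hibernate sketch, assuming a hypothetical mapped Purchase class and an already-configured SessionFactory: a projection query returns only the scalar values needed for the report instead of hydrating full entities, which is usually the cheap way to handle large read-only volumes.

    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;

    public class SalesReportQuery {

        private final SessionFactory sessionFactory;

        public SalesReportQuery(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        // Report query: returns Object[] rows of (customer name, total amount) instead of
        // loading every Purchase entity into the session and its caches.
        public List totalAmountPerCustomer() {
            Session session = sessionFactory.getCurrentSession();
            return session
                .createQuery("select p.customerName, sum(p.amount) from Purchase p group by p.customerName")
                .list();
        }
    }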

There are always exceptions, but in general these principles apply quite well to OLTP-style database applications.

Martin

Sunday, October 22, 2006

Dry stone wall

Phew!

This summer I finally completed my little "natural style" landscaping around our pool. The hardest part was without a doubt the dry stone wall, which required several hundred large stones lying around here and there. After building our house, all the landscaping remained to be done. We contracted out some of it, but the prohibitive cost of this kind of seasonal labour convinced me to do part of it myself.

The ground our house is built on is literally overrun with stones, so I decided to kill two birds with one stone (bad pun intended): not just haul them away to get rid of them (lots of sweat), but also reuse them (even more sweat) to build a low wall, a terrace and some stepping stones. The one important principle for the wall is that since there is no cement holding it together (it stands by gravity alone), you must make sure each stone rests on at least three other stones.

And after a big thank you to my girlfriend and my mother for the lovely flower garden (ok, ok, to my kids too, who tripped over it more than once), here is the result of this landscaping...!

Martin



Monday, October 16, 2006

General Statistical Concepts

Update: Most of these notes are a work in progress and likely to remain so for... well, no end date planned! These notes, originally embedded as a Google Doc iframe, have thus been removed from this blog... I may decide to publish them directly as a Google Doc later on.



This post is part of the notes I'm gathering from various references providing theoretical background and explanations related to data mining, analytics and computer learning (see this post for the book references). I'm gathering these notes in the hope of being a little smarter when applying and interpreting data mining algorithms taken out of the box from mining tools (I must admit I'm also doing this to serve my endless quest for understanding).

This is actually linked (through an "iframe" tag) to a Google Doc that I keep updating as I face projects making use of new mining algorithms... so this is work in progress. I realize that a blog is probably not the best way to publish live text, but it is the easiest one for me.

This first part gathers basic topics from statistics that are difficult to classify under one precise subject... it should pretty much serve as a refresher for most people in this domain.


Definition:
  • instances = data objects observed and analysed (sometimes referred to as objects, data points...)
  • variables = characteristics measured (for continuous) or observed (for categorical) for each instance

Notation:
  • n data objects (sample size)
  • X denotes a generic input variable. When it is a vector, its j-th component variable is written with a subscript: Xj
  • x denotes an observed instance; when we have p variables, x1 .. xp denote the real values of the 1 .. p variables measured on the particular object or instance.
  • xk(i) corresponds to the measurement of variable Xk on the i-th data object, where i ranges over 1 .. n.
  • x (in bold) corresponds to the vector of the n observations of a single variable.
  • X (capital, in bold) corresponds to the n × p matrix containing the n input p-vectors x(1) .. x(n) (see the matrix form summarized below).
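
To make the notation above concrete, here is the usual data-matrix layout it describes (a standard convention, not a quote from any particular reference):

    \mathbf{X} =
    \begin{pmatrix}
    x_1(1) & x_2(1) & \cdots & x_p(1) \\
    x_1(2) & x_2(2) & \cdots & x_p(2) \\
    \vdots & \vdots & \ddots & \vdots \\
    x_1(n) & x_2(n) & \cdots & x_p(n)
    \end{pmatrix}
    \qquad
    \mathbf{x}_k =
    \begin{pmatrix} x_k(1) \\ x_k(2) \\ \vdots \\ x_k(n) \end{pmatrix}

Row i is the p-vector of measurements on the i-th object, and column k is the bold vector x_k of the n observations of variable X_k.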

iFrame notes now removed.


Friday, September 29, 2006

I hate UI-type development

Why do I hate having to deal with UI-type development... I don't know but I have some hints:

I simply suck at it!

Although I enjoy using intuitive UIs and appreciate the design value of them, I have neither the patience nor the talent to do it! In my view, so much time and effort spent simply designing a nice HTML/JSP page or a rich-client equivalent (with SWT, for example) is too frustrating for the end result. I sometimes have to do it when delivering an end-to-end product for clients, and typically most of my time gets wasted on this UI stuff! I guess I could outsource all of it; actually I even tried it once... but finding a good designer willing to develop inside JSP pages is another challenge on its own!!!

When I first did some RCP stuff in Eclipse, I appreciated all the advanced design patterns available in libraries such as JFace, but I soon got bored and tired again of dealing with all these widget detail considerations; I'm hopeless.

I guess I'll stick to creating domain business layers, service business layers, data access layers, and other non-visual features!

Martin

Tuesday, September 19, 2006

Paragliding

I had the chance to try paragliding during my last vacation and I loved the sensations!!! It confirms my attraction to sports that exploit wind dynamics: I love the magic that happens when you are carried by the force of the wind... and the speed you pick up despite the absence of any engine noise.

I like windsurfing but find its usable window far too narrow (at least where I live right now)... which is why I always insist on having access to it on family trips down south. The last time (in Punta Cana), however, was rather disappointing: the days with good wind came with a ban on going out, and on the days the ban was lifted the small sails were barely enough to get out of the waves, so it was impossible to do anything; a kind of windsurfing catch-22!

Paragliding seems better suited to varying conditions because you can exploit two types of wind: dynamic and thermal. The latter allows a significant climb in altitude, while the former is exploited along cliffs or mountain faces! The site where I practised was quite good, although far from alpine altitudes, but it has the advantage of launch points facing all four cardinal directions, guaranteeing a launch (except on stormy or overly windy days). For those interested:


Here is a short sequence of the end of my tandem descent.... just after doing a few good 360-degree spirals, accompanied by I don't know how many G's, enough to feel my brain sink right down to my feet! Yes indeed, the famous centrifugal force (or would it be centripetal... my high-school physics classes are somewhat buried under the mass of experience that keeps expanding with time!) plays a big role in this kind of manoeuvre...




Martin

Tuesday, September 12, 2006

Unit Testing but...

I really appreciate developing using the unit testing approach, and as such I always have a JUnit library somewhere in my classpath while building an application. This really raises my confidence in my code and allows me to refactor at ease without worrying about breaking existing functionality.
However, there are a few stricter recommendations commonly found among unit-testing fanatics or extreme-programming advocates that I find, to say the least, debatable:


  1. your unit test should focus on a single method of a single class, otherwise it is not really a unit test.
  2. always build your unit test first and write your application class afterward.

Point 1 actually emphasizes the term unit, and violating it makes your unit tests more like integration tests, which I agree with. But in my view these tend to be more meaningful and practical.

First of all, methods should be small and precise, have a single clear responsibility, and have descriptive names that convey their purpose. As such I tend to agree with recommendations that limit the size of a single method (R. Johnson gave a ballpark figure of 30-40 lines of code including comments, while J. Kerievsky goes as far as recommending ten lines of code or fewer, with the majority using one to five lines). Keeping methods small and giving them intuitive names produces much easier, self-documenting code: I like this idea since it reduces the need to document your code!

This is why I feel that principle 1 above runs against the "write short methods" approach, since a small method does not contain enough complex logic to require a dedicated unit test of its own.


A JUnit class that tests and validates the effect of each and every single method on the state of the current object or of some dependents (through mock objects) is often trivial and thus overkill! Also, a large number of methods may not deserve a fully dedicated test, since not only is their logic simple but their impact on state is minimal.

That's why I twist my unit tests a bit to make them more like integration tests, i.e. test only the important methods of the class in relation to their impact on the class itself and on its dependencies (external libraries, other pieces of my code...). Ok, this is not always possible, especially when the dependency is a costly, resource-intensive component (I'll use mocks in such cases), but very frequently this lets me validate and better understand the external library during my tests, as well as test my code against its dependency. I even find myself doing such integration tests with code at the service layer level (above the DAO layer) and validating its effect at the database tier. Using a small in-memory database engine such as HSQLDB helps negate the performance penalty of doing this; a sketch of this kind of test follows.
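
As an illustration, here is a minimal sketch of such a test against an in-memory HSQLDB instance; the table and queries are invented for the example, and a real test would exercise a DAO or service class rather than raw JDBC.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import junit.framework.TestCase;

    public class OrderPersistenceIntegrationTest extends TestCase {

        private Connection conn;

        protected void setUp() throws Exception {
            // In-memory HSQLDB: nothing to install, and the schema disappears after the run.
            Class.forName("org.hsqldb.jdbcDriver");
            conn = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "");
            Statement st = conn.createStatement();
            st.executeUpdate("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount DOUBLE)");
            st.close();
        }

        protected void tearDown() throws Exception {
            Statement st = conn.createStatement();
            st.executeUpdate("SHUTDOWN");
            st.close();
            conn.close();
        }

        public void testInsertAndReadBack() throws Exception {
            // A real test would go through the service/DAO layer instead of raw JDBC.
            Statement st = conn.createStatement();
            st.executeUpdate("INSERT INTO orders VALUES (1, 99.50)");
            ResultSet rs = st.executeQuery("SELECT amount FROM orders WHERE id = 1");
            assertTrue(rs.next());
            assertEquals(99.50, rs.getDouble(1), 0.001);
            rs.close();
            st.close();
        }
    }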

As for point 2, I usually adopt more of a concurrent approach, i.e. draft the application class and, once it stabilizes, create the test class and make the two evolve simultaneously. The first few versions of my class/interface are a bit too dynamic and sketchy to really deserve an accompanying test class. So, to limit the need to duplicate my changes in both, I'd rather wait until I'm more comfortable with the class/interface and then proceed with writing test cases.
The only advantage I see in creating the test case first is when I really don't know how my object is going to be used in client code. However, in that case, I'd rather use a pencil and sketch some use-case scenarios beforehand...

Martin

Thursday, September 07, 2006

Handling persistence with ORM

In Java application development, using an Object-Relational Mapper to connect to the database typically offers many advantages:

  • avoiding the need to code against the lower-level JDBC API
  • dealing with data persistence concerns more transparently, in a way better aligned with the object-oriented paradigm
  • isolating database vendor specifics, allowing easy porting to a number of different DB backends
  • providing additional built-in services such as connection pooling, caching, etc.
  • reducing the need to be highly skilled in SQL, although ignoring relational concepts and SQL altogether is definitely not realistic
  • writing less code

On the flip side, I've realized that there are drawbacks as well, such as:

  • providing least common denominator functionality to achieve DB neutrality
  • losing control of the SQL statements automatically generated for us
  • some performance degradation, no matter what the tool vendor pretends (an ORM will always be one layer on top of JDBC...); however, a smart caching strategy can mitigate this
  • requiring additional knowledge of the ORM API (so less code to write but more library code to understand and make use of)
  • failing when the application use case is focused on data reporting and aggregation of large data volumes rather than on transaction-based data entry.

Typically, on the last project I built using Hibernate, I enjoyed spending more time on the design of a good domain model layer since I spent less on persistence logic concerns. However, I discovered later, through more realistic usage and data volume tests, that it suffered nasty performance degradation in some specific use cases that were not discovered through unit testing (unit testing is only concerned with functional testing, not performance and scaling issues).

Without going into details, the problem had to do with the number of round trips Hibernate was triggering to fetch an object's data graph. I had designed some relations (1:1 or N:1) to be eagerly fetched (always fetch the related object) instead of using a lazy fetching strategy (hit the database only when necessary). This was good in some scenarios, since some data dependencies were always needed and this avoided a second database call to get the dependent object's data. However, when fetching a collection, the effect was actually a separate DB call for every single element of the collection. So getting a list of N items resulted in N+1 DB calls! Alternative solutions exist, but the recommendation is to model most (if not all) object relations with a lazy strategy and override this default by specifying a different fetch mode at run time, as sketched below.
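
Here is a minimal sketch of what that override can look like with Hibernate 3, assuming hypothetical mapped Customer and Order classes (with the customer's orders collection declared lazy in the mapping) and an already-configured SessionFactory:

    import java.util.List;
    import org.hibernate.FetchMode;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;

    public class CustomerQueries {

        private final SessionFactory sessionFactory;

        public CustomerQueries(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        // The mapping declares customer.orders as lazy; this query overrides it with a join
        // fetch, so the whole list comes back in one SQL statement instead of N+1 round trips.
        public List findCustomersWithOrders() {
            Session session = sessionFactory.getCurrentSession();
            return session
                .createQuery("select distinct c from Customer c left join fetch c.orders")
                .list();
        }

        // Same idea with the Criteria API (Customer is the hypothetical mapped entity class).
        public List findCustomersWithOrdersCriteria() {
            Session session = sessionFactory.getCurrentSession();
            return session.createCriteria(Customer.class)
                .setFetchMode("orders", FetchMode.JOIN)
                .list();
        }
    }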

Bottom line, there is no magic bullet, especially when it comes to database interaction. We need a good grasp of relational database concepts in order to build applications that interact with a database, no matter what tools or frameworks we work with.

Martin

Sunday, September 03, 2006

Data mining background

Data mining has its roots, among other things, in statistics and computer learning. To generalize greatly, people tend to view data mining performance very differently depending on their background: those with a statistical background rate performance on statistical significance and inferential power, whereas computer scientists tend to measure performance on both algorithmic efficiency and scalability. However, I realize that the two approaches really are two sides of the same coin, and this is reflected in the most recent scientific literature.

Various definitions of data mining can be found in the literature, but I personally prefer the more academic point of view to the one commonly marketed by vendors. Here are some explanatory excerpts taken from the excellent book « Principles of Data Mining » by David Hand, Heikki Mannila and Padhraic Smyth (MIT Press), which seems to be one of the few data mining books respected inside the statistical community.


It (data mining) is recognized as a set of well-defined procedures that take data as input and produce output in the form of models (a summarized, descriptive form of the data as a whole) or patterns (a descriptive form of some local phenomenon occurring in a fraction of the data). A well-defined procedure contrasts with a computational method, which, unlike a data mining procedure, does not guarantee termination after a finite number of steps.

Data mining is concerned with building empirical models that are not based on some underlying theory about the mechanism through which the data arose, but rather models consisting of a description of the observed data. In this sense, a good model is qualified as « generative »: data generated according to the model will share the same characteristics as the real data from which the model was built.


They also offer an interesting decomposition of data mining algorithms into orthogonal components, which contrasts with the magical and reductionist view marketed by tool vendors (always built around the idea of simply applying a specific algorithm to magically accomplish the task at hand). In essence, a data mining algorithm intends to perform a specific task on a sample or a complete multivariate dataset. But the task (the 1st component) is only one of several components that a mining algorithm usually addresses:

  1. Obviously, the data mining task in question: whether it be visualization, prediction (classification or regression), clustering, rule/pattern discovery, summarization through a descriptive model, pattern recognition, etc.

  2. The functional form of the model or the structure of the pattern. Examples include linear regression forms, non-linear functions such as those produced by a neural network, a decision tree, a hierarchical clustering model, an association rule, etc. These forms delimit the boundary of what we can expect to approximate or learn.

  3. The score function used to judge the quality of the fitted model used to summarize the observed data, or of the pattern used to characterize a local structure of the data. This score function is what we try to maximize (or minimize) when we fit the parameters of our model. It can be based on goodness of fit (how well the model describes the observed data) or on generalization performance, i.e. how well it describes data not yet observed (for prediction purposes); a classic example is the squared-error score shown after this list.

  4. The search or optimization method used: the computational procedures and algorithms used for maximizing the score function for a particular model. The search can limit itself to selecting the best parameter values within the k-parameter space (as in the case of a k-th order polynomial functional form) when the structure is fixed, or we may have to select first from a set of families of different structures.

  5. The data management technique used for storing, indexing and retrieving the data. This aspect becomes critical when it is time to process massive data sets that rule out the use of main memory alone.
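
To make the score-function component concrete, the most familiar example is the sum of squared errors used in regression (a standard textbook formula, not a quote from the book):

    S(\theta) = \sum_{i=1}^{n} \bigl( y(i) - \hat{y}(i;\theta) \bigr)^2

where y(i) is the observed target value for the i-th instance, \hat{y}(i;\theta) is the value predicted by the model with parameters \theta, and fitting the model means choosing \theta so as to minimize S(\theta).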

The ideal tool would allow you to use the different components independently from each other during a data mining activity. This level of agility and flexibility is, however, not available in today's tools... which may be justified and reasonable for the optimization and data management components, but much less so for the functional form and score function components.

In practice, tools usually offer pre-packaged algorithms, with which you can easily fall into the algorithm trap where you are only expected to apply some well-established mining algorithm to magically accomplish the specific task at hand. This is the typical black-box paradigm that I've learned to despise in data mining (note that black-box abstraction is otherwise beneficial, especially in the OO programming model).

My curiosity simply forces me to step back and discover what all these different mining algorithms do to my data. After reviewing some of the literature, I've realized that I came across a lot of the statistical theory and practice during my curriculum; however, I cannot say the same for machine learning (although I did some cluster analysis during my master's thesis). So, in the spirit of increasing your theoretical grasp of the underlying principles, let me give you a list of the books I highly recommend (in ascending order of statistical theory prerequisites):

Martin

p.s. I will try to summarize some of the principles found in these books that I consider more than useful for any data mining practitioner. Although blogs are not the ideal platform for such knowledge sharing, it is the most convenient one I have at hand (at least currently).

Tuesday, August 15, 2006

Preferences

With time and especially experience, I'm getting better at knowing my work preferences... here is a small list of items in the format "suits me versus suits me less":
  • doing versus delegating
  • working on a smaller number of more demanding concurrent tasks versus a larger number of more routine, monotonous tasks
  • working within a precise, concrete framework versus a fuzzier, more theoretical one
  • analysis and reasoning work that depends on logic versus research work that depends on accumulated knowledge
  • understanding versus learning
  • learning through understanding versus learning through memorization
  • technical management of projects/resources versus administrative management
  • varied, exploratory work versus fixed, recurring and therefore redundant work.

This list is obviously dynamic, but for a few years now it has stabilized quite well. The fluctuations seem to be additions rather than modifications!

Martin


Note: From time to time, and usually when my posts are more personal, I'll blog in French. As you may have already noticed, English is not my first language, but rather the language I often use in my professional life.

Thursday, August 03, 2006

Java and Oracle

Oracle has been committed since Oracle8i to integrating Java within its database/application architecture. Being confronted with the development of a particular application highly tied to Oracle, I'm taking this opportunity to review the current state of affairs as of Oracle 10g. Here's what I found:

Originally the strategy was database-centric: nearly all software layers would be offered and hosted directly inside the database engine. This controversial strategy (to say the least) has since been reversed in 9i and 10g, where some J2EE technologies already integrated inside the database (e.g. the EJB container, JSP and servlets) have been desupported.

The focus is now on providing a complete Application Server suite (J2EE compliant) outside the database offering a vast number of services and support, pretty much like IBM WebSphere, BEA WebLogic or JBoss Application Server.

However, the original strategy did lead to the development (beginning with Oracle 8i) of a fully functional and compatible Java Virtual Machine inside the database: the OracleJVM.

Each of these two components is commented on next.


1- OracleJVM

As of the 10g release, the OracleJVM offers these characteristics:

  • supports J2SE 1.4.2 as specified by Sun Microsystems
  • supports only the headless mode of Java AWT (i.e. no GUI can be materialized on the server or remotely)
  • Java classes (bytecode), resource files and Java source code (optional) all reside in the database and are stored at the schema level (known as Java schema objects)
  • each session (a user connecting to the database and calling Java code) sees its own private JVM (although for performance reasons the implementation shares some parts of the Java libraries between sessions)
  • core Java class libraries run natively through ahead-of-time compilation to platform-specific C code before runtime
  • core Java libraries are stored and loaded within the PUBLIC schema and are thus available to all other schemas
  • application-specific Java classes are stored and loaded within the user schema (the owner)
  • besides writing the Java class, compiling it and running it, the OracleJVM requires two extra steps in its development/deployment cycle: 1- the class needs to be loaded into the database (done through a utility called loadjava), 2- the class needs to be published when callable from SQL or PL/SQL (done by creating and compiling a call specification, a.k.a. a PL/SQL wrapper) to map the Java method's parameters and return type to Oracle SQL types.
  • granting execution rights is also needed when running Java classes located in another user's schema
  • class loading is done dynamically as in a conventional JVM; however, it is done into shared memory, so only a one-time loading speed hit is encountered across all user code requiring the class
  • instead of a global classpath defined at runtime to resolve and load all application classes, the OracleJVM uses a per-class resolver, specified during class installation, indicating in which schemas the dependent classes reside
  • multi-threading is usually achieved using the embedded scalability of the database server, making Java language threads needless since they won't improve the concurrency of the application (this helps avoid complex multi-threading issues inside Java code)
  • the OracleJVM offers an adapted version of JDBC (called the server-side internal driver), specially tuned to provide fast access to Oracle data from Java stored procedures, as well as an optimized server-side SQLJ translator.

Execution control:

How exactly do we start a Java application located inside the Oracle database; in other words, what is the equivalent of the static main entry point of a "normal" application launched by a conventional JVM? This process is referred to in Oracle terminology as a Call and can be done by calling any static method of an available loaded and published class. These published classes must therefore contain a static method entry point, and are the Java counterpart of a PL/SQL procedure (hence the term Java Stored Procedures).

Some possible scenarios of a Java call include:

  1. a SQL client program running a Java stored procedure
  2. a trigger (i.e. an event fired by a defined SQL DML statement) running a Java stored procedure
  3. a PL/SQL program calling Java code

These Java stored procedures are callable from PL/SQL code but can also call PL/SQL procedures; a small example follows.
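
As a minimal sketch (the class, schema and function names are invented for the example), a Java stored procedure and its call specification could look like this:

    // Greeter.java -- a trivial class intended to run inside the OracleJVM.
    public class Greeter {

        // The static method that will be exposed as a Java stored procedure.
        public static String hello(String name) {
            return "Hello " + name;
        }
    }

    /*
     Deployment sketch (assuming a SCOTT schema):

       1- Load the class into the database:
            loadjava -user scott/tiger -resolve Greeter.java

       2- Publish it with a PL/SQL call specification mapping SQL types to Java types:
            CREATE OR REPLACE FUNCTION hello (name IN VARCHAR2) RETURN VARCHAR2
            AS LANGUAGE JAVA
            NAME 'Greeter.hello(java.lang.String) return java.lang.String';

       3- Call it from SQL or PL/SQL:
            SELECT hello('world') FROM dual;
    */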

Some thoughts: even though I've never played with the OracleJVM, I'm yet to be convinced of its advantages: Java stored procedures seem a bit like writing Java code with a procedural mindset. It seems the only advantage is the ability to write and centralize business rules that are more portable and powerful than PL/SQL code and that remain available to applications written to bypass the application server tier?

2- Oracle OC4J J2EE Application Server (a.k.a. OracleAS):

This server, referred to as OC4J, now includes an ever-growing number of components (web server, J2EE technologies, ORM with TopLink, portlets, wireless, business intelligence, etc.). Its J2EE support includes: JSP, servlets, the JSF and ADF frameworks (using an event-based model for web HTTP processing), EJB, JNDI, XML support (schemas, namespaces, DOM, SAX, XPath...), and Web Services (WSDL, UDDI, SOAP).

The types of applications supported by this infrastructure are usually large and complex, i.e. they:

  • involve multiple application tiers: the central AS tier where the business logic is maintained, a web tier (possibly part of the AS tier) interacting with web clients, a backend database tier where persistent data is preciously stored, and client tiers ranging from fat to thin
  • involve multiple users with different roles and rights accessing common data concurrently
  • involve different remote user sites (which implies web access) and heterogeneous environments
  • involve sophisticated business rules
  • involve interaction with other enterprise information systems (EIS) through the J2EE Connector Architecture (ERPs such as SAP, legacy information systems)
  • involve web services support

Of course, not every application will need all of this, but to pull its weight and leverage this considerable software infrastructure, the application's requirements should reach a fair level of complexity before committing to this framework. This technological overhead is probably responsible for the creation of lighter and simpler initiatives coming from the open source community (lightweight frameworks only requiring a JSP/servlet web container, such as the one I described here).

Martin

Monday, May 29, 2006

J2EE development

Before starting any J2EE web development I did my own research on the tools and libraries that would best meet my web transaction-based application requirements (things like flexibility, simplicity, availability, cost, adoption...). I finally decided to go with the Spring Framework for all integration code and parameterization settings, Hibernate on the data tier to handle the ORM aspect, and Struts for the web tier. I have since discovered that this exact set of tools is promoted by SourceLabs (http://www.sourcelabs.com/?page=software&sub=sash) as the SASH stack. Although I appreciated developing with these libraries, I enjoyed even more the best practices that these frameworks encourage through the adoption of sound principles: loose coupling between components, separation of concerns, use of design patterns like MVC or dependency inversion, etc.


You have a feeling, when you build applications along these principles, that they are well architected and clean; but you enjoy it even more when the client calls you 5-6 months later to update the requirements!

Without going into details, a web application developed with these frameworks usually follows an architecture along these lines (a small wiring sketch follows the list):
  1. A separate and "dumb" view layer (JSP pages);
  2. A separate control layer (actions and action configuration files in Struts);
  3. A separate model/business layer (simple POJOs following JavaBean rules, which allows dependency injection with Spring);
  4. A separate data layer (through DAOs and the Hibernate ORM);
  5. An integration and configuration layer gluing all layers together through the Spring bean application context file;
  6. And finally a simple servlet/JSP container (e.g. Tomcat) to host the deployed application.
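
As an illustration of points 3 to 5, here is a minimal sketch of the service/DAO wiring; the class and bean names are invented for the example, and the Spring context entries that glue them together are shown as a comment:

    // ProductDao.java -- data layer contract; a Hibernate-backed implementation would sit behind it.
    public interface ProductDao {
        java.util.List findAllProducts();
    }

    // ProductService.java -- business layer: a plain POJO that receives its DAO via setter injection.
    public class ProductService {

        private ProductDao productDao;

        // Called by Spring when wiring the application context.
        public void setProductDao(ProductDao productDao) {
            this.productDao = productDao;
        }

        public java.util.List listCatalog() {
            return productDao.findAllProducts();
        }
    }

    /*
     Corresponding Spring bean definitions (applicationContext.xml), roughly:

       <bean id="productDao" class="example.HibernateProductDao">
         <property name="sessionFactory" ref="sessionFactory"/>
       </bean>

       <bean id="productService" class="example.ProductService">
         <property name="productDao" ref="productDao"/>
       </bean>

     The Struts action in the control layer would then simply call
     productService.listCatalog() and forward the result to a JSP view.
    */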

As a good advocate of open source, I put my principles into practice by making such an application available to anyone interested; just contact me by email and I'll send you a copy of the project.


Martin

Saturday, April 15, 2006

JDMAPI: some info

Here is some information I've gathered by looking at the API (JDMAPI spec version 1.0) and at material found here and there:

1- Mining Function

First, the features that are probably central to the API are called mining functions. These functions relate to the objective one wishes to achieve and operate on individual records or cases (a collection of related attributes belonging to one entity or transaction, used as input for building the data mining model, e.g. a customer, a purchase, etc.).

They can also be categorized by whether they are supervised (the model is trained and tested using a known target attribute value) or unsupervised (no target variable is used).

Looking at the Enum type called MiningFunction, we find these supervised functions :

  • Classification (predicts a target attribute of categorical type)
  • Regression (predicts a target attribute of numerical, i.e. continuous, type)
  • *Supervised Multi-Target (for models predicting multiple target attributes at once)


For unsupervised functions, we have:
  • Clustering (associates each record with the natural cluster it is "closest" to)
  • Association (discovers hidden interrelationships or correlations among the variables)
  • *Anomaly Detection (identifies rare and unexpected cases)
  • *Time Series (understands the pattern of, and forecasts, a time-ordered series of cases)
  • *Feature Extraction (projects the set of all attributes into a much smaller set of features useful for visualization, capturing important characteristics of the data)



There is also a function applicable to both supervised and unsupervised settings called Attribute Importance, which is helpful to reduce the number of attributes and the complexity of the model to build. This function helps identify the most relevant attributes and reduce noise when building the mining model.


2- Mining Algorithm

To build a mining model for each of these functions, we need to apply specific algorithms. Checking the enum class called MiningAlgorithm, we find:

  • Decision Tree
  • Feed Forward Neural Net
  • kMeans (a k-means clustering algo)
  • Naive Bayes
  • SVM Classification (for classification)
  • SVM Regression (for regression)
  • *Arima (for time series)
  • *Arma (for time series)
  • *Auto Regression (for time series)
  • *NMF (Non-negative Matrix Factorization algorithm for feature extraction)



3- Mining Task

The API includes a definite set of tasks used to construct the various mining objects (see Mining Named Objects). These tasks, defined by the enum class MiningTask, are given next:


  • buildTask (to construct a mining model, see mining function)
  • testTask (for validating the mining model on an independent test dataset, only for supervised)
  • applyTask (for applying the mining model to a new data source)
  • computeStatisticsTask (to get basic statistics of attributes from source physical data)
  • importTask/exportTask (for interacting with external application/framework)



4- Mining Objects

Typically, when one submits a mining task to the DME (data mining engine), it generates persistent objects called NamedObjects. These objects are normally stored in the mining repository and can be saved and restored with the API (a rough build-flow sketch is given after the note below):

  • buildSetting (used to specify the model to be built, i.e. the mining function, the source data/attribute, the algorithm to be used with its settings, etc.)
  • testMetrics (used to produce a test result)
  • applySetting (used to define the data on which a model needs to be applied)
  • physicalDataSet (a pointer to the original data used for build, test and apply, e.g. a database table or a file)
  • logicalData (optionally describes the physical data set to rename attributes and change their types)
  • model (used to store the resulting mining model)
  • task (used to refer to existing tasks and their status)
  • taxonomy (used to define a hierarchical grouping of categorical attribute values)
  • costMatrix (a matrix used with classification to associate a cost with actual versus predicted values)


Note: * refers to release 2.0 functions.
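
To give a feel for how these objects and tasks fit together, here is a rough sketch of a classification build flow using the JSR-73 factory pattern; the data source, attribute and object names are invented, and the exact factory and method signatures should be double-checked against the JDM javadoc:

    import javax.datamining.ExecutionHandle;
    import javax.datamining.data.PhysicalDataSet;
    import javax.datamining.data.PhysicalDataSetFactory;
    import javax.datamining.resource.Connection;
    import javax.datamining.supervised.classification.ClassificationSettings;
    import javax.datamining.supervised.classification.ClassificationSettingsFactory;
    import javax.datamining.task.BuildTask;
    import javax.datamining.task.BuildTaskFactory;

    public class ClassificationBuildSketch {

        public void buildModel(Connection dmeConn) throws Exception {
            // 1- Describe the physical data (a table or view holding the build cases).
            PhysicalDataSetFactory pdsFactory =
                (PhysicalDataSetFactory) dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
            PhysicalDataSet buildData = pdsFactory.create("CUSTOMER_BUILD_TABLE", false);
            dmeConn.saveObject("myBuildData", buildData, true);

            // 2- Build settings: the mining function (classification) and its target attribute.
            ClassificationSettingsFactory csFactory =
                (ClassificationSettingsFactory) dmeConn.getFactory(
                    "javax.datamining.supervised.classification.ClassificationSettings");
            ClassificationSettings settings = csFactory.create();
            settings.setTargetAttributeName("HAS_CHURNED");
            dmeConn.saveObject("myBuildSettings", settings, true);

            // 3- The build task ties data and settings together and names the resulting model.
            BuildTaskFactory btFactory =
                (BuildTaskFactory) dmeConn.getFactory("javax.datamining.task.BuildTask");
            BuildTask task = btFactory.create("myBuildData", "myBuildSettings", "myModel");
            dmeConn.saveObject("myBuildTask", task, true);

            // 4- Execute asynchronously in the DME and wait for completion.
            ExecutionHandle handle = dmeConn.execute("myBuildTask");
            handle.waitForCompletion(Integer.MAX_VALUE);
        }
    }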

Martin

Monday, March 27, 2006

Data Mining has now its open interface: JDMAPI

I've always been interested in data mining because it mixes advanced statistical or mathematical methods with complex data computation algorithms (typically developed for computer learning). On the negative side, its application can get bad press (sometimes well deserved) because of the potential abuse it can lead to... I will not dwell on identity and privacy issues, but when the goal respects people's right to privacy, one can leverage data mining to bring ordinary analysis to a much higher level. This is achieved through data induction (let the data speak for itself...) as opposed to data deduction (drawing conclusions from specific reports), which is what you typically encounter in more classical BI applications.

Data mining functionality is now built into Oracle 10g through a standardized JSR (http://www.jcp.org/en/jsr/detail?id=73). Although this first effort is limited in scope, it ensures applications can be developed independently of vendor proprietary APIs. It is currently being continued through a more complete initiative (JSR-247, http://www.jcp.org/en/jsr/detail?id=247) which will bring more mining functionality and advanced algorithms. Most big players contribute to this standardization effort: Oracle, SAS, Hyperion, SPSS, SAP, IBM, etc... but yes, of course, except Microsoft.

The standard API also offers extensibility (each vendor can provide additional functionality not explicitly defined within the standard) and covers the use of Web Services, which ensures complete independence of platform and implementation language.

More info can be found by googling JDMAPI.

I'll try to analyse this API in more detail and let you know my discoveries...

Sunday, March 19, 2006

From ER to OO

In my previous posts I've been sharing knowledge valuable to people dealing with the technology underlying relational database management systems (RDBMS for short). This technology is used to store literally any information held by an organization. I've been dealing with it since about 1996, and still do, mainly because of its ubiquity in the IT world. Relational databases store information using set theory and implement transaction and concurrency control to handle a large number of simultaneous connections; as such their scope is fairly limited (although most big players are trying to include more functionality and processing flexibility in their engines, e.g. Oracle experimenting with the inclusion of a JRE within the database...).


After modeling and designing database architectures for some time, I started designing and developing in object-oriented language environments (around 2002). At first this can be quite daunting with all the flexibility OO programming offers... Compared to database modeling, where you have a rigid framework and theory guiding your work, OO modeling seems to stimulate your artistic and creative abilities more than your analytical expertise.

To get over this new paradigm, here are some pragmatic steps I took and applied while learning Java, free of charge (or almost):
  1. Getting and reading good reference documentation, such as the free resource from Bruce Eckel, Thinking in Java. This first step will only help you gain some knowledge; to be able to do it yourself in an elegant and flexible way you'll definitely need more experience. After some practice you'll seem to face the same recurring problems over and over... this is where step 2 kicks in.

  2. Getting a good reference on design patterns; this will teach you to develop better-quality code (in aspects such as flexibility, robustness, adaptability, being less error prone, etc.) by following patterns developed by experienced developers. A good introductory book would be Head First Design Patterns, but for the real reference you should go to the Gang of Four.

  3. If you're still shy and afraid of downloading a free copy of Eclipse to experiment and code yourself (at this point maybe you should simply reconsider coding ;-), then what remains available to you are the millions of lines of quality code (mostly in Java) found in the best open source projects; more on that later. However, most likely you'll end up programming your own stuff relying on one or many open source components; at least that's how I did it.

Martin