Monday, April 09, 2007
IT Consulting
On the other hand, my second professional life is more stimulating and comes with its share of challenges. In that second life I do freelance IT consulting in my area of expertise, i.e. everything that revolves around the analysis, design and development of applications built with Java and requiring interaction with database systems.
My career path has allowed me to master two specialties that carry certain incompatibilities: implementing database systems based on a relational data design, and implementing applications based on an object-oriented design (in my case, Java). It should be noted that some of these incompatibilities are sometimes the consequence of bad faith on the part of people in each of the two camps, but that is another story...
Knowing both specialties proves all the more useful with the explosion of applications on the Internet whose characteristics call for both of them.
All this to say that I am just about ripe for a career change... my job at IMS is less stimulating because we serve a much more conservative industry with long application life cycles (I am currently working on an application more than 30 years old running on a mainframe). I should add that I came from the somewhat crazy world of telecommunications, which is fairly intense and rich in innovation.
Martin
Tuesday, February 06, 2007
JDBC convenient programming with Spring
Most of my database-related code has usually made use of Hibernate, because the size and extent of the domain model simply justified taking on the complexity of that framework.
However, in my current project I decided to give the Spring JDBC support packages a shot, mostly because my domain model is simple enough (fewer than 10 entities with basic relationships) and because... well, it was a chance to explore more of the Spring library!
Although the library really only offers a thin wrapper around the JDBC API (as opposed to a complete ORM solution), it does so in a way that lets you write your data access code in a more object-oriented fashion while keeping close access to the underlying low-level JDBC API.
To illustrate this, I've created some helper classes (one per business entity) wrapping the Spring JDBC data access objects (e.g. SqlUpdate, JdbcTemplate, MappingSqlQuery) and offering a convenient way to centralize all SQL-related strings (SQL commands, table and field names, etc.).
On top of these classes, you can implement all the generic SQL access code (e.g. delete by id) and generic SQL commands. Here's what the superclass, referred to here as BaseSqlHelper, looks like:
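A minimal sketch of what such a superclass could look like, assuming the Spring 2.x org.springframework.jdbc.object classes; the column names and the two generic operations shown here are illustrative, not the original listing:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;
import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.SqlParameter;
import org.springframework.jdbc.object.MappingSqlQuery;
import org.springframework.jdbc.object.SqlUpdate;

/**
 * Generic SQL helper: one concrete subclass per business entity.
 * All SQL strings and the generic operations (find/delete by id) are
 * centralized here; subclasses only supply their table name, their
 * specific columns and the row-to-entity mapping.
 */
public abstract class BaseSqlHelper {

    // columns shared by all business entities
    protected static final String COL_ID = "id";
    protected static final String COL_CREATE_DATE = "create_date";

    private final JdbcTemplate jdbcTemplate;
    private final MappingSqlQuery findByIdQuery;
    private final SqlUpdate deleteByIdUpdate;

    protected BaseSqlHelper(DataSource dataSource, String tableName) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);

        // generic "select ... where id = ?" reusing the subclass row mapping
        this.findByIdQuery = new MappingSqlQuery(dataSource,
                "select * from " + tableName + " where " + COL_ID + " = ?") {
            protected Object mapRow(ResultSet rs, int rowNum) throws SQLException {
                return BaseSqlHelper.this.mapRow(rs, rowNum);
            }
        };
        this.findByIdQuery.declareParameter(new SqlParameter(Types.BIGINT));
        this.findByIdQuery.compile();

        // generic "delete ... where id = ?"
        this.deleteByIdUpdate = new SqlUpdate(dataSource,
                "delete from " + tableName + " where " + COL_ID + " = ?");
        this.deleteByIdUpdate.declareParameter(new SqlParameter(Types.BIGINT));
        this.deleteByIdUpdate.compile();
    }

    /** Each subclass maps a result set row to its business entity. */
    protected abstract Object mapRow(ResultSet rs, int rowNum) throws SQLException;

    public Object findById(long id) {
        return findByIdQuery.findObject(new Object[] { new Long(id) });
    }

    public int deleteById(long id) {
        return deleteByIdUpdate.update(new Object[] { new Long(id) });
    }

    /** Direct access to the underlying template for ad hoc statements. */
    protected JdbcTemplate getJdbcTemplate() {
        return jdbcTemplate;
    }
}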
And here's an example of how one could implement a particular subclass (in this specific example the subclass handles the SQL for the User business entity):
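An illustrative subclass, assuming a plain User JavaBean with id, name, email and createDate properties (hypothetical names):

import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

/**
 * Concrete helper for the (hypothetical) User entity: only the
 * entity-specific SQL fragments and the row mapping live here; the
 * generic operations (findById, deleteById) are inherited.
 */
public class UserSqlHelper extends BaseSqlHelper {

    private static final String TABLE = "users";
    private static final String COL_NAME = "name";
    private static final String COL_EMAIL = "email";

    public UserSqlHelper(DataSource dataSource) {
        super(dataSource, TABLE);
    }

    protected Object mapRow(ResultSet rs, int rowNum) throws SQLException {
        User user = new User();
        user.setId(rs.getLong(COL_ID));
        user.setName(rs.getString(COL_NAME));
        user.setEmail(rs.getString(COL_EMAIL));
        user.setCreateDate(rs.getTimestamp(COL_CREATE_DATE));
        return user;
    }
}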
What this offers you:
1- all your SQL strings centralized in one convenient place
2- code re-use, by moving the fields (e.g. id, createDate, ...) and SQL operations (e.g. delete from ... where id=, select ... from ... where id=) common to all business entities up into the superclass
3- the mapRow capability, which lets you treat query results as real business entities and not merely as data fields
4- ease of accommodating new fields and removing existing ones
Martin
Thursday, January 04, 2007
Using Hibernate
For example, concerning the use of the lazy attribute in the mappings, many applications are configured with lazy=false. This obviously simplifies the question of fetching object graphs, but it comes at a price, and that price is called performance!
Recommendation:
Practically all entities and composition associations should be configured lazily (i.e. lazy="true"). The lazy attribute must not be confused with the fetch attribute: the former only tells Hibernate whether it may create a proxy (which will require initialization within an open session) for the class or collection in question, while the latter indicates how the relations are to be fetched (eagerly or not).
Using lazy="false" has the side effect of pre-fetching all of the entity's dependencies, so that ultimately the whole database gets loaded as objects! This can be convenient, since you no longer have to ask yourself whether or not to initialize the dependencies, but it quickly becomes catastrophic once the data volume is significant.
The decision to eager-fetch or not is a strategy that must be defined at run-time, since different usages require different access to the data. The DAO layer can serve this purpose by offering options to load or not load the dependencies (with Hibernate.initialize()) before closing the Hibernate session. Configuring with lazy="false" simply eliminates that flexibility.
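As a small illustration, here is a minimal DAO sketch where the caller decides at run-time whether a lazy collection must be initialized before the session is closed; the Order entity and its getLines() collection are hypothetical names:

import org.hibernate.Hibernate;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

/**
 * DAO sketch: the caller decides at run-time whether the lazy 'lines'
 * collection must be initialized before the session is closed.
 * Order and its getLines() collection are hypothetical names.
 */
public class OrderDao {

    private final SessionFactory sessionFactory;

    public OrderDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public Order findById(Long id, boolean withLines) {
        Session session = sessionFactory.openSession();
        try {
            Order order = (Order) session.get(Order.class, id);
            if (withLines && order != null) {
                // force the lazy collection to load while the session is still open
                Hibernate.initialize(order.getLines());
            }
            return order;
        } finally {
            session.close();
        }
    }
}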
If, for whatever reason, a dependency of an entity must always be pre-fetched, then use the fetch="join" option in the mapping while keeping lazy at false for that dependent entity.
The author of Hibernate has himself acknowledged this and, since Hibernate 3.0, recommends that all class and collection mappings be lazy="true" (this is now the default).
Martin
Friday, November 10, 2006
Application Performance with ORM Tools
Obviously, some aspects matter more than others when it comes to the performance of applications connecting to relational databases. If, on top of that, an ORM mapping tool is used, then poor use of the tool can muddy the waters even further.
Generally speaking, here are, in order of importance, the elements to consider when optimizing applications that interact with a database through an ORM tool:
- The ER (Entity-Relationship) model of the database. No matter how well the application and the ORM modules are optimized, if the ER model of the database is rotten it will be hard to produce applications that perform well. Normally, when using ORM tools this problem does not arise, since the creation of the model more or less follows the UML model. It typically shows up when you depend on a legacy database that cannot be modified.
- The ORM configuration (fetching strategies, caching, report queries). This aspect is critical, because a bad configuration will inevitably result in very poor performance regardless of the other aspects, with degradation that grows linearly with the data volume.
- The physical model of the database (tables, constraints, indexes). Assuming the previous aspect is taken care of, this one will guarantee good performance as the data volume grows.
- The general, vendor-specific tuning of the database. This aspect is closely tied to the type of database; each vendor provides different mechanisms to improve the performance of OLTP or OLAP applications. It is hard to generalize and is the DBA's responsibility.
There are always exceptions, but in general these principles apply quite well to OLTP-type database applications.
Martin
Sunday, October 22, 2006
Dry Stone Wall
This summer I finally completed my little "natural style" landscaping around our pool. The hardest part was without a doubt the dry stone wall, which required several hundred large stones lying around here and there. After our house was built, all the landscaping remained to be done. We contracted out a few pieces, but the prohibitive cost of this type of seasonal labour convinced me to do part of it myself.
The ground our house is built on is literally overrun with stones, so I decided to kill two birds with one stone (bad pun): not just haul them away to get rid of them (sweating heavily), but reuse them at the same time (sweating even more heavily) to build a low wall, a terrace and some stepping stones. The only important principle for the wall is that, since there is no cement to hold it together (it stands by gravity alone), you must make sure each stone rests on at least three other stones.
And, after a big thank-you to my girlfriend and my mother for the lovely flower garden (ok, ok, also to my kids who tripped over it more than once), here is the result of this landscaping...!
Martin



Monday, October 16, 2006
Statistical General Concept
This post is part of the notes I'm gathering from various references providing theoretical background and explanations related to data mining, analytics and machine learning (see this post for the book references). I'm gathering these notes in the hope of being a little smarter in applying and interpreting data mining algorithms taken out of the box from mining tools (I must admit I'm also doing this to serve my endless quest for understanding).
This is actually linked (through an "iframe" tag) to a Google Doc that I keep updating as I face projects making use of new mining algorithms... so this is work in progress. I realize that a blog is probably not the best way to publish live text, but it is the easiest one for me.
This first part gathers basic topics from statistics that are difficult to classify under a very precise subject... it should pretty much serve as a refresher for most people in this domain.
- instances = the data objects observed and analysed (sometimes referred to as objects, data points, ...)
- variables = the characteristics measured (for continuous variables) or observed (for categorical ones) on each instance
- n: the number of data objects (the sample size)
- X: a generic input variable. When it is a vector, its j-th component variable is written with a subscript: Xj
- x denotes an observed instance; when we have p variables, x1 .. xp denote the real values of the 1 .. p variables measured on that particular object or instance
- xk(i) corresponds to the measurement of variable Xk on the i-th data object, where i ranges over 1 .. n
- x (in bold) corresponds to the vector of the n observations of a single variable x
- X (capital, in bold) corresponds to the n x p matrix containing the n observed p-vectors x(1) .. x(n) (written out as a matrix just below)
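As a small visual aid (this formula is my addition, not part of the original notes), the n x p data matrix described above can be written out as:

\mathbf{X} =
\begin{pmatrix}
x_1(1) & x_2(1) & \cdots & x_p(1) \\
x_1(2) & x_2(2) & \cdots & x_p(2) \\
\vdots & \vdots & \ddots & \vdots \\
x_1(n) & x_2(n) & \cdots & x_p(n)
\end{pmatrix}
\qquad \text{where } x_k(i) \text{ is the value of variable } X_k \text{ measured on the } i\text{-th object.}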
Friday, September 29, 2006
I hate UI-type development
I simply suck at it!
Although I enjoy using intuitive UIs and appreciate the design value behind them, I have neither the patience nor the talent to do it! In my view, spending so much time and effort just designing a nice HTML/JSP page or a rich-client equivalent (with SWT, for example) is too frustrating for the end result. I sometimes have to do it when delivering an end-to-end product for clients, and typically most of my time gets wasted on this UI stuff! I guess I could outsource all of it; I actually even tried it once... but finding a good designer willing to develop inside JSP pages is another challenge in its own right!!!
When I first did some RCP work in Eclipse, I appreciated all the advanced design patterns available in a library such as JFace, but I soon got bored and tired again of dealing with all these widget details. I'm hopeless.
I guess I'll stick to creating domain business layers, service business layers, data access layers, and other more non-visual features!
Martin
Tuesday, September 19, 2006
Paragliding
I love windsurfing, but I find its window of usability far too narrow (at least where I live at the moment)... which is why I always insist on having access to it during family trips down south. The last time (in Punta Cana), however, was rather disappointing: the days with good wind came with a ban, and on the days the ban was lifted the small sails were barely enough to get past the waves, so sailing was impossible. A kind of windsurfing catch-22!
Paragliding seems better suited to changing conditions, since it can exploit two kinds of wind: dynamic and thermal. The latter allows a significant climb in altitude, while the former is exploited along cliffs or mountain faces! The site where I practised was quite good, although far from alpine altitudes, but it offers the advantage of launch points facing all four cardinal directions, guaranteeing a launch (except on stormy or overly windy days). For those interested:
Here is a short sequence from the end of my tandem descent... just after a few good 360-degree spirals.
Martin
Tuesday, September 12, 2006
Unit Testing but...
However, there are a few stricter recommendations commonly found among unit-testing fanatics or extreme programming advocates that I find, to say the least, debatable:
- your unit test should focus on a single method of a single class, otherwise it is not really a unit test.
- always build your unit test first and then your application class afterward.
Point 1 emphasizes the term unit, and violating it makes your unit tests more like integration tests; I agree with that. But in my view those tend to be more meaningful and practical.
First of all, methods should be small, precise, have a single clear responsibility and have a descriptive name that conveys their purpose. As such, I tend to agree with recommendations that limit the size of a single method (R. Johnson gives a ballpark figure of 30-40 lines of code including all comments, while J. Kerievsky goes as far as recommending ten lines of code or fewer, with the majority using one to five lines). Keeping methods small and giving them intuitive names produces much easier, self-documented code: I like this idea since it helps reduce the need to document your code!
This is why I feel that principle 1 above runs against the "write short methods" approach, since a small method does not contain enough complex logic to require a dedicated unit test of its own.
A JUnit class that tests and validates the effect of each and every single method on the state of the current object, or of some other dependents (through mock objects), is often straightforward and thus overkill! Also, a large number of methods may not deserve a full dedicated test, since not only is their logic simple but their impact on state is minimal.
That's why I twist my unit tests a bit to make them more like integration tests, i.e. I test only the important methods of the class in relation with their impact on the class itself and on its dependencies (external libraries, other pieces of my code...). Ok, this is not always possible, especially when the dependency is a costly and resource-intensive component (then I'll use mocks for such cases), but very frequently this lets me validate and better understand the external library during my tests, as well as test my code against its dependency. I even find myself doing such integration tests with code at the service layer level (above the DAO layer) and validating its effect at the database tier. Using a small in-memory database engine such as HSQLDB helps offset the performance penalty of doing this.
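To make this concrete, here is a minimal sketch of such an integration-style test against an in-memory HSQLDB instance (JUnit 3 style; the UserService here is a trivial stand-in I wrote for the example, not real project code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import junit.framework.TestCase;

/**
 * Integration-style test sketch: exercise a service method and then
 * validate its effect directly at the database tier, using an in-memory
 * HSQLDB instance to keep the round trips cheap.
 */
public class UserServiceIntegrationTest extends TestCase {

    private Connection connection;

    protected void setUp() throws Exception {
        Class.forName("org.hsqldb.jdbcDriver");
        connection = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "");
        Statement stmt = connection.createStatement();
        stmt.execute("create table users (id identity, name varchar(50))");
        stmt.close();
    }

    protected void tearDown() throws Exception {
        Statement stmt = connection.createStatement();
        stmt.execute("shutdown");   // wipe the in-memory database between tests
        stmt.close();
        connection.close();
    }

    public void testCreateUserPersistsRow() throws Exception {
        // in a real project the service would be wired (e.g. by Spring) against
        // a DataSource pointing at the same in-memory database
        UserService service = new UserService(connection);
        service.createUser("martin");

        // validate the effect at the database tier, not only on object state
        Statement stmt = connection.createStatement();
        ResultSet rs = stmt.executeQuery("select count(*) from users where name = 'martin'");
        assertTrue(rs.next());
        assertEquals(1, rs.getInt(1));
        rs.close();
        stmt.close();
    }

    /** Trivial stand-in for the real service layer, just to keep the sketch self-contained. */
    static class UserService {
        private final Connection connection;

        UserService(Connection connection) {
            this.connection = connection;
        }

        void createUser(String name) throws Exception {
            Statement stmt = connection.createStatement();
            stmt.execute("insert into users (name) values ('" + name + "')");
            stmt.close();
        }
    }
}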
As for point 2, I usually adopt more of a concurrent approach, i.e. I draft the application class, and once it stabilizes I create the test class and make both evolve simultaneously. The first few versions of my class/interface are a bit too dynamic and sketchy to really deserve an accompanying test class. So, to limit the need to duplicate my changes in both, I'd rather wait until I'm comfortable with the class/interface and then proceed with writing test cases.
The only advantage I see in creating the test case first is when I really don't know how my object is going to be used by client code. However, in that case I'd rather use a pencil and sketch some use-case scenarios beforehand...
Martin
Thursday, September 07, 2006
Handling persistence with ORM
- avoiding the need to code against the lower-level JDBC API
- dealing with data persistence concerns more transparently, in a way better aligned with the object-oriented paradigm
- providing isolation from database vendor specifics, allowing easy porting to a number of different DB backends
- providing additional built-in services such as connection pooling, caching, etc.
- reducing the need to be highly skilled in SQL, although ignoring relational concepts and SQL altogether is definitely not realistic
- writing less code
On the flip side, I've realized that there are drawbacks as well, such as:
- providing least-common-denominator functionality to achieve DB neutrality
- losing control of the SQL statements automatically generated for us
- some performance degradation, no matter what the tool vendor claims (an ORM will always be one layer on top of JDBC...); however, a smart caching strategy can mitigate this
- requiring additional knowledge of the ORM API (so less code to write but more library code to understand and make use of)
- falling short when the application use cases are focused on data reporting and aggregation of large data volumes rather than on transaction-based data entry
Typically, on the last project I built using Hibernate, I enjoyed spending more time on the design of a good domain model layer since I spent less on the persistence logic. However, I discovered later, through more realistic usage and data-volume tests, that it suffered some nasty performance degradation in specific use cases that had not been discovered through unit testing (unit testing is only concerned with functional testing, not performance scaling issues).
Without going into details, the problem had to do with the number of round trips Hibernate was triggering to fetch object data graphs. I had designed some relations (1:1 or N:1) to be eagerly fetched (always fetch the related object) instead of using a lazy fetching strategy (hit the database only when necessary). This was good in some scenarios, since some data dependencies were always needed and this avoided a second database call to get the dependent object's data. However, when retrieving a collection, the effect was a separate DB call for every single element within the collection. So getting a list of N items resulted in N+1 DB calls! Alternative solutions exist, but the recommendation is to model most (if not all) object relations using a lazy strategy and adjust this default by specifying a different fetch mode at run-time.
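For illustration, here is a small sketch (the Item entity and its owner association are invented names) showing how, with the mapping left lazy, an eager fetch can be requested per use case at run-time:

import java.util.List;

import org.hibernate.FetchMode;
import org.hibernate.Session;

/**
 * Sketch: the mapping stays lazy and the fetch strategy is chosen per use case.
 * Item and its 'owner' association are hypothetical names.
 */
public class ItemDao {

    /** List browsing where owners are not needed: no extra joins are issued. */
    public List findAll(Session session) {
        return session.createCriteria(Item.class).list();
    }

    /**
     * Report that needs every owner: request an eager join for this call only,
     * so the N items and their owners come back in a single round trip instead
     * of the N+1 selects caused by a mapping-level eager fetch.
     */
    public List findAllWithOwner(Session session) {
        return session.createCriteria(Item.class)
                .setFetchMode("owner", FetchMode.JOIN)
                .list();
        // HQL equivalent: session.createQuery("from Item i left join fetch i.owner").list();
    }
}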
Bottom line: there is no magic bullet, especially when it comes to database interaction. You need a good grasp of relational database concepts in order to build applications that interact with a database, no matter what tool or framework you work with.
Martin
Sunday, September 03, 2006
Data mining background
Data mining has its roots, among other things, in statistics and machine learning. To generalize greatly, depending on your background these two communities tend to view data mining performance very differently... A statistical background will rate performance on statistical significance and inference power, whereas the computer scientist tends to measure performance on algorithm efficiency and scalability. I realize, however, that the two approaches really are two sides of the same coin, and this is reflected in the most recent scientific literature.
Various definitions of data mining can be found in the literature, but I personally prefer the more academic point of view to the marketing one commonly pushed by vendors. Here are some explanations excerpted from the excellent book « Principles of Data Mining » by David Hand, Heikki Mannila and Padhraic Smyth (MIT Press), which seems to be one of the few data mining books respected inside the statistical community.
Data mining is recognized as well-defined procedures that take data as input and produce output in the form of models (a summarized, global descriptive form of the data) or patterns (a descriptive form of some local phenomenon occurring in a fraction of the data). A well-defined procedure contrasts with a computational method, which, unlike a data mining procedure, is not guaranteed to terminate after a finite number of steps.
Data mining is concerned with building empirical models that are not based on some underlying theory about the mechanism through which the data arose, but rather models consisting of a description of the observed data. In this sense, a good model is qualified as « generative » in the sense that data generated according to the model will share the same characteristics as the real data from which the model was built.
They also offer an interesting decomposition of data mining algorithms into orthogonal components, which contrasts with the magical and reductionist view marketed by tool vendors (always built around the idea of simply applying a specific algorithm to magically accomplish the task at hand). In essence, a data mining algorithm intends to perform a specific task on a sample or on a complete multivariate dataset. But the task (1st component) is only one of several components that a mining algorithm usually addresses:
- Obviously, the data mining task in question: whether it be visualization, prediction (classification or regression), clustering, rule pattern discovery, summarization through a descriptive model, pattern recognition, etc.
- The functional form of the model or the structure of the pattern. Examples include linear regression forms, non-linear functions such as the ones produced by neural networks, a decision tree, a hierarchical clustering model, an association rule, etc. These forms delimit the boundary of what we can expect to approximate or learn.
- The score function used to judge the quality of the fitted model used to summarize the observed data, or of the pattern used to characterize a local structure of the data. This score function is what we try to maximize (or minimize) when we fit parameters to our model. It can be based on goodness-of-fit (how well the model describes the observed data) or on generalization performance, i.e. how well it describes data not yet observed (for prediction purposes). A concrete example is given right after this list.
- The search or optimization method used: the computational procedures and algorithms used for maximizing the score function for a particular model. The search can limit itself to selecting the best parameter values within the k-dimensional parameter space (as in the case of a k-th order polynomial functional form) when the structure is fixed; or we may first have to select the structure itself from a set of families of different structures.
- The data management technique used for storing, indexing and retrieving the data. This aspect becomes paramount when it is time to process massive data sets that rule out the use of main memory alone.
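As a concrete example of the score function component (my addition, not taken from the book): fitting a regression-type model f with parameter vector theta by minimizing the sum of squared errors over the n observed cases:

S(\theta) = \sum_{i=1}^{n} \bigl( y(i) - f(\mathbf{x}(i); \theta) \bigr)^{2}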
The ideal tool would allow you to use the different components independently from each other during a data mining activity. That level of agility and flexibility is, however, not available in today's tools... which may be justified and reasonable for the optimization and data management components, but much less so for the functional form and the score function.
In practice, tools usually offer pre-packaged algorithms, with which you can easily fall into the algorithm trap where you are only expected to apply some well-established mining algorithm to magically accomplish the specific task at hand. This is the typical black-box paradigm that I've learned to despise in data mining (note that black-box abstraction is generally beneficial, especially in the OO programming model).
My curiosity simply forces me to step back and understand the different mining algorithms applied to my data. After reviewing some of the literature, I realized that I came across a lot of the statistical theory and practice during my curriculum; I cannot say the same for machine learning (although I did some cluster analysis for my master's thesis). So, in the spirit of increasing your theoretical grasp of the underlying principles, here is a list of books I highly recommend (in ascending order of statistical prerequisites):
Intelligent Data Analysis. Michael Berthold and David J. Hand (editors). Springer (note that this one leans a bit toward AI with the inclusion of material such as fuzzy set logic and genetic algorithms)
Principles of Data Mining. David Hand, Heikki Mannila and Padhraic Smyth. MIT Press
Pattern Recognition and Machine Learning. Christopher Bishop. Springer
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani and Jerome Friedman. Springer
Martin
p.s. I will try to summarize some of the principles found in these books that I consider more than useful for any data mining practitioner to have. Although blogs are not the ideal platform for such knowledge sharing, it is the most convenient one I have at hand (at least currently).
Tuesday, August 15, 2006
Preferences
- doing the work myself versus having it done by others
- working on a smaller number of more demanding concurrent tasks versus a larger number of more routine and monotonous ones
- working within a precise, concrete framework versus a fuzzier, more theoretical one
- analysis and reasoning work that depends on logic versus research work that depends on accumulated knowledge
- understanding versus learning
- learning through understanding versus learning through memorization
- technical management of projects/resources versus administrative management
- varied, exploratory work versus fixed, recurring and therefore redundant work
This list is obviously dynamic, but let's say that it has been stabilizing quite well for a few years now. The fluctuations seem to come as additions rather than modifications!
Martin
Note: From time to time, and usually when my posts are more personal, I'll blog in French. As you may have already noticed, English is not my first language, but rather the language I most often use in my professional life.
Thursday, August 03, 2006
Java and Oracle
Originally, Oracle followed a database-centric strategy where nearly all software layers would be offered and hosted directly inside the database engine. This controversial strategy (to say the least) has since been reversed: starting with 9i and 10g, some J2EE technologies that had already been integrated inside the database (e.g. the EJB container, JSP and servlets) have been desupported.
The focus is now on providing a complete, J2EE-compliant application server suite outside the database, offering a vast number of services and support, much like IBM WebSphere, BEA WebLogic or JBoss Application Server.
That earlier strategy did, however, lead to the development (starting with Oracle 8i) of a fully functional and compatible Java Virtual Machine inside the database: OracleJVM.
Each of these two components is commented on next.
1- OracleJVM
As of the 10g release, OracleJVM offers these characteristics:
- supports J2SE 1.4.2 as specified by Sun Microsystems
- supports only the headless mode of Java AWT (i.e. no GUI can be materialized on the server or remotely)
- Java classes (bytecode), resource files and Java source code (optional) all reside in the database and are stored at the schema level (known as Java schema objects)
- each session (a user connecting to the database and calling Java code) sees its own private JVM (although for performance reasons the implementation does share some parts of the Java library between sessions)
- core Java class libraries are run natively through ahead-of-time compilation to platform-specific C code before runtime
- core Java libraries are stored and loaded within the PUBLIC schema and are thus available to all other schemas
- application-specific Java classes are stored and loaded within the user schema (the owner)
- besides writing the Java class, compiling it and running it, OracleJVM requires two extra steps in the development/deployment cycle: 1- the class needs to be loaded into the database (done through a utility called loadjava), 2- the class needs to be published when it is callable from SQL or PL/SQL (done by creating and compiling a call specification, a.k.a. a PL/SQL wrapper, to map the Java method's parameters and return type to Oracle SQL types)
- execution rights must also be granted when running Java classes located in another user's schema
- class loading is done dynamically as in a conventional JVM; however, it is done into shared memory, so only a one-time loading speed hit is encountered among all user code requiring the class
- instead of a global classpath defined at runtime to resolve and load all application classes, OracleJVM uses a resolver per class, specified at class installation, indicating in which schemas the classes it depends on reside
- multi-threading is usually achieved using the embedded scalability of the database server, making Java language threads needless since they won't improve the concurrency of the application (this helps avoid complex multi-threading issues inside the Java code)
- OracleJVM offers an adapted version of JDBC (called the server-side internal driver), which is specially tuned to provide fast access to Oracle data from Java stored procedures, as well as an optimized server-side SQLJ translator.
Execution control:
How exactly do we start a Java application located inside the Oracle database, or in other words, what is the equivalent of the static main method entry point of a "normal" application launched by a conventional JVM? This process is referred to in Oracle terminology as a call, and can be done by calling any static method within the available loaded and published classes. These published classes must therefore contain a static method entry point and qualify as the Java counterpart of a PL/SQL procedure (hence the term Java stored procedures).
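To make the call mechanics more concrete, here is a minimal sketch (the class, schema and procedure names are my own illustrations, not taken from Oracle's documentation) of a Java stored procedure with its static entry point; the loadjava step and the PL/SQL call specification described above are shown as comments:

/**
 * Minimal Java stored procedure sketch for OracleJVM: the entry point is
 * simply a public static method of a loaded and published class.
 */
public class Greeter {

    /** The static method that will be published to SQL and PL/SQL. */
    public static String greet(String name) {
        return "Hello " + name;
    }

    // Deployment steps described in the post, shown here as comments
    // (user/schema names are illustrative):
    //
    // 1- load the class into the database schema with the loadjava utility:
    //      loadjava -user scott/tiger Greeter.class
    //
    // 2- publish it with a PL/SQL call specification mapping SQL types to
    //    the Java signature:
    //      CREATE OR REPLACE FUNCTION greet (name VARCHAR2) RETURN VARCHAR2
    //      AS LANGUAGE JAVA NAME 'Greeter.greet(java.lang.String) return java.lang.String';
    //
    // 3- call it, for example from SQL:
    //      SELECT greet('Martin') FROM dual;
}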
Some possible scenarios of a Java call include:
- a SQL client program running a Java stored procedure
- a trigger (i.e. an event fired off by a defined SQL DML statement) running a Java stored procedure
- a PL/SQL program calling Java code
These Java stored procedures are callable from PL/SQL code but can also call PL/SQL procedures.
Some thoughts: even though I've never played with OracleJVM, I have yet to be convinced of its advantages: Java stored procedures feel a bit like writing Java code with a procedural mindset. It seems that the only advantage is the possibility to write and centralize business rules that are more portable and powerful than PL/SQL code and that remain available to applications written to bypass the application server tier?
2- Oracle OC4J J2EE Application Server (a.k.a. OracleAS):
This server, referred to as OC4J, now includes an ever-growing number of components (Web server, J2EE technologies, ORM with TopLink, portlets, wireless, Business Intelligence, etc.). Its J2EE support includes: JSP, servlets, JSF and the ADF framework (using an event-based model for Web HTTP processing), EJB, JNDI, XML support (schemas, namespaces, DOM, SAX, XPath...), and Web Services (WSDL, UDDI, SOAP).
The type of applications supported by this infrastructure is usually large and complex, i.e. applications that:
- involve multiple application tiers: the central AS tier where the business logic is maintained, a web tier (possibly part of the AS tier) interacting with Web clients, a backend database tier where persistent data is preciously stored, and client tiers from fat to thin
- involve multiple users with different roles and rights accessing common data concurrently
- involve different remote user sites (i.e. implying Web access) and heterogeneous environments
- involve sophisticated business rules
- involve interaction with other enterprise information systems (EIS) through the J2EE Connector Architecture (ERPs such as SAP, legacy information systems)
- involve web services support
Of course, not every application will need all of this, but to pull its weight and leverage this considerable software infrastructure, an application's specification should reach a fair level of complexity before committing to this framework. This technological heaviness is probably responsible for the creation of the lighter and simpler initiatives coming from the open source community (lightweight frameworks only requiring a web JSP/servlet container, such as the one I described here).
Martin
Monday, May 29, 2006
J2EE development
Before starting any J2EE Web development, I did my own research on the tools and libraries that would best meet my web transaction-based application requirements (things like flexibility, simplicity, availability, cost, adoption...). I finally decided to go with the Spring Framework for all integration code and parameterization settings, Hibernate on the data tier to handle the ORM aspect, and Struts for the Web tier. I have since discovered that this exact set of tools is promoted by SourceLabs (http://www.sourcelabs.com/?page=software&sub=sash) as the SASH stack. Although I appreciated developing with these libraries, I enjoyed even more the best practices that these frameworks encourage through the adoption of sound principles: loose coupling between components, separation of concerns, use of design patterns like MVC or dependency inversion, etc.
You have a feeling, when you build applications along these principles, that they are well architected and clean; you enjoy it even more when, 5-6 months later, the client calls you to update the requirements! Concretely, the application ends up with:
- A separate and "dumb" View layer (JSP pages);
- A separate Control layer (actions and action configuration files in Struts);
- A separate Model/Business layer (the business layer uses simple POJOs following JavaBean rules, which allows dependency injection with Spring);
- A separate Data layer (through DAOs and Hibernate ORM);
- An integration and configuration layer gluing all the layers together through the Spring bean application context file (see the sketch after this list);
- And finally a simple servlet/JSP container (e.g. Tomcat) to host and serve the deployed application.
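Here is a tiny sketch of how the business and data layers end up wired together (the names are invented for the illustration; this is not the distributed project itself):

import java.util.List;

/** DAO contract seen by the service layer; the implementation would use Hibernate. */
interface UserDao {
    List findByStatus(String status);
}

/**
 * Business/service layer as a plain POJO: it only knows the UserDao interface,
 * and Spring injects the Hibernate-backed implementation at run-time.
 */
public class UserService {

    private UserDao userDao;

    /** Setter used by Spring for dependency injection (JavaBean rule). */
    public void setUserDao(UserDao userDao) {
        this.userDao = userDao;
    }

    public List findActiveUsers() {
        return userDao.findByStatus("ACTIVE");
    }
}

// The corresponding glue in the Spring bean application context file would be
// something along these lines (XML shown as a comment to keep the snippet in Java):
//
//   <bean id="userDao" class="example.HibernateUserDao">
//     <property name="sessionFactory" ref="sessionFactory"/>
//   </bean>
//   <bean id="userService" class="example.UserService">
//     <property name="userDao" ref="userDao"/>
//   </bean>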
As a good advocate of open source, I put my principles into practice by making such an application available to anyone interested; just contact me by email and I'll send you a copy of the project.
Martin
Saturday, April 15, 2006
JDMAPI: some info
1- Mining Function
First, the features that are probably central to the API are called mining functions. These functions relate to the objective one wishes to achieve and operate on individual records or cases (a collection of related attributes belonging to one entity or transaction, used as input for building the data mining model, e.g. a customer, a purchase, etc.).
They can also be categorized by whether they are supervised (the model is trained and tested using a known target attribute value) or unsupervised (no target variable is used).
Looking at the enum type called MiningFunction, we find these supervised functions:
- Classification (predicts a target attribute of categorical type)
- Regression (predicts a target attribute of numerical, i.e. continuous, type)
- *Supervised Multi-Target (for models predicting multiple target attributes at once)
For unsupervised functions, we have:
- Clustering (associates each record with the natural cluster it is "closest" to)
- Association (discovers hidden interrelationships or correlations among the variables)
- *Anomaly Detection (identifies rare and unexpected cases)
- *Time Series (understands the pattern of, and forecasts, a time-ordered series of cases)
- *Feature Extraction (projects the set of all attributes into a much smaller set of features useful for visualization while capturing important characteristics of the data)
There is also a function applicable to both supervised and unsupervised settings, called Attribute Importance, which helps reduce the number of attributes and the complexity of the model to build. This function helps identify the most relevant attributes and reduces noise when building the mining model.
2- Mining Algorithm
To build a mining model for each of these functions, we need to apply specific algorithms. Checking the enum class called MiningAlgorithm, we find:
- Decision Tree
- Feed Forward Neural Net
- kMeans (a k-means clustering algorithm)
- Naive Bayes
- SVM Classification (for classification)
- SVM Regression (for regression)
- *Arima (for time series)
- *Arma (for time series)
- *Auto Regression (for time series)
- *NMF (Non-negative Matrix Factorization algorithm for feature extraction)
3- Mining Task
The API includes a definite set of tasks used to construct the various mining objects (see the Mining Objects section below). These tasks, defined by the enum class MiningTask, are given next:
- buildTask (constructs a mining model; see mining functions)
- testTask (validates the mining model on an independent test dataset; supervised only)
- applyTask (applies the mining model to a new data source)
- computeStatisticsTask (gets basic statistics on attributes from the source physical data)
- importTask/exportTask (interact with external applications/frameworks)
4- Mining Objects
Typically, when one submits mining tasks to the DME (data mining engine), this generates persistent objects called NamedObjects. These objects are normally stored in the mining repository and can be saved and restored through the API (a rough usage sketch is given at the end of this post):
- buildSettings (used to specify the model to be built, i.e. the mining function, the source data/attributes, the algorithm to be used with its settings, etc.)
- testMetrics (used to produce a test result)
- applySettings (used to define the data on which a model needs to be applied)
- physicalDataSet (a pointer to the original data used for build, test and apply, e.g. a database table or a file)
- logicalData (optionally describes the physical data set, to change attribute names and their types)
- model (used to store the resulting mining model)
- task (used to refer to existing tasks and their status)
- taxonomy (used to define a hierarchical grouping of categorical attribute values)
- costMatrix (a matrix used with classification to associate a cost with each actual-versus-predicted value pair)
Note: * refers to release 2.0 functions.
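To tie these pieces together, here is a rough sketch of the typical build sequence, written from memory of the JSR-73 pattern; the factory lookup strings and method signatures should be double-checked against the spec, and the table, attribute and saved-object names are invented:

import javax.datamining.ExecutionHandle;
import javax.datamining.data.PhysicalDataSet;
import javax.datamining.data.PhysicalDataSetFactory;
import javax.datamining.resource.Connection;
import javax.datamining.supervised.classification.ClassificationSettings;
import javax.datamining.supervised.classification.ClassificationSettingsFactory;
import javax.datamining.task.BuildTask;
import javax.datamining.task.BuildTaskFactory;

/**
 * Rough JDM build sequence: describe the physical data, create the build
 * settings for a mining function, create a build task and execute it on
 * the DME. Names of saved objects and of the source table are invented.
 */
public class ClassificationBuildExample {

    public static void buildModel(Connection dmeConn) throws Exception {
        // 1- physicalDataSet: point to the build data
        PhysicalDataSetFactory pdsFactory =
                (PhysicalDataSetFactory) dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
        PhysicalDataSet buildData = pdsFactory.create("CUSTOMERS_BUILD", false);
        dmeConn.saveObject("myBuildData", buildData, true);

        // 2- buildSettings: classification function, with its target attribute
        ClassificationSettingsFactory csFactory =
                (ClassificationSettingsFactory) dmeConn.getFactory(
                        "javax.datamining.supervised.classification.ClassificationSettings");
        ClassificationSettings settings = csFactory.create();
        settings.setTargetAttributeName("HAS_CHURNED");
        dmeConn.saveObject("mySettings", settings, true);

        // 3- buildTask: ties together the data, the settings and the future model name
        BuildTaskFactory btFactory =
                (BuildTaskFactory) dmeConn.getFactory("javax.datamining.task.BuildTask");
        BuildTask task = btFactory.create("myBuildData", "mySettings", "myModel");
        dmeConn.saveObject("myBuildTask", task, true);

        // 4- submit the task to the DME and wait for it to finish
        ExecutionHandle handle = dmeConn.execute("myBuildTask");
        handle.waitForCompletion(Integer.MAX_VALUE);
    }
}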
Martin