Monday, April 09, 2007

IT consulting

In my professional life I am lucky enough to combine two fairly distinct types of work. The first is relatively common, since it involves a job as technical lead of a team of developers building BI-type applications.

My second professional life, on the other hand, is more stimulating and comes with its share of challenges. In this second life I do freelance IT consulting in my area of expertise, i.e. everything that revolves around the analysis, design and development of applications written in Java and requiring interactions with database systems.

My career path has allowed me to master these two specialties, which come with certain incompatibilities: implementing database systems based on a relational data design, and implementing applications based on an object-oriented design (in my case, Java). It should be noted that some of these incompatibilities are sometimes a consequence of bad faith from people in each of the two camps, but that is another story...

Knowing both specialties proves all the more useful with the explosion of Internet applications whose characteristics call for both of them.

All this to say that I am just about ripe for a career change... my job at IMS is less stimulating because we serve a much more conservative industry with long application life cycles (I am currently working on an application more than 30 years old running on a mainframe). It must be said that I was coming from the somewhat crazy world of telecommunications, which is quite intense and rich in innovation.


Martin

Tuesday, February 06, 2007

Convenient JDBC programming with Spring

There is no need to reiterate what has already been said about the usefulness of the Spring framework... but it seems that every time I use one of its features I come to the same conclusion: it simply makes you more satisfied with your code, that is how I see it anyway!

Most of my database-related code has usually relied on Hibernate, because the size and extent of the domain model justified taking on the complexity of that framework.

However, in my current project I decided to give Spring's JDBC support packages a shot, mostly because my domain model is simple enough (fewer than 10 entities with basic relationships) and because... well, it was a chance to explore more of the Spring library!


Although the library really only offers a thin wrapper around the JDBC API (as opposed to a complete ORM solution), it does so in a way that lets you write your data access code in a more object-oriented fashion while keeping close access to the underlying low-level JDBC API.

To illustrate this, I've created some helper classes (one per business object entity) wrapping Spring's JDBC data access objects (e.g. SqlUpdate, JdbcTemplate, MappingSqlQuery) and offering a convenient way to centralize all SQL-related strings (SQL commands, table and field names, etc.).

On top of these classes, you can actually implement all generic SQL access code (e.g. delete by id) and generic SQL commands. Here's how the superclass, referred to here as BaseSqlHelper, could look:
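A minimal sketch of what such a superclass could look like, assuming Spring 2.x-era JDBC support; the class, column and method names below are illustrative rather than the original code.

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

// Superclass centralizing the field names and generic SQL operations shared
// by every business-entity helper (illustrative sketch).
public abstract class BaseSqlHelper {

    // Fields common to all business entities, moved up into the superclass
    protected static final String ID_COL = "id";
    protected static final String CREATE_DATE_COL = "createDate";

    protected final DataSource dataSource;
    protected final JdbcTemplate jdbcTemplate;

    protected BaseSqlHelper(DataSource dataSource) {
        this.dataSource = dataSource;
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // Each subclass centralizes its own table name
    protected abstract String getTableName();

    // Generic SQL operation available to all entities
    public int deleteById(long id) {
        return jdbcTemplate.update(
                "delete from " + getTableName() + " where " + ID_COL + " = ?",
                new Object[] { Long.valueOf(id) });
    }
}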


And here is one example of how one could implement a particular subclass (in this specific example the subclass handles the SQL for the User business entity):
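Again a hedged sketch rather than the original code: it assumes a simple User JavaBean with id, name and email properties, and uses the MappingSqlQuery class mentioned above to map rows back to the business object.

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;
import javax.sql.DataSource;
import org.springframework.jdbc.core.SqlParameter;
import org.springframework.jdbc.object.MappingSqlQuery;

// Helper centralizing all the SQL for the (hypothetical) User business entity.
public class UserSqlHelper extends BaseSqlHelper {

    // All User-specific SQL strings live in one place
    private static final String TABLE = "users";
    private static final String NAME_COL = "name";
    private static final String EMAIL_COL = "email";

    private final UserByIdQuery userByIdQuery;

    public UserSqlHelper(DataSource dataSource) {
        super(dataSource);
        this.userByIdQuery = new UserByIdQuery(dataSource);
    }

    protected String getTableName() {
        return TABLE;
    }

    public User findById(long id) {
        return (User) userByIdQuery.findObject(new Object[] { Long.valueOf(id) });
    }

    public int insert(User user) {
        return jdbcTemplate.update(
                "insert into " + TABLE + " (" + NAME_COL + ", " + EMAIL_COL + ") values (?, ?)",
                new Object[] { user.getName(), user.getEmail() });
    }

    // MappingSqlQuery turns each result-set row into a real User object
    private static class UserByIdQuery extends MappingSqlQuery {

        UserByIdQuery(DataSource ds) {
            super(ds, "select " + ID_COL + ", " + NAME_COL + ", " + EMAIL_COL
                    + " from " + TABLE + " where " + ID_COL + " = ?");
            declareParameter(new SqlParameter(Types.BIGINT));
            compile();
        }

        protected Object mapRow(ResultSet rs, int rowNum) throws SQLException {
            User user = new User();
            user.setId(rs.getLong(ID_COL));
            user.setName(rs.getString(NAME_COL));
            user.setEmail(rs.getString(EMAIL_COL));
            return user;
        }
    }
}

Client code would then simply do something like new UserSqlHelper(dataSource).findById(42L) without touching any JDBC plumbing.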



What does this offer you:
1- all your SQL strings centralized in one convenient place
2- code reuse, by moving all common fields (e.g. id, createDate, ...) and common SQL operations (e.g. delete from ... where id=, select ... from ... where id=) for all business entities up into the superclass
3- the mapRow capability to treat query results as real business entities and not merely as data fields
4- ease of accommodating new fields and removing existing ones


Martin

Thursday, January 04, 2007

Using Hibernate

Following up on the previous post, I have observed that some applications are built without taking Hibernate's limits and constraints into account... they use this library as the silver bullet that will take care of everything related to persistence, with no effort or special configuration!

For example, regarding the use of the lazy parameter in the mapping, many applications are configured with lazy=false. This obviously simplifies the question of fetching object graphs, but it comes at a price, and that price is called performance!


Recommendation:

Practically all entities and composition associations should be configured as lazy (i.e. lazy="true"). The lazy parameter should not be confused with the fetch parameter: the former only tells Hibernate whether it may create a proxy (which will require initialization within an open session) for the class or collection in question, whereas the latter indicates how the relations should be fetched (eagerly or not).

Using lazy="false" has the side effect of pre-fetching all of the entity's dependencies, and ultimately the entire database ends up loaded as objects! This can be convenient, since you never have to ask yourself whether or not to initialize the dependencies, but it quickly becomes catastrophic when the data volume is significant.

The decision to eager-fetch or not is a strategy that should be decided at runtime, since different usages require different data access. The DAO layer can be used for this purpose by offering options to load or not load the dependencies (with Hibernate.initialize()) before closing the Hibernate session. Configuring with lazy="false" simply throws that flexibility away.
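A minimal sketch of what such a DAO option can look like, assuming Hibernate 3 and an illustrative Order entity with a lazily mapped items collection:

import org.hibernate.Hibernate;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

// DAO that keeps the mapping lazy and lets the caller decide, per call,
// whether a dependency must be initialized before the session is closed.
public class OrderDao {

    private final SessionFactory sessionFactory;

    public OrderDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public Order findById(Long id, boolean loadItems) {
        Session session = sessionFactory.openSession();
        try {
            Order order = (Order) session.get(Order.class, id);
            if (order != null && loadItems) {
                // Force the lazy "items" collection to load while the session
                // is still open; otherwise it remains an uninitialized proxy
                Hibernate.initialize(order.getItems());
            }
            return order;
        } finally {
            session.close();
        }
    }
}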

If for some reason a dependency of an entity must always be pre-fetched, then use the fetch=join option in the mapping, while keeping lazy set to false for that dependent entity.

The author of Hibernate has himself acknowledged this and recommends, since Hibernate 3.0, that all class and collection mappings be lazy="true" (this is now the default).

Martin

Friday, November 10, 2006

Application performance with ORM tools

Obviously, some aspects matter more than others when it comes to the performance of applications connecting to relational databases. If an ORM mapping tool is used on top of that, then misuse of the tool can muddy the waters even further.

Generally speaking, here are, in order of importance, the elements to consider when optimizing applications that interact with a database through an ORM tool:

  1. The ER (Entity-Relationship) model of the database. No matter how well the application and the ORM modules are optimized, if the ER model of the database is rotten it will be hard to produce applications that perform well. Normally this problem does not arise when using ORM tools, since the creation of the model more or less follows the UML model. It usually shows up when you depend on a legacy database that cannot be modified.
  2. The ORM configuration (fetching strategies, caching, report queries). This aspect is critical, because a bad configuration will inevitably result in very poor performance regardless of the other aspects, with a degradation that grows linearly with the data volume.
  3. The physical model of the database (tables, constraints, indexes). Assuming the previous aspect is taken care of, this is what will guarantee good performance as the data volume grows.
  4. The general tuning of the database (vendor-specific). This aspect is closely tied to the type of database; each vendor provides different mechanisms to improve the performance of OLTP or OLAP applications. It is hard to generalize and is the DBA's responsibility.

There are always exceptions, but in general these principles apply quite well to OLTP-type applications backed by a database.

Martin

Sunday, October 22, 2006

Dry stone wall

Phew!

This summer I finally completed my little "natural style" landscaping around our pool. The hardest part was without a doubt the dry stone wall, which required several hundred large stones lying around here and there. After our house was built, all the landscaping remained to be done. We contracted out a few jobs, but the exorbitant cost of this kind of seasonal labour convinced me to do part of it myself.

The ground our house is built on is literally overrun with stones, so I decided to kill two birds with one stone (bad pun): not only haul them away to get rid of them (lots of sweat), but also reuse them at the same time (even more sweat) to build a low wall, a terrace and garden steps. The only important principle for the wall is that, since there is no cement holding it together (it stands by gravity alone), you must make sure every stone rests on at least three other stones.

And after a big thank-you to my girlfriend and my mother for the lovely flower garden (ok, ok, also to my kids who tripped over it more than once), here is the result of this landscaping...!

Martin



Monday, October 16, 2006

General Statistical Concepts

Update: Most of these notes are a work in progress and likely to remain so... well, no end date planned! These notes, originally included as a Google Doc iframe, have thus been removed from this blog... I may decide to make them public directly as a Google Doc later on.



This post is part of the notes I'm gathering from various references providing theoretical background and explanations related to data mining, analytics and machine learning (see this post for the book references). I'm gathering these notes in the hope of being a little smarter when applying and interpreting data mining algorithms taken out of the box from mining tools (I must admit I'm also doing this to serve my endless quest for understanding).

This is actually linked (through an "iframe" tag) to a Google Doc that I keep updating as I face projects making use of new mining algorithms... so this is a work in progress. I realize that a blog is probably not the best way to publish a living document, but it is the easiest one for me.

This first part gathers basic statistics topics that are difficult to file under a more precise subject... it should pretty much serve as a refresher for most people in this domain.


Definitions:
  • instances = the data objects observed and analysed (sometimes referred to as objects, data points, ...)
  • variables = the characteristics measured (for continuous variables) or observed (for categorical ones) on each instance

Notation:
  • n is the number of data objects (the sample size)
  • X is a generic input variable. When it is a vector, its j-th component variable is written with a subscript: Xj
  • x denotes an observed instance; when we have p variables, x1 .. xp denote the real values of the 1 .. p variables measured on that particular object or instance.
  • xk(i) is the measured value of variable Xk for the i-th data object, where i ranges from 1 to n.
  • x (in bold) is the vector of the n observations of a single variable x.
  • X (capital, in bold) is the n × p matrix containing the n observed p-vectors.
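As a small illustration of the last two items (my own layout of the notation above, not from the original notes), the whole data set can be arranged as the matrix

\mathbf{X} =
\begin{pmatrix}
x_1(1) & x_2(1) & \cdots & x_p(1) \\
x_1(2) & x_2(2) & \cdots & x_p(2) \\
\vdots & \vdots & \ddots & \vdots \\
x_1(n) & x_2(n) & \cdots & x_p(n)
\end{pmatrix}

where row i holds the p measurements of the i-th instance and column k is the bold vector x of the n observations of variable Xk.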

iFrame notes now removed.


Friday, September 29, 2006

I hate UI-type development

Why do I hate having to deal with UI-type development... I don't know, but I have some hints:

I simply suck at it!

Although I enjoy using intuitive UIs and appreciate the design value in them, I have neither the patience nor the talent to build them! In my view, so much time and effort spent simply designing a nice HTML/JSP page or a rich-client equivalent (with SWT, for example) is too frustrating for the end result. I sometimes have to do it when delivering an end-to-end product for clients, and typically most of my time gets wasted on this UI stuff! I guess I could outsource all of it; I actually even tried that once... but finding a good designer willing to develop inside a JSP page is another challenge all on its own!!!

When I first did some RCP work in Eclipse, I appreciated all the advanced design patterns available in libraries such as JFace, but I soon got bored and tired again of dealing with all these widget detail considerations. I'm hopeless.

I guess I'll stick to creating domain business layers, service business layers, data access layers, and other more non-visual features!

Martin

Tuesday, September 19, 2006

Paragliding

I had the chance to try paragliding during my last vacation and I loved the sensations!!! It confirms my attraction to sports that exploit wind dynamics: I love the magic that operates when you are carried by the force of the wind... and the way you pick up speed despite the absence of engine noise.

I like windsurfing but find its window of use far too narrow (at least where I currently live)... which is why I always insist on having access to it during family trips down south. The last time (in Punta Cana), however, was rather disappointing: the days with good wind came with a ban, and on the days the ban was lifted the small sails were barely enough to get out of the waves, so nothing could be done; a kind of windsurfing catch-22!

Paragliding seems better suited to varying conditions because it can exploit two types of wind: dynamic and thermal. The latter allows a significant climb in altitude, while the former is exploited along cliffs or mountain faces! The site where I practised was quite good, although far from alpine altitudes, but it offers launches from all four cardinal points, guaranteeing a departure (except on stormy or overly windy days). For those interested:


Here is a short sequence from the end of my tandem descent... just after doing a few good 360-degree spins, accompanied by who knows how many Gs, enough to feel my brain drop right down to foot level! And yes, the famous centrifugal force (or would it be centripetal... my high-school physics classes are somewhat buried under the mass of experience that keeps expanding over time!) plays a big role in this kind of manoeuvre...




Martin

Tuesday, September 12, 2006

Unit Testing but...

I really appreciate developing with the unit testing approach, and as such I always have a JUnit library somewhere in my classpath while building an application. It really boosts my confidence in my code and allows me to refactor at ease without worrying about breaking existing functionality.
However, there are a few stricter recommendations commonly found among unit-testing fanatics or extreme programming advocates that I find, to say the least, debatable:


  1. your unit test should focus on a single method of a single class, otherwise it is not really a unit test.
  2. always write your unit test first and your application class afterward.

Point 1 emphasizes the term unit, and violating it makes your unit tests more like integration tests, I agree. But in my view those tend to be more meaningful and practical.

First of all, methods should be small, precise, have a single clear responsibility and have descriptive names that convey their purpose. As such I tend to agree with the recommendations that limit the size of a single method (R. Johnson gave a ballpark figure of 30-40 lines of code including all comments, while J. Kerievsky goes as far as recommending ten lines of code or fewer, with the majority using one to five lines). Keeping methods small and giving them intuitive names produces much easier, self-documenting code: I like this idea since it reduces the need to document your code!

This is why I feel that principle 1 above works against the "write short methods" approach, since a small method does not contain enough complex logic to require a dedicated unit test of its own.


A JUnit class that tests and validates the effect of each and every single method on the state of the current object or of some dependents (through mock objects) is often straightforward and thus overkill! Also, a large number of methods may not deserve a fully dedicated test, since not only is their logic simple, but their impact on state is minimal.

That's why I twist my unit tests a bit to make them more like integration tests, i.e. test only the important methods of the class in relation to their impact on the class itself and on its dependencies (external libraries, other pieces of my code, ...). Ok, this is not always possible, especially when the dependency is a costly and resource-intensive component (I'll use mocks in such cases), but quite frequently this lets me validate and better understand the external library during my tests, as well as test my code against its dependency. I even find myself doing such integration tests with code at the service layer level (above the DAO layer) and validating its effect at the database tier. Using a small in-memory database engine such as HSQLDB helps negate the performance penalty of doing this.
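A minimal sketch of that kind of test, assuming JUnit 3.8, Spring's DriverManagerDataSource and purely hypothetical UserDao/UserService classes:

import junit.framework.TestCase;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

// "Integrated" unit test: exercises a service method down to the database
// tier, against an in-memory HSQLDB instance so it stays fast.
public class UserServiceIntegratedTest extends TestCase {

    private UserService userService;

    protected void setUp() {
        // In-memory HSQLDB database, created fresh for the test run
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("org.hsqldb.jdbcDriver");
        ds.setUrl("jdbc:hsqldb:mem:testdb");
        ds.setUsername("sa");
        ds.setPassword("");

        UserDao userDao = new UserDao(ds);  // hypothetical DAO layer
        userDao.createSchema();             // hypothetical helper creating the tables
        userService = new UserService(userDao);
    }

    public void testRegisterThenFindUser() {
        // Call the service method, then validate its effect at the database tier
        userService.register("martin", "martin@example.com");
        assertEquals(1, userService.findAllUsers().size());
    }
}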

As for point 2, I usually adopt more of a concurrent approach, i.e. draft the application class and, once it stabilizes, create the test class and make the two evolve simultaneously. The first few versions of my class/interface are a bit too dynamic and sketchy to really have an accompanying test class. So, to limit the need to duplicate my changes in both, I'd rather wait until I'm more comfortable with the class/interface and then proceed with writing test cases.
The only advantage I see in creating the test case first is when I really don't know how my object is going to be used by the client code. However, in that case, I'd rather take a pencil and sketch some use-case scenarios beforehand...

Martin

Thursday, September 07, 2006

Handling persistence with ORM

In Java application development, using an Object-Relational Mapper to connect to the database typically offers many advantages:

  • avoiding the need to code against the lower-level JDBC API
  • dealing with data persistence concerns in a more transparent way, better aligned with the object-oriented paradigm
  • providing isolation from database vendor specifics, allowing easy porting to a number of different DB backends
  • providing additional built-in services such as connection pooling, caching, etc.
  • reducing the need to be highly skilled in SQL, although ignoring relational concepts and SQL altogether is definitely not realistic
  • writing less code

On the flip side, I've realized that there are drawbacks as well, such as:

  • providing least-common-denominator functionality to achieve DB neutrality
  • losing control over the SQL statements automatically generated for us
  • some performance degradation, no matter what the tool vendor pretends (an ORM will always be one layer on top of JDBC...), although a smart caching strategy can mitigate this
  • requiring additional knowledge of the ORM API (so less code to write, but more library code to understand and make use of)
  • falling short when the application use cases focus more on data reporting and aggregation of large data volumes than on transaction-based data entry

Typically, on the last project I built using Hibernate, I enjoyed spending more time on the design of a good domain model layer since I spent less on persistence logic concerns. However, I discovered later, through more realistic usage and data volume tests, that it suffered some nasty performance degradation in specific use cases that had not been discovered through unit testing (unit testing is only concerned with functional testing, not with performance and scaling issues).

Without going into details, the problem had to do with the number of round trips Hibernate was triggering to fetch an object data graph. I had designed some relations (1:1 or N:1) to be eagerly fetched (always fetch the related object) instead of using a lazy fetching strategy (fetch from the database only when necessary). This was good in some scenarios, since some data dependencies were always needed and this avoided a second database call to get the dependent object's data. However, when fetching a collection, the effect was actually a separate DB call for every single element within the collection, so getting a list of N items resulted in N+1 DB calls! Alternative solutions exist, but the recommendation is to model most (if not all) object relations using a lazy strategy and adjust this default by specifying a different fetch mode at runtime.
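A small sketch of that runtime adjustment with Hibernate's Criteria API (the Order entity and its customer association are illustrative): the association stays lazy in the mapping, but this particular query joins it in a single SQL statement instead of triggering N+1 round trips.

import java.util.List;
import org.hibernate.FetchMode;
import org.hibernate.Session;

public class OrderQueries {

    // Fetch all orders together with their customer in one round trip
    public List findAllWithCustomer(Session session) {
        return session.createCriteria(Order.class)
                .setFetchMode("customer", FetchMode.JOIN)  // eager for this call only
                .list();
    }
}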

Bottom line, there is no magic bullet, especially when it comes to database interaction. We need a good grasp of relational database concepts in order to build applications that interact with a database, no matter what tools or frameworks we work with.

Martin

Sunday, September 03, 2006

Data mining background

Data mining has its roots, among other things, in statistics and machine learning. To generalize greatly, depending on your background you will tend to view data mining performance very differently... people with a statistical background rate performance on statistical significance and inferential power, whereas computer scientists tend to measure performance on both algorithmic efficiency and scalability. However, I realize that the two approaches really are two sides of the same coin, and this is reflected in the most recent scientific literature.

Various definitions of data mining can be found in the literature, but I personally prefer the more academic point of view to the marketing one commonly pushed by vendors. Here are some explanatory excerpts taken from the excellent book « Principles of Data Mining » by David Hand, Heikki Mannila and Padhraic Smyth (MIT Press), which seems to be one of the few data mining books respected inside the statistical community.


Data mining is recognized as a set of well-defined procedures that take data as input and produce output in the form of models (a summarized, descriptive form of the data as a whole) or patterns (a descriptive form of some local phenomenon occurring on a fraction of the data). A well-defined procedure contrasts with a computational method, which, unlike a data mining procedure, is not guaranteed to terminate after a finite number of steps.

Data mining is concerned with building empirical models that are not based on some underlying theory about the mechanism through which the data arose, but rather consist of a description of the observed data. In this sense, a good model is qualified as « generative » in the sense that data generated according to the model will share the same characteristics as the real data from which the model was built.


They also offer an interesting decomposition of data mining algorithms into orthogonal components, which contrasts with the magical and reductionist view marketed by tool vendors (always built around the idea of simply applying a specific algorithm to magically accomplish the task at hand). In essence, a data mining algorithm is intended to perform a specific task on a sample or on a complete multivariate dataset. But the task (1st component) is only one of several components that a mining algorithm usually addresses:

  1. Obviously, the data mining task in question: whether it be visualization, prediction (classification or regression), clustering, rule pattern discovery, summarization through a descriptive model, pattern recognition, etc.

  2. The functional form of the model or the structure of the pattern. Examples include linear regression forms, non-linear functions such as those produced by a neural network, a decision tree form, a hierarchical clustering model, an association rule, etc. These forms delimit the boundary of what we can expect to approximate or learn.

  3. The score function used to judge the quality of the fitted model used to summarize the observed data, or of the pattern used to characterize a local structure of the data. This score function is what we try to maximize (or minimize) when we fit parameters to our model. It can be based on goodness of fit (how well the model describes the observed data), or on generalization performance, i.e. how well it describes data not yet observed (for prediction purposes).

  4. The search or optimization method used: the computational procedures and algorithms used to maximize the score function for a particular model. The search may limit itself to selecting the best parameter values within the k-parameter space (as in the case of a k-th order polynomial functional form) when the structure is fixed, and we may also have to select first among a set or family of different structures.

  5. The data management technique used for storing, indexing and retrieving the data. This aspect becomes paramount when it is time to process massive data sets that rule out relying on main memory alone.
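As a simple worked illustration of components 2 to 4 (my own example, not from the book), take ordinary linear regression on a single variable. The functional form is

f(x; a, b) = a x + b

the score function is the goodness-of-fit measured by the sum of squared errors over the n observed cases,

S(a, b) = \sum_{i=1}^{n} \left( y(i) - a\,x(i) - b \right)^2

and the search/optimization method is whatever procedure finds the parameters minimizing S, here simply solving \partial S / \partial a = 0 and \partial S / \partial b = 0 in closed form (least squares). Swapping a single component, say the squared-error score for an absolute-error one, already gives a different mining algorithm for the same task.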

The ideal tool would let you use the different components independently of one another during a data mining activity. That level of agility and flexibility is, however, not available in today's tools... which may be justified and reasonable for the optimization and data management components, but much less so for the functional form and the score function.

In practice, tools usually offer pre-packaged algorithms, with which you can easily fall into the algorithm trap where you are only expected to apply some well-established mining algorithm to magically accomplish the specific task at hand. This is the typical black-box paradigm that I've learned to despise in data mining (note that black-box abstraction is otherwise generally beneficial, especially in the OO software programming model).

My curiosity simply forces me to step back and figure out what all these different mining algorithms actually do to my data. After reviewing some of the literature, I realized that I had come across a lot of the statistical theory and practice during my curriculum; however, I cannot say the same for machine learning (although I did some cluster analysis during my master's thesis). So, in the spirit of increasing your theoretical grasp of the underlying principles, let me give you a list of the books I highly recommend (in ascending order of statistical theory prerequisites):

Martin

p.s. I will try to summarize some of the principles found in these books that I consider more than useful for any data mining practitioner. Although blogs are not the ideal platform for this kind of knowledge sharing, it is the most convenient one I have at hand (at least currently).

Tuesday, August 15, 2006

Preferences

With time and above all experience, I am getting better at knowing my preferences in terms of work... here is a short list of items in the format "suits me versus suits me less":
  • doing versus having others do
  • working on a smaller number of more demanding concurrent tasks versus a larger number of more routine, monotonous tasks
  • working within a precise, concrete framework versus a fuzzier, more theoretical one
  • analysis and reasoning work that relies on logic versus research work that relies on accumulated knowledge
  • understanding versus learning
  • learning through comprehension versus learning through memorization
  • technical management of projects/resources versus administrative management
  • varied and exploratory work versus fixed, recurring and therefore redundant work.

This list is obviously dynamic but, let's say, it has been stabilizing quite well for a few years now. The fluctuations seem to be additions rather than modifications!

Martin


Note: From time to time, usually when my posts are more personal, I'll blog in French. As you may have already noticed, English is not my first language, but rather the language I most often use in my professional life.

Thursday, August 03, 2006

Java and Oracle

Oracle has been committed, since Oracle8i, to integrating Java within its database/application architecture. Being confronted with the development of a particular application highly tied to Oracle, I'm taking the opportunity to review the current state of affairs as of Oracle 10g. Here's what I found:

Originally the strategy was database-centric: nearly all software layers would be offered and hosted directly inside the database engine. This controversial strategy (to say the least) has since been reversed in 9i and 10g, where some J2EE technologies already integrated inside the database (e.g. the EJB container, JSP and servlets) have been desupported.

The focus is now on providing a complete Application Server suite (J2EE compliant) outside the database, offering a vast number of services and support features, much like IBM WebSphere, BEA WebLogic or JBoss Application Server.

The original strategy did, however, lead to the development (from the beginning of Oracle 8i) of a fully functional and compatible Java Virtual Machine inside the database: OracleJVM.

Each of these two components is commented on next.


1- OracleJVM

As of the 10g release, OracleJVM has the following characteristics:

  • supports J2SE 1.4.2 as specified by Sun Microsystems
  • supports only the headless mode of Java AWT (i.e. no GUI can be materialized on the server or remotely)
  • Java classes (bytecode), resource files and Java source code (optional) all reside in the database and are stored at the schema level (known as Java schema objects)
  • each session (a user connecting to the database and calling Java code) sees its own private JVM (although for performance reasons the implementation does share some parts of the Java library between sessions)
  • core Java class libraries are run natively through ahead-of-time compilation to platform-specific C code before runtime
  • core Java libraries are stored and loaded within the PUBLIC schema and are thus available to all other schemas
  • application-specific Java classes are stored and loaded within the user schema (the owner)
  • besides writing the Java class, compiling it and running it, OracleJVM requires two extra steps in its development/deployment cycle: 1- the class needs to be loaded into the database (done through a utility called loadjava), 2- the class needs to be published when callable from SQL or PL/SQL (done by creating and compiling a call specification, a.k.a. PL/SQL wrapper) mapping the Java method's parameter and return types to Oracle SQL types (a small sketch follows the Execution control section below)
  • execution rights also need to be granted to run a Java class located in another user's schema
  • class loading is done dynamically as in a conventional JVM; however, it is done into shared memory, so only a one-time loading speed hit is incurred across all user code requiring the class
  • instead of a global classpath defined at runtime to resolve and load all application classes, OracleJVM uses a per-class resolver, specified during class installation, indicating in which schemas the dependent classes reside
  • multi-threading is usually achieved through the embedded scalability of the database server, making Java language threads needless since they won't improve the concurrency of the application (this helps avoid complex multi-threading issues inside Java code)
  • OracleJVM offers an adapted version of JDBC (called the server-side internal driver), specially tuned to provide fast access to Oracle data from Java stored procedures, as well as an optimized server-side SQLJ translator.

Execution control:

How exactly do we start a Java application located inside the Oracle database, or in other words, what is the equivalent of the static main entry point of a "normal" application launched by a conventional JVM? This process is referred to in Oracle terminology as a call, and it can be done by calling any static method of the available loaded and published classes. These published classes must therefore contain a static method entry point and are qualified as the Java counterpart of a PL/SQL procedure (referred to by the term Java stored procedures).

Some possible scenarios of a Java call include:

  1. a SQL client program running a Java stored procedure
  2. a trigger (i.e. an event fired by a defined SQL DML statement) running a Java stored procedure
  3. a PL/SQL program calling Java code

These Java stored procedures are callable from PL/SQL code but can also call PL/SQL procedures themselves.
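A hypothetical, minimal illustration of the cycle described above (the class, schema and object names are mine, purely illustrative):

// A Java class intended to run inside OracleJVM as a Java stored procedure:
// the published entry point must be a public static method.
public class Hello {

    public static String greet(String name) {
        return "Hello, " + name;
    }
}

/*
  Sketch of the extra deployment steps mentioned earlier:

  1- Load the class into the user's schema with the loadjava utility:
       loadjava -user scott/tiger Hello.class

  2- Publish it with a call specification (the PL/SQL wrapper) mapping the
     Java parameter and return types to Oracle SQL types:
       CREATE OR REPLACE FUNCTION hello_greet (p_name VARCHAR2) RETURN VARCHAR2
       AS LANGUAGE JAVA
       NAME 'Hello.greet(java.lang.String) return java.lang.String';

  3- The call can then be made from SQL or PL/SQL:
       SELECT hello_greet('Martin') FROM dual;
*/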

Some thoughts: even though I've never played with OracleJVM, I'm yet to be convinced of its advantages: Java stored procedures seem a bit like writing Java code with a procedural mindset. It seems that the only advantage is the possibility to write and centralize business rules that are more portable and powerful than PL/SQL code and that remain available to applications written to bypass the Application Server tier.

2- Oracle OC4J J2EE Application Server (a.k.a. OracleAS):

This server, referred to as OC4J, now includes an ever-growing number of components (Web server, J2EE technology, ORM with TopLink, portlets, wireless, Business Intelligence, etc.). Its J2EE support includes: JSP, servlets, the JSF and ADF frameworks (using an event-based model for web HTTP processing), EJB, JNDI, XML support (schemas, namespaces, DOM, SAX, XPath...), and Web Services (WSDL, UDDI, SOAP).

The type of applications supported by this infrastructure is usually large and complex, i.e. they:

  • involve multiple application tiers: the central AS tier where the business logic is maintained, a web tier (possibly part of the AS tier) interacting with Web clients, a backend database tier where persistent data is preciously stored, and client tiers ranging from fat to thin
  • involve multiple users with different roles and rights accessing common data concurrently
  • involve different remote user sites (i.e. implying Web access) and heterogeneous environments
  • involve sophisticated business rules
  • involve interaction with other enterprise information systems (EIS) through the J2EE Connector Architecture (ERP such as SAP, legacy information systems)
  • involve web services support

Of course, not all applications will need all of this, but to pull its weight and leverage such a considerable software infrastructure, an application's specification should reach a fair level of complexity before committing to this framework. This technological heaviness is probably responsible for the creation of the lighter and simpler initiatives coming from the open source community (lightweight frameworks only requiring a web JSP/servlet container, such as the one I described here).

Martin

Monday, May 29, 2006

J2EE development

Before starting any J2EE Web development, I did my own research on the tools and libraries that would best meet my web transaction-based application requirements (e.g. things like flexibility, simplicity, availability, cost, adoption...). I finally decided to go with the Spring Framework for all integration code and parameterization settings, Hibernate on the data tier to handle the ORM aspect, and Struts for the Web tier. I have since discovered that this exact set of tools is promoted by Source Labs (http://www.sourcelabs.com/?page=software&sub=sash) as the SASH stack. Although I appreciated developing with these libraries, I enjoyed even more the best practices these frameworks encourage through the adoption of sound principles: loose coupling between components, separation of concerns, use of design patterns like MVC or dependency inversion, etc.


You get the feeling, when building applications along these principles, that they are well architected and clean, but you enjoy it even more when, 5-6 months later, the client calls you to update the requirements!

Without going into details, a web application developed with these frameworks usually follows an architecture along these lines (a small wiring sketch follows the list):
  1. A separate and "dumb" View layer (JSP pages);
  2. A separate Control layer (actions and action setting files in Struts);
  3. A separate Model/Business layer (the business layer uses simple POJOs following JavaBean rules, which allows dependency injection with Spring);
  4. A separate Data layer (through DAOs and the Hibernate ORM);
  5. An integration and configuration layer gluing all layers together through the Spring bean application context file;
  6. And finally a simple servlet/JSP container (e.g. Tomcat) to host and serve the deployed application.
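As a minimal sketch of points 3 to 5 (the AccountService, AccountDao and Account names are purely illustrative, not from the actual project):

// Business layer: a plain POJO following JavaBean rules, so Spring can
// inject its DAO dependency from the application context.
public class AccountService {

    private AccountDao accountDao;   // hypothetical data-layer interface

    // JavaBean-style setter used by Spring for dependency injection
    public void setAccountDao(AccountDao accountDao) {
        this.accountDao = accountDao;
    }

    public void openAccount(String owner) {
        accountDao.save(new Account(owner));
    }
}

/*
  Corresponding glue in the Spring bean application context file (step 5):

  <bean id="accountDao" class="example.HibernateAccountDao">
    <property name="sessionFactory" ref="sessionFactory"/>
  </bean>

  <bean id="accountService" class="example.AccountService">
    <property name="accountDao" ref="accountDao"/>
  </bean>
*/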

As a good advocate of open source, I put my principles into practice by making such an application available to anyone interested; just contact me by email and I'll send you a copy of the project.


Martin

Saturday, April 15, 2006

JDMAPI: some info

Here is some information I've gathered by looking at the API (JDMAPI spec version 1.0) and at material found here and there:

1- Mining Function

First, the features that are probably central to the API are called mining functions. These functions relate to the objective one wishes to achieve and operate on individual records or cases (a collection of related attributes belonging to one entity or transaction, used as input for building the data mining model, e.g. a customer, a purchase, etc.).

They can also be categorized by whether they are supervised (the model is trained and tested using a known target attribute value) or unsupervised (no target variable is used).

Looking at the Enum type called MiningFunction, we find these supervised functions:

  • Classification (predicts a target attribute of categorical type)
  • Regression (predicts a target attribute of numerical, i.e. continuous, type)
  • *Supervised Multi-Target (for models predicting multiple target attributes at once)


For unsupervised functions, we have:
  • Clustering (associates each record with the natural cluster it is "closest" to)
  • Association (discovers hidden interrelationships or correlations among the variables)
  • *Anomaly Detection (identifies rare and unexpected cases)
  • *Time Series (understands the pattern of, and forecasts, a time-ordered series of cases)
  • *Feature Extraction (projects the set of all attributes into a much smaller set of features, useful for visualization, that captures important characteristics of the data)



There is also a function applicable to both supervised and unsupervised settings, called Attribute Importance, which helps reduce the number of attributes and the complexity of the model to build. This function identifies the most relevant attributes and reduces noise when building a mining model.


2- Mining Algorithm

To build a mining model for each of these functions, we need to apply specific algorithms. Checking the Enum class called MiningAlgorithm, we find:

  • Decision Tree
  • Feed Forward Neural Net
  • kMeans (a k-means clustering algo)
  • Naive Bayes
  • SVM Classification (for classification)
  • SVM Regression (for regression)
  • *Arima (for time series)
  • *Arma (for time series)
  • *Auto Regression (for time series)
  • *NMF (Non-negative Matrix Factorization algorithm for feature extraction)



3- Mining Task

The API includes a definite set of tasks used to construct the various mining objects (see the Mining Objects section below). These tasks, defined by the Enum class 'MiningTask', are given next:


  • buildTask (constructs a mining model, see mining functions)
  • testTask (validates a mining model on an independent test dataset, supervised functions only)
  • applyTask (applies a mining model to a new data source)
  • computeStatisticsTask (computes basic statistics of the attributes from the source physical data)
  • importTask/exportTask (for interacting with external applications/frameworks)



4- Mining Objects

Typically, when one submits a mining task to the DME (data mining engine), this generates some persistent objects called NamedObjects. These objects are normally stored in the mining repository and can be saved and restored with the API:

  • buildSetting (used to specify the model to be built, i.e. the mining function, the source data/attributes, the algorithm to be used with its settings, etc.)
  • testMetrics (used to hold a test result)
  • applySetting (used to define the data to which a model is to be applied)
  • physicalDataSet (a pointer to the original data used for build, test and apply, e.g. a database table or a file)
  • logicalData (optionally describes the physical data set in order to change attribute names and their types)
  • model (used to store the resulting mining model)
  • task (used to refer to existing tasks and their status)
  • taxonomy (used to define a hierarchical grouping of categorical attribute values)
  • costMatrix (a matrix used with classification to associate a cost with actual versus predicted values)


Note: * refers to release 2.0 functions.

Martin