Friday, September 29, 2006

I hate UI-type development

Why do I hate having to deal with UI-type development? I don't know exactly, but I have some hints:

I simply suck at it!

Although I enjoy using an intuitive UI and appreciate the design value behind it, I have neither the patience nor the talent to build one! In my view, the time and effort spent designing a nice HTML/JSP page or its rich-client equivalent (with SWT, for example) is too frustrating for the end result. I sometimes have to do it when delivering an end-to-end product for clients, and typically most of my time gets eaten up by this UI work! I guess I could outsource all of it; I actually tried it once... but finding a good designer willing to work inside JSP pages is another challenge on its own!!!

When I first did some RCP work in Eclipse, I appreciated all the advanced design patterns available in libraries such as JFace, but I soon got bored and tired again of dealing with all the widget-level details. I'm hopeless.

I guess I'll stick to creating the domain business layer, the service layer, the data access layer, and other non-visual features!

Martin

Tuesday, September 19, 2006

Paragliding

I had the chance to try paragliding during my last vacation and I loved the sensations!!! It confirms my attraction to sports that exploit the dynamics of the wind: I love the magic that happens when you are carried by the force of the wind... and the buildup of speed despite the absence of any engine noise.

I like windsurfing but find its usable window far too narrow (at least where I live right now)... which is why I always insist on having access to it during family trips down south. The last time (in Punta Cana), however, was rather disappointing: the days with good wind came with a ban on going out, and on the days the ban was lifted the small sails were barely enough to get past the waves, so it was impossible to do anything. A kind of windsurfing catch-22!

Paragliding seems better suited to varying conditions because you can exploit two types of lift: dynamic and thermal. The latter allows a significant gain in altitude, while the former is exploited along cliffs or mountain faces! The site where I practiced was quite good, though far from alpine altitudes, but it offers launch points facing all four cardinal directions, which pretty much guarantees a takeoff (except on stormy or overly windy days). For those interested:


Here is a short clip of the end of my tandem descent... just after doing a few good 360-degree spins, pulling I don't know how many Gs, enough to feel my brain sink right down to my feet! Yes, the famous centrifugal force (or would that be centripetal... my high school physics classes are somewhat buried under the mass of experience that keeps expanding with time!) plays a big role in that kind of maneuver...




Martin

Tuesday, September 12, 2006

Unit Testing but...

I really appreciate developing with a unit testing approach, and as such I always have a JUnit library somewhere in my classpath while building an application. This really boosts my confidence in my code and allows me to refactor at ease without worrying about breaking existing functionality.
However, there are a few stricter recommendations commonly found among unit testing fanatics or extreme programming advocates that I find, to say the least, debatable:


  1. your unit test should focus on a single method of a single class; otherwise it is not really a unit test.
  2. always write your unit test first, and only then your application class.

Point 1 emphasizes the term unit, and I agree that violating it makes your unit test more like an integration test. But in my view these broader tests tend to be more meaningful and practical.

First of all, methods should be small, precise, have a single clear responsibility, and carry a descriptive name that conveys their purpose. As such, I tend to agree with recommendations that limit the size of a single method (R. Johnson gives a ballpark figure of 30-40 lines of code including all comments, while J. Kerievsky goes as far as recommending ten lines of code or fewer, with the majority using one to five lines). Keeping methods small and giving them intuitive names produces much easier, self-documenting code: I like this idea since it helps reduce the need to document your code!
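To illustrate (a made-up example, not from any real project): every method below is a handful of lines and its name says what it does, so a dedicated test per method would add very little.

// Hypothetical example: short, intention-revealing methods whose names
// document what they do, so each one stays trivial to read.
import java.math.BigDecimal;
import java.util.List;

public class Invoice {

    private final List<BigDecimal> itemPrices;
    private final BigDecimal taxRate;

    public Invoice(List<BigDecimal> itemPrices, BigDecimal taxRate) {
        this.itemPrices = itemPrices;
        this.taxRate = taxRate;
    }

    public BigDecimal totalWithTax() {
        return subtotal().add(taxAmount());
    }

    private BigDecimal subtotal() {
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal price : itemPrices) {
            sum = sum.add(price);
        }
        return sum;
    }

    private BigDecimal taxAmount() {
        return subtotal().multiply(taxRate);
    }
}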

This is why I feel that principle 1 above works against the "write short methods" approach, since a small method does not contain enough complex logic to require a dedicated unit test of its own.


A JUnit class that tests and validates the effect every single method produces on the state of the current object or on some other dependents (through mock objects) is often trivial and thus overkill! Also, a large number of methods may not deserve a full dedicated test, since not only is their logic simple but their impact on state is also minimal.

That's why I twist my unit tests a bit to make them more like integration tests, i.e. I test only the important methods of the class in relation to their impact on the object itself and on its dependencies (an external library, another piece of my code...). OK, this is not always possible, especially when the dependency is a costly, resource-intensive component (then I'll use mocks for such cases), but in very frequent usage this lets me validate and better understand the external library during my tests while also testing my code against its dependencies. I even find myself doing such integration tests with code at the service layer level (above the DAO layer) and validating its effect at the database tier. Using a small in-memory database engine such as HSQLDB helps negate the performance penalty of doing this.
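Here is a minimal sketch of what such a service-level test can look like with JUnit 3 and an in-memory HSQLDB database; OrderService, JdbcOrderDao and the orders table are hypothetical names used only for illustration.

// A service-level "integration style" test backed by an in-memory HSQLDB
// database. OrderService, JdbcOrderDao and the orders table are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import junit.framework.TestCase;

public class OrderServiceIntegrationTest extends TestCase {

    private Connection connection;
    private OrderService service;

    protected void setUp() throws Exception {
        // In-memory HSQLDB: nothing to install, fast enough for every build.
        Class.forName("org.hsqldb.jdbcDriver");
        connection = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "");

        Statement stmt = connection.createStatement();
        stmt.executeUpdate("CREATE TABLE orders (id INTEGER PRIMARY KEY, total DECIMAL(10,2))");
        stmt.close();

        // Wire the real DAO against the in-memory database instead of a mock.
        service = new OrderService(new JdbcOrderDao(connection));
    }

    protected void tearDown() throws Exception {
        // SHUTDOWN drops the in-memory schema so each test starts clean.
        Statement stmt = connection.createStatement();
        stmt.execute("SHUTDOWN");
        stmt.close();
        connection.close();
    }

    public void testPlaceOrderPersistsARow() throws Exception {
        service.placeOrder(1, 99.50);

        // Validate the effect down at the database tier, not just on the object.
        Statement stmt = connection.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders");
        rs.next();
        assertEquals(1, rs.getInt(1));
        rs.close();
        stmt.close();
    }
}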

As for point 2, I usually adopt more of a concurrent approach, i.e. I draft the application class and, once it stabilizes, create the test class and make the two evolve together. The first few versions of my class/interface are a bit too dynamic and sketchy to really deserve an accompanying test class. So, to limit the need to duplicate my changes in both, I'd rather wait until I'm more comfortable with the class/interface and then proceed with writing test cases.
The only advantage I see in creating the test case first is when I really don't know how my object is going to be used by client code. However, in that case, I'd rather grab a pencil and sketch some use case scenarios beforehand...

Martin

Thursday, September 07, 2006

Handling persistence with ORM

In Java application development, using an Object-Relational Mapper to connect to the database typically offers many advantages:

  • avoiding the need to code against the lower-level JDBC API
  • dealing with the data persistence concern in a more transparent way, better aligned with the object-oriented paradigm
  • providing isolation from database vendor specifics, allowing easy porting to a number of different DB backends
  • providing additional built-in services such as connection pooling, caching, etc.
  • reducing the need to be highly skilled in SQL, although ignoring relational concepts and SQL altogether is definitely not realistic
  • writing less code

On the flip side, I've realized that there are drawbacks as well, such as:

  • providing least-common-denominator functionality to achieve DB neutrality
  • losing control over the SQL statements automatically generated for us
  • some performance degradation, no matter what the tool vendor claims (an ORM will always be one layer on top of JDBC...), although a smart caching strategy can mitigate this
  • requiring additional knowledge of the ORM API (so less code to write, but more library code to understand and make use of)
  • falling short when the application is focused on data reporting and aggregation over large data volumes rather than on transaction-based data entry use cases

On the last project I built using Hibernate, I enjoyed spending more time on the design of a good domain model layer since I spent less on the persistence logic. However, I discovered later, through more realistic usage and data volume tests, that it suffered some nasty performance degradation in specific use cases that were not caught by unit testing (unit testing is concerned with functional correctness, not performance and scaling issues).

Without going into details, the problem had to do with the number of round-trips Hibernate was triggering to fetch an object graph. I had designed some relations (1:1 or N:1) to be eagerly fetched (always fetch the related object) instead of using a lazy fetching strategy (hit the database only when necessary). This was good in some scenarios, since some data dependencies were always needed and it avoided a second database call to get the dependent object's data. However, when fetching a collection, the effect was actually a separate DB call for every single element within the collection: getting a list of N items resulted in N+1 DB calls! Alternative solutions exist, but the recommendation is to model most (if not all) object relations with a lazy strategy and to override this default by specifying a different fetch mode at run-time.
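As a rough sketch of that recommendation with Hibernate 3 (assuming a hypothetical Order class whose many-to-one customer relation is mapped lazy by default), the Criteria API lets you override the fetch mode per query so a whole list loads in one SQL join instead of N+1 selects:

// Hypothetical Order -> Customer (N:1) relation, mapped lazy by default.
// Overriding the fetch mode per query avoids the N+1 select problem
// without forcing eager fetching everywhere.
import java.util.List;

import org.hibernate.Criteria;
import org.hibernate.FetchMode;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class OrderQueries {

    private final SessionFactory sessionFactory;

    public OrderQueries(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Lazy default: orders come back in one select, customers are proxies
    // that trigger one extra select each only if they are actually touched.
    public List findAllOrders() {
        Session session = sessionFactory.openSession();
        try {
            return session.createCriteria(Order.class).list();
        } finally {
            session.close();
        }
    }

    // Run-time override: a single SQL join brings back orders and customers
    // together, so iterating the list never goes back to the database.
    public List findAllOrdersWithCustomers() {
        Session session = sessionFactory.openSession();
        try {
            return session.createCriteria(Order.class)
                          .setFetchMode("customer", FetchMode.JOIN)
                          .list();
        } finally {
            session.close();
        }
    }
}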

Bottom line, there is no silver bullet, especially when it comes to database interaction. You need a good grasp of relational database concepts in order to build applications that talk to a database, no matter what tool or framework you're working with.

Martin

Sunday, September 03, 2006

Data mining background

Data mining has its roots, among other things, in statistics and machine learning. To generalize greatly, people from these two backgrounds tend to view data mining performance very differently: someone with a statistical background will rate performance on statistical significance and inferential power, whereas the computer scientist tends to measure performance on algorithmic efficiency and scalability. However, I realize that the two approaches really are two sides of the same coin, and this is reflected in the most recent scientific literature.

Various definitions of data mining can be found in the literature, but I personally prefer the more academic point of view over the one commonly marketed by vendors. Here is an explanation excerpted from the excellent book « Principles of Data Mining » by David Hand, Heikki Mannila and Padhraic Smyth (MIT Press), which seems to be one of the few data mining books respected inside the statistical community.


It (data mining) is recognized as a set of well-defined procedures that take data as input and produce output in the form of models (a summarized, descriptive form of the data globally) or patterns (a descriptive form of some local phenomenon occurring in a fraction of the data). A well-defined procedure contrasts with a computational method, which, unlike a data mining procedure, is not guaranteed to terminate after a finite number of steps.

Data mining is concerned with building empirical models that are not based on some underlying theory about the mechanism through which the data arose, but rather models consisting of a description of the observed data. In this sense, a good model is qualified as « generative »: data generated according to the model will share the same characteristics as the real data from which the model was built.


They also offer an interesting decomposition of data mining algorithms into orthogonal components, which contrasts with the magical and reductionist view marketed by tool vendors (always around the idea of simply applying a specific algorithm to magically accomplish the task at hand). In essence, a data mining algorithm intends to perform a specific task on a sample or on a complete multivariate dataset. But the task (the first component) is only one of several components that a mining algorithm usually addresses:

  1. Obviously, the data mining task in question: whether it be visualization, prediction (classification or regression), clustering, rule pattern discovery, summarization through a descriptive model, pattern recognition, etc.

  2. The functional form of the model or the structure of the pattern. Examples include linear regression forms, non-linear functions such as those produced by a neural network, a decision tree, a hierarchical clustering model, an association rule, etc. These forms delimit the boundary of what we can expect to approximate or learn.

  3. The score function used to judge the quality of the fitted model summarizing the observed data, or of the pattern characterizing a local structure of the data. This score function is what we try to maximize (or minimize) when we fit parameters to our model. It can be based on goodness-of-fit (how well the model describes the observed data) or on generalization performance, i.e. how well it describes data not yet observed (for prediction purposes). A small illustration follows the list.

  4. The search or optimization method used: the computational procedures and algorithms used for maximizing the score function for a particular model. When the structure is fixed, the search can limit itself to selecting the best parameter values within the k-dimensional parameter space (as in the case of a k-th order polynomial form); otherwise we may first have to select among a set or family of different structures.

  5. The data management technique used for storing, indexing and retrieving the data. This aspect becomes critical when it is time to process massive data sets that rule out relying on main memory alone.
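To make the separation of components concrete, here is a toy sketch (made-up data, nothing from the book) where the score function is kept distinct from the search method that optimizes it:

// A toy illustration of components 3 and 4: the score function (sum of
// squared errors measuring goodness-of-fit of a line y = a*x + b) is kept
// separate from the search method (a naive grid search over the slope a).
public class ScoreFunctionDemo {

    // Component 3: the score we try to minimize for a fixed functional form.
    static double sumOfSquaredErrors(double[] x, double[] y, double a, double b) {
        double score = 0.0;
        for (int i = 0; i < x.length; i++) {
            double residual = y[i] - (a * x[i] + b);
            score += residual * residual;
        }
        return score;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 8.1, 9.8};

        // Component 4: a deliberately crude search over candidate slopes,
        // keeping the intercept fixed at 0 to stay short.
        double bestSlope = 0.0;
        double bestScore = Double.MAX_VALUE;
        for (double a = 0.0; a <= 4.0; a += 0.01) {
            double score = sumOfSquaredErrors(x, y, a, 0.0);
            if (score < bestScore) {
                bestScore = score;
                bestSlope = a;
            }
        }
        System.out.println("best slope: " + bestSlope + ", score: " + bestScore);
    }
}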

The ideal tool would allow you to combine the different components independently of each other during a data mining activity. This level of agility and flexibility is, however, not available in today's tools... which may be justified and reasonable for the optimization and data management components, but much less so for the functional form and the score function.

In practice, tools usually offer pre-packaged algorithms, and you can easily fall into the algorithm trap where you are only expected to apply some well-established mining algorithm to magically accomplish the specific task at hand. This is the typical black-box paradigm that I've learned to despise in data mining (note that black-box abstraction is beneficial overall, especially in the object-oriented programming model).

My curiosity simply forces me to step back and understand what the different mining algorithms actually do to my data. After reviewing some of the literature, I realized that I came across a lot of statistical theory and practice during my studies, but I cannot say the same for machine learning (although I did some cluster analysis for my master's thesis). So, in the spirit of increasing your theoretical grasp of the underlying principles, let me give you a list of the books I highly recommend (in ascending order of statistical theory prerequisites):

Martin

p.s. I will try to summarize some of the principles found in these books that I consider more than useful for any data mining practitioner to have. Although blogs are not the ideal platform for such knowledge sharing, it is the most convenient one I have at hand (at least currently).