Wednesday, May 08, 2013

NoSQL: how disruptive for BI?


It seems the "NoSQL" term is now seen and used everywhere.  This post discusses this technology and its impact on the BI world.


Definition

Trying to find a clear, consensus definition is always a challenge with over-hyped buzzwords used more for marketing advantage than anything else.  Recently, the well-respected authors Martin Fowler and Pramod Sadalage wrote a clear and informative book covering the full spectrum of NoSQL technologies, called "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence".

According to the authors, no well-accepted definition of NoSQL exists; the original name from the NoSQL Meetup was "open source distributed nonrelational databases".  Rather than focusing on a definition, the authors concentrate on a few common traits shared by NoSQL databases:

  • Often run on clusters of commodity hardware
  • Support large or extremely large data volumes on clusters
  • The query language is not necessarily SQL (usually a close derivative is used)
  • Usually open source and prominent in web-based application ecosystems
  • Consistency is not enforced through relational ACID principles (an effect of running on independent cluster nodes)
  • Schema-less, i.e. no static data model structure is mandatory

What about cluster-based solutions proposed by relational vendors?
The commodity cluster model leads to a much cheaper and more scalable solution (scaling out with many small nodes) than the scale-up alternative (an ever-growing, monolithic and expensive server).  The cluster versions of relational databases (e.g. Oracle RAC or Microsoft SQL Server) are not designed to run on multiple independent nodes, as they actually rely on a common, shared disk subsystem.   Their model yields fast throughput, with a cluster-aware filesystem writing to a single highly available disk subsystem.   This single point of failure is avoided in a robust commodity cluster, where a server node may go down (both server hardware internals and disk) without impacting the cluster's ongoing execution (full failover and high availability are guaranteed).
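
As a toy illustration of the shared-nothing approach (not tied to any particular product), here is a minimal Python sketch of consistent hashing, the kind of key-partitioning scheme that stores such as Riak or Cassandra use to spread data across independent commodity nodes; the node names and virtual-node count are made up:

```python
import hashlib
from bisect import bisect_right

# Toy consistent-hash ring: each commodity node owns a slice of the key space,
# so losing or adding a node only remaps that slice (no shared disk involved).
class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise on the ring to the first virtual node at or after the key's hash.
        idx = bisect_right(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))   # each key deterministically lands on one node
```

Because each node owns only its slice of the key space, losing one node remaps only that slice to the survivors instead of taking the whole cluster down.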


NoSQL databases: metamodel classification

The authors propose a neat and convenient way to classify current NoSQL databases according to how they store data, i.e. the metamodel used.   Where the relational metamodel is based on relations (tables) and tuples (rows), current NoSQL databases rather use one of the following forms (a small sketch illustrating them follows the list):

      1. Key-value: Data is stored at the aggregate* level in an "opaque" form
        • Each aggregate has a key or ID to get at the data
        • Each aggregate is opaque, as data is stored as an unstructured blob
        • (+) can store whatever we like, with only a constraint on size
        • (-) less flexible in querying & updating (cannot be done on part of a value)
      2. Document: Data is stored at the aggregate level in a visible and explicit semi-structured form (typically JSON)
        • Each aggregate may have a key or ID to get at the data, but not necessarily, as queries can be done on any field within the aggregate
        • Each aggregate is stored within a defined structure imposing a constraint on what data is allowed
        • (+) more flexible in querying (part of a value can be queried or updated)
      3. Columnar: Data is stored as a two-level map, with columns grouped into column families spanning all rows
        • Data is identified by a row key followed by a column name, normally part of a column family
        • Query sequence pattern:  Row-key -->  Column-key --> "values"
        • (+) optimal for read-oriented applications (e.g. BI)
        • (-) not ideal for write-oriented applications
      4. Graph: A special kind of database using a graph metamodel (i.e. nodes & edges) to store records with complex interconnections
        • Each entity or node can be queried in relation to its interconnected nodes, typical of social networks, product preferences, etc.
        • (+) querying is optimal compared to the relational model, which requires FK relationships and expensive join operations
        • (-) the focus on querying results in more complex & less performant data insertion
        • (-) less suited to storage across distributed cluster nodes

*The term aggregate is used to refer to a data unit more complex than just a list of values or tuples.  Typically, the data unit consists of a complex set of records with associations and nested structure, all of which must be manipulated atomically (e.g. think of one customer with all related addresses, or an order with all its item details, payment info and customer attached to it).
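
To make these metamodels more concrete, here is a rough sketch using nothing but plain Python structures (no real NoSQL API is involved; the "order" aggregate, its fields and the key names are made up for illustration):

```python
import json

# One "aggregate": an order with nested items and customer info, handled as a unit.
order = {
    "order_id": "o-1001",
    "customer": {"id": "c-42", "name": "Alice", "city": "Montreal"},
    "items": [
        {"sku": "bk-17", "qty": 2, "price": 25.0},
        {"sku": "dvd-3", "qty": 1, "price": 15.0},
    ],
    "payment": {"type": "visa", "amount": 65.0},
}

# 1. Key-value view: the aggregate is an opaque blob behind a key.
#    The store cannot query inside it; we can only get/put the whole value.
kv_store = {"order:o-1001": json.dumps(order).encode()}
blob = kv_store["order:o-1001"]                      # whole aggregate or nothing

# 2. Document view: the aggregate keeps a visible structure (JSON-like),
#    so queries can reach inside it (here, a naive full scan).
doc_store = {"o-1001": order}
montreal_orders = [d for d in doc_store.values()
                   if d["customer"]["city"] == "Montreal"]

# 3. Column-family view: a two-level map following the
#    "Row-key --> Column-key --> values" pattern above.
column_store = {
    "o-1001": {                                      # row key
        "customer:name": "Alice",                    # column family "customer"
        "customer:city": "Montreal",
        "payment:amount": 65.0,                      # column family "payment"
    }
}
amount = column_store["o-1001"]["payment:amount"]

# 4. Graph view: nodes and edges, so relationships are first-class.
nodes = {"c-42": {"label": "Customer"}, "o-1001": {"label": "Order"}}
edges = [("c-42", "PLACED", "o-1001")]
```

The contrast between cases 1 and 2 is the essence of the key-value vs. document trade-off: the first can only fetch or replace the whole blob, while the second can filter and update on fields inside the aggregate.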


Technologies

Looking at the impressive number of open-source implementations available (the list below names just a few out of the 150 databases currently listed here), we get an indication of the potential growth of this technology, but also an indication of its maturity level... only a few important implementations will likely survive!

Many NoSQL data stores do not have a clear-cut storage metamodel fitting exactly one category of M. Fowler et al.'s taxonomy.  So, depending on interpretation, some stores may appear in different categories; please refer to online NoSQL references for other interpretations and up-to-date data on this active field.

    1. Key-value Store: 
      • Redis
      • Voldemort
      • Oracle BDB 
      • Amazon SimpleDB 
      • Riak
    2. Document Store:
      • Apache CouchDB
      • MongoDB
      • BaseX
    3. Columnar or Column store: 
      • Cassandra
      • HBase (Hadoop)
      • HyperTable 
      • Apache Accumulo
      • MonetDB
    4. Graph Database: 
      • Neo4J
      • InfoGrid
      • Infinite Graph



BI perspective

I consider graph databases and column stores to have the greatest impact on BI and the potential to be disruptive to the current BI market.

For column stores, this is easily understandable, as BI is largely about scanning and other read-only data access, where often just a few fields are fetched out of a much wider, flat record (see the sketch after the list below).  As a testimony to this, we can look at the number of commercial vendors purchasing existing solutions and/or integrating column stores into their own platform:

    • Sybase IQ, now part of SAP, probably one of the original pioneers of column stores on the commercial scene
    • Vertica (purchased by HP), the 5-year-old vendor specialized in and fully dedicated to column stores
    • Greenplum (purchased by EMC) with their "polymorphic" data storage
    • ParAccel (very recently purchased by Actian), which combines a column store with an in-memory engine
    • Oracle, with a twist they refer to as Hybrid Columnar Compression (rows still being used)
    • Microsoft SQL Server, which proposes a compromise approach called the "columnstore index", where data updates are no longer permitted!
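
To illustrate why this matters for BI, here is a toy Python comparison (synthetic data, no real column-store engine) showing that a columnar layout lets an aggregation touch only the column it needs, while a row layout drags every wide record through memory:

```python
import random

N = 100_000  # synthetic fact rows

# Row-oriented layout: every record carries all of its (wide) fields.
rows = [{"order_id": i, "region": "NA", "product": "bk-17",
         "qty": random.randint(1, 5), "amount": random.random() * 100,
         "comment": "x" * 200}              # wide, rarely-queried field
        for i in range(N)]

# Column-oriented layout: one array per field, so a scan touches only what it needs.
columns = {
    "amount": [r["amount"] for r in rows],
    "qty":    [r["qty"] for r in rows],
}

# BI-style aggregation: "total sales amount".
total_from_rows = sum(r["amount"] for r in rows)   # reads every wide record
total_from_cols = sum(columns["amount"])           # reads a single column
assert abs(total_from_rows - total_from_cols) < 1e-6
```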

As for graph databases, I think great opportunities lie in mining the relationships and interactions contained, explicitly or implicitly, within graph content.  A lot of research is currently being done on providing BI-style analysis against graph-based data.  Here are two great examples:  http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/beweb2012/a3-bleco.pdf and http://perso.ecp.fr/~aufaurema/Graphs%20for%20BI%20(printable).pdf
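
As a hint of what BI-style analysis on a graph store can look like, here is a minimal sketch assuming a local Neo4j instance, the official neo4j Python driver, and a made-up (:Customer)-[:BOUGHT]->(:Product) data model; the connection details and labels are illustrative only:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver (assumed installed)

# Connection details and the (:Customer)-[:BOUGHT]->(:Product) model are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# BI-style aggregation expressed as a graph traversal: top product categories
# by number of purchases, with no foreign keys or join tables involved.
query = """
MATCH (c:Customer)-[:BOUGHT]->(p:Product)
RETURN p.category AS category, count(*) AS purchases
ORDER BY purchases DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["category"], record["purchases"])

driver.close()
```

The aggregation is computed directly over the relationships themselves, which is exactly the kind of analysis the two papers above explore.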


Martin