Wednesday, May 08, 2013

NoSQL: how disruptive for BI?


It seems the "NoSQL" term is now seen and used everywhere.  This post discusses this technology and its impact on the BI world.


Definition

Trying to find a clear, consensus definition is always a challenge with over-hyped buzzwords used more for marketing advantage than anything else.  Recently, the well-respected authors Martin Fowler and Pramod Sadalage wrote a clear and informative book covering the full spectrum of NoSQL technologies, called "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence".

According to the authors, no well-accepted definition of NoSQL exists; the original name from the NoSQL Meetup was "open source distributed nonrelational databases".  Rather than focusing on a definition, the authors concentrate on a few common traits shared by NoSQL databases:

  • Often run on clusters of commodity hardware
  • Support large or extremely large data volumes on clusters
  • The query language is not necessarily SQL (usually a close derivative is used)
  • Usually open source and prominent in web-based application ecosystems
  • Consistency is not enforced through relational ACID principles (an effect of running on independent cluster nodes)
  • Schema-less, i.e. no static data model structure is mandatory

What about cluster-based solutions proposed by relational vendors?
The commodity cluster model leads to a much cheaper and more scalable solution (scaling out with many small nodes) than the scale-up alternative (an ever-growing, monolithic and expensive server).  The cluster versions of relational databases (e.g. Oracle RAC or Microsoft SQL Server) are not designed to run on multiple independent nodes, as they actually rely on a common, shared disk subsystem.   Their model yields fast throughput, with a cluster-aware filesystem writing to a single highly available disk subsystem.   This single point of failure is avoided in a robust commodity cluster, where a server node may go down (both server hardware internals and disk) without impacting the cluster's ongoing execution (full failover and high availability are guaranteed).
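
As a toy illustration of the shared-nothing approach (not tied to any particular product), here is a minimal Python sketch of consistent hashing, the kind of key-partitioning scheme that stores such as Riak or Cassandra use to spread data across independent commodity nodes; the node names and virtual-node count are made up:

```python
import hashlib
from bisect import bisect_right

# Toy consistent-hash ring: each commodity node owns a slice of the key space,
# so losing or adding a node only remaps that slice (no shared disk involved).
class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise on the ring to the first virtual node at or after the key's hash.
        idx = bisect_right(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))   # each key deterministically lands on one node
```

Because each node owns only its slice of the key space, losing one node remaps only that slice to the survivors instead of taking the whole cluster down.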


NoSQL databases: metamodel classification

The authors propose a neat and convenient way to classify current NoSQL databases according to how they store data, i.e. the metamodel used.   Where the relational metamodel is based on relations (tables) and tuples (rows), current NoSQL databases rather use one of the following forms (a small sketch illustrating them follows the list):

      1. Key-value: Data is stored at the aggregate* level in an "opaque" form
        • Each aggregate has a key or ID to get at the data
        • Each aggregate is opaque, as data is stored as an unstructured blob
        • (+) can store whatever we like, with only a constraint on size
        • (-) less flexible in querying & updating (cannot be done on part of a value)
      2. Document: Data is stored at the aggregate level in a visible and explicit semi-structured form (typically JSON)
        • Each aggregate may have a key or ID to get at the data, but not necessarily, as queries can be done on any field within the aggregate
        • Each aggregate is stored within a defined structure imposing a constraint on what data is allowed
        • (+) more flexible in querying (part of a value can be queried or updated)
      3. Columnar: Data is stored as a two-level map, with columns grouped into column families spanning all rows
        • Data is identified by a row key followed by a column name, normally part of a column family
        • Query sequence pattern:  Row-key -->  Column-key --> "values"
        • (+) optimal for read-oriented applications (e.g. BI)
        • (-) not ideal for write-oriented applications
      4. Graph: A special kind of database using a graph metamodel (i.e. nodes & edges) to store records with complex interconnections
        • Each entity or node can be queried in relation to its interconnected nodes, typical of social networks, product preferences, etc.
        • (+) querying is optimal compared to the relational model, which requires FK relationships and expensive join operations
        • (-) the focus on querying results in more complex & less performant data insertion
        • (-) less suited to storage across distributed cluster nodes

*The term aggregate is used to refer to a data unit more complex than just a list of values or tuples.  Typically, the data unit consists of a complex set of records with associations and nested structure, all of which must be manipulated atomically (e.g. think of one customer with all related addresses, or an order with all its item details, payment info and customer attached to it).
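
To make these metamodels more concrete, here is a rough sketch using nothing but plain Python structures (no real NoSQL API is involved; the "order" aggregate, its fields and the key names are made up for illustration):

```python
import json

# One "aggregate": an order with nested items and customer info, handled as a unit.
order = {
    "order_id": "o-1001",
    "customer": {"id": "c-42", "name": "Alice", "city": "Montreal"},
    "items": [
        {"sku": "bk-17", "qty": 2, "price": 25.0},
        {"sku": "dvd-3", "qty": 1, "price": 15.0},
    ],
    "payment": {"type": "visa", "amount": 65.0},
}

# 1. Key-value view: the aggregate is an opaque blob behind a key.
#    The store cannot query inside it; we can only get/put the whole value.
kv_store = {"order:o-1001": json.dumps(order).encode()}
blob = kv_store["order:o-1001"]                      # whole aggregate or nothing

# 2. Document view: the aggregate keeps a visible structure (JSON-like),
#    so queries can reach inside it (here, a naive full scan).
doc_store = {"o-1001": order}
montreal_orders = [d for d in doc_store.values()
                   if d["customer"]["city"] == "Montreal"]

# 3. Column-family view: a two-level map following the
#    "Row-key --> Column-key --> values" pattern above.
column_store = {
    "o-1001": {                                      # row key
        "customer:name": "Alice",                    # column family "customer"
        "customer:city": "Montreal",
        "payment:amount": 65.0,                      # column family "payment"
    }
}
amount = column_store["o-1001"]["payment:amount"]

# 4. Graph view: nodes and edges, so relationships are first-class.
nodes = {"c-42": {"label": "Customer"}, "o-1001": {"label": "Order"}}
edges = [("c-42", "PLACED", "o-1001")]
```

The contrast between cases 1 and 2 is the essence of the key-value vs. document trade-off: the first can only fetch or replace the whole blob, while the second can filter and update on fields inside the aggregate.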


Technologies

Looking at the impressive number of open-source implementations available (the list below names just a few out of the 150 databases currently listed here), we get an indication of the potential growth of this technology, but also an indication of its maturity level... only a few important implementations will likely survive!

Many NoSQL data stores do not have a clear-cut storage metamodel fitting exactly one category of M. Fowler et al.'s taxonomy.  So, depending on interpretation, some stores may appear in different categories; please refer to online NoSQL references for other interpretations and up-to-date data on this active field.

    1. Key-value Store: 
      • Redis
      • Voldemort
      • Oracle BDB 
      • Amazon SimpleDB 
      • Riak
    2. Document Store:
      • Apache CouchDB
      • MongoDB
      • BaseX
    3. Columnar or Column store: 
      • Cassandra
      • HBase (Hadoop)
      • HyperTable 
      • Apache Accumulo
      • MonetDB
    4. Graph Database: 
      • Neo4J
      • InfoGrid
      • Infinite Graph



BI perspective

I consider graph databases and column stores to have the greatest impact on BI and the potential to be disruptive to the current BI market.

For column stores, this is easily understandable, as BI is largely about scanning and other read-only data access, where often just a few fields are fetched out of a much wider, flat record (see the sketch after the list below).  As a testimony to this, we can look at the number of commercial vendors purchasing existing solutions and/or integrating column stores into their own platform:

    • Sybase IQ, now part of SAP, probably one of the original pioneers of column stores on the commercial scene
    • Vertica (purchased by HP), the 5-year-old vendor specialized in and fully dedicated to column stores
    • Greenplum (purchased by EMC) with their "polymorphic" data storage
    • ParAccel (very recently purchased by Actian), which combines a column store with an in-memory engine
    • Oracle, with a twist they refer to as Hybrid Columnar Compression (rows still being used)
    • Microsoft SQL Server, which proposes a compromise approach called the "columnstore index", where data updates are no longer permitted!
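
To illustrate why this matters for BI, here is a toy Python comparison (synthetic data, no real column-store engine) showing that a columnar layout lets an aggregation touch only the column it needs, while a row layout drags every wide record through memory:

```python
import random

N = 100_000  # synthetic fact rows

# Row-oriented layout: every record carries all of its (wide) fields.
rows = [{"order_id": i, "region": "NA", "product": "bk-17",
         "qty": random.randint(1, 5), "amount": random.random() * 100,
         "comment": "x" * 200}              # wide, rarely-queried field
        for i in range(N)]

# Column-oriented layout: one array per field, so a scan touches only what it needs.
columns = {
    "amount": [r["amount"] for r in rows],
    "qty":    [r["qty"] for r in rows],
}

# BI-style aggregation: "total sales amount".
total_from_rows = sum(r["amount"] for r in rows)   # reads every wide record
total_from_cols = sum(columns["amount"])           # reads a single column
assert abs(total_from_rows - total_from_cols) < 1e-6
```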

As for graph databases, I think great opportunities lie in mining the relationships and interactions contained, explicitly or implicitly, within graph content.  A lot of research is currently being done on providing BI-style analysis against graph-based data.  Here are two great examples:  http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/beweb2012/a3-bleco.pdf and http://perso.ecp.fr/~aufaurema/Graphs%20for%20BI%20(printable).pdf
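
As a hint of what BI-style analysis on a graph store can look like, here is a minimal sketch assuming a local Neo4j instance, the official neo4j Python driver, and a made-up (:Customer)-[:BOUGHT]->(:Product) data model; the connection details and labels are illustrative only:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver (assumed installed)

# Connection details and the (:Customer)-[:BOUGHT]->(:Product) model are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# BI-style aggregation expressed as a graph traversal: top product categories
# by number of purchases, with no foreign keys or join tables involved.
query = """
MATCH (c:Customer)-[:BOUGHT]->(p:Product)
RETURN p.category AS category, count(*) AS purchases
ORDER BY purchases DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["category"], record["purchases"])

driver.close()
```

The aggregation is computed directly over the relationships themselves, which is exactly the kind of analysis the two papers above explore.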


Martin