Wednesday, February 24, 2016

Neo4J, the Graph database

At work, we are confronted with the implementation of a database involving rich and densely interconnected data (.. more relationships than entities) and involving complex recursive query to run.  We thought it could be a good opportunity to test out Graph database.

In a nutshell, Graph database are a good candidate with:
  • highly connected dataset
  • data access through a multi-steps traversals (finding stuff in a network/hierarchy of nodes)
  • data access that is both flexible and unknown (path is changed according to nodes encounter in the way)
  • ultimate flexibility with thousands of classes (node type) having different properties with new classes/properties defined everyday!
We decided to go with Neo4J which seems to be the most stable implementation having the longest track record.  The following are notes that are relevant to Neo4J but some general advice are probably applicable to other implementations. 

I've taken some notes that I share here... (I've also worked more on the Cypher language and will try to post my Python notebook experimentation later).

1) Nodes

These are our entities characterised by attributes, called properties.  It is schema-less, so entity may have different properties. Things to avoid:

  • Nodes with rich properties.  Ex. Country node type with language spoken, currency, etc..  (it's a smell indicating these properties should probably correspond to Nodes) 
  • Nodes should be fine-grained, so avoid nodes representing multiple concept.
2) Relationship

The relationship are for access.  Access using global index is not typical, we do traversing instead using relationship are the primary means for nodes access.
  • Properties on relationship that represent entity (frequent with relationship like "WorkFor", "GrantedTo"..)
  • Used to access neighbouring nodes starting from one nodes and traverse many using the Relationship type and properties.
3) Model

The data model is fun and easy to do simply with whiteboard drawing: you model by Examples and not by Class! So quite suitable to do with business and used a communication tool.
The model is typically normalised and any properties appearing in multiple places should probably be nodes.


4) Anti-Patterns

- Avoid hefty nodes: User nodes with many properties like names, email, address, job, ... a lot of times, these correspond to nodes.

- missing nodes: John emailed Tim, make it 3 nodes with email a node also, and not a relationship between node John and Tim.

- Avoid hot nodes: nodes having a very large number of incoming relationships.


5) Graph Querying

2 approaches are possible for querying a Graph DB:
  1. Graph traversing
    • Start off with some nodes (the roots)
    • Spread out from these according to given path
    • Yield some side-effect in the process (intermediate computation, ..)
  2. Graph pattern matching 
    • Provide an abstract (sub)graph with any number of defined instances
    • Graph engine determines all subgraphs of the larger one fitting the pattern

The first category (imperative language) is used by generic RDF language like SPARQL and Gremlin (this one can support the two). The second one (declarative language like SQL) and is used by language like Cypher.

6) Loading Data 

There are a few ways we can load data into Neo4J: 1) transactional (live update through its REST-API), 2) batch (for high volume like initial load), 3) Cypher (similar to suing DML commands in SQL), or 4) through some API based on Java or other language.

Various tools can also be used for data loading, whether they are XLS-based, Neo4J-shell based, CLI-based (command line programs like batch importer), Neo4J browser application, or ETL-based.


Martin 






No comments: