One important area of concern in terms of database flexibility is the data model structures used for integrating new content. This post will look at two completely different strategies used for building model structures that can easily accommodate change (at least compared to the typical 3rd normal form but not so for the typical noSQL form). These are :
In some way, these two structures are opposed to each other in their way to provide greater flexibility: EAV ingests new data content through its generic model, while AM uses highly normalised model for ingesting new content to adapt and grow incrementally without . Note the term content here implies the modification of data structure and adding new entity and not simply the addition of new record, although this distinction is blurred for the case of EAV.
I'll discuss the pros/cons of these two here.
The Entity Attribute Value structure uses an open schema to store Parameter/Value pair. This provides extreme flexibility in dealing with evolving scope of entities and parameters (here Parameter=Attribute).
This structure is good for : 1) high Sparsity Attribute values; 2) Attributes highly variable and unknown.
Pros:
- Fully evolutive: Data model and API layer can accommodate addition of any new Parameter type with some magic use of metadata (only metadata must be changed)
- Efficient storage for highly sparse data
- Seem common in neuroinformatics world (ref to metadata standard: odML)
Cons:
- EAV is more of an exception or anti-pattern in data modelling and is applicable for a few limited use cases
- Require a rich metadata infrastructure: data-related business rules no longer enforced by relational engine but by its metadata layer
- Higher complexity in data request, ad-hoc query (especially those cross-referencing multiple Attributes), data bulk extraction etc..
Implementation Impact:
- often need a dedicated layer for managing data query complexity (e.g. ElasticSearch)
- a no-SQL backend maybe more adapted (except for the metadata) unless there are clear requirements for concurrent transactional integrity and some form of isolation level
- When Historisation is a requirement, need to enrich EAV to support temporal attributes, and thus increasing querying complexity further. There does not seem to be any temporal field inside the EAV instance table of its typical reference implementation (see ycmi as an example)
2) AM structure
AM favors strong (and fine-grained) typing of database schema while reducing a few symptoms common to rigid relational data model. This is achieved by using higher normalisation structure (6NF).
The AM structure is good for 1) integration of unforeseen and evolutive data model piece-wise, 2) offering clear and semantic-rich schema.
Pros:
- Allows for data model evolution through organic extension without revisiting existing model structure
- Historisation of all attributes is built-in without any compromise
- Not penalised by highly sparse data (NULL never needed)
- Produce models that are expressive and clear while carrying insights to the particular domain
Cons:
- Require extension of both data model and API layer with each new data model type
- May suffer from table explosion (mitigated with a virtual access-layer)
- Need special Query execute plan optimisation (table elimination)
Implementation Impact:
- need DB engine with necessary optimisation (seem PostgreSQL meets some requirement, but should be further validated)
- assume entities have strong identity with available candidate for natural keys
- could require applying some form of generalisation to be less sensitive to addition of parameter (equivalent of ETL pattern when working in PULL mode)
- Data modeller needs to know enough insight and have good Data Domain knowledge (data model end-result actually looks similar to domain ontologies)
-----------
Data Model example
Here I try to model the scope that was presented to me at work. However, I have skipped a few details and generalise the concept furthermore,
In AM we should correctly separate Entity having clear identity (the Hub , shown in blue) from their attributes/descriptive/measures (the Attribute, shown in red), and identify relationship/association /transaction (the Tie shown in yellow). The important thing to retain is that AM offer end-result model that are expressive and provide insights to the particular domain. This is more easily done with one ore more SME.
In AM we should correctly separate Entity having clear identity (the Hub , shown in blue) from their attributes/descriptive/measures (the Attribute, shown in red), and identify relationship/association /transaction (the Tie shown in yellow). The important thing to retain is that AM offer end-result model that are expressive and provide insights to the particular domain. This is more easily done with one ore more SME.
An example of a Anchor Data model |
Note: this does not follow a strict AM form : Tie have attributes, naming convention not follow, more than one Attribute per table, audit metadata inside primitive, etc...
For the EAV, I used one reference implementation:
http://ycmi.med.yale.edu/nadkarni/EAV_CR_frame.htm. I've added historisation capability at the attribute/value pair (EAV tables) as well as adding explicit common attributes to all Entities type. Here, I've added natural-key for lookup and update existing entities (oddly enough there was none in the EAV ref, I guess these are buried inside one EAV attribute?).
This is inspired from the concepts of odML and its metadata relationship: Section →0..n Property →1..m Value. Section is used for grouping entities of same type and associated with a set of Property relevant to these entities. These Section/Property/Value will grow as we add new data type into the PARAM_SPACE:
An example of a EAV data model |
Martin
No comments:
Post a Comment