Martin Ouellet Notes

Friday, January 10, 2014

What is Anchor Modeling? (part-1)

Anchor Modeling (AM) is another approach well suited for modeling the integrated/consolidated data layer within an enterprise data warehouse. AM was originally created by Olle Regardt with formalisation done by Lars Rönnbäck. It offers an open source database modeling technique accompanied by an online modeling tool available through an MIT license. Most info is maintained around AM home.

The approach has some theoretical background with ideas borrowed from the information modeling world. More specifically it is inspired by a method called Object Role Modeling (ORM) developed by Dr. Terry Halpin. ORM, not to be confused with the Object Relation Mapping, is a modeling notation designed to help non-expert doing conceptual modeling of database schema.

The theoretical formalisation has also helped provide a very strict and prescriptive methodology compared to some more ad-hoc approaches used in the world of data warehouse (often heuristic-based developed with time and experience).

It also offers some implementation ecamples, but at this time, the only complete implementation is ported onto the Microsoft SQL-Server engine.

Comparison to Data Vault

I will not go into details as to what is Anchor Modeling here. Instead I will describe the similarities it shares with Data Vault (DV), since both aims to decompose any data model into smaller and more basic constituents. For those interested in detail description, please refer to this excellent article Anchor modeling - agile information modeling in evolving data environments written by its creators.

1. Anchor ≈ Hub

Both Anchor and Hub store Entities having strong identity. There is an important difference: in AM, an Anchor stands completely on its own, i.e. it only stores the surrogate-key (aka technical key) so keep natural-key (aka business-key) in separate Attribute table.

IMO, this offers additional flexibility especially when natural-key(s) are chosen based on operational systems. These tend to have shorter lifecycle than the EDW whose lifetime should be, at least what EDW team hope for, infinite! By keeping natural-key(s) outside the Hub, we avoid making any assumption as to what are the natural-key(s) and their data type(s) in future. AM propose to use a Natural-key View for handling the loading and look-up logic, and any change in natural keys would involve updating this View and adding the new Attribute table only.

2. Attribute ≈ Satellite

Attribute is similar to the Satellite notion in DV. Both adds contextual and descriptive data which are inherently time-dependent. However to preserve the 6NF temporal integrity, AM enforces that each Attribute lands in dedicated table. DV is more relaxed here and accepts any number of attributes in the same Satellite.

AM supports either temporal Attribute (referred to as historised in AM modeling front-end tool), or static and immutable Attribute where no temporal validity period is required. DV modeling guideline assume historisation by default, and requires to store the full time-segment with two distinct time points: Valid-from and Valid-to. Again the higher-normalised nature of AM restricts us from storing the redundant Valid-to time point (= subsequent Valid-from time point).

The multi-attributes in DV imply that we must proceed with row comparison to know which one has changed between active record and previous one. However, I'd argue that AM rule 1-Attribute = 1-table is too strict, and could be relaxed for some exceptional cases. One example are attributes having strict functional dependency on a master attribute; think of user-friendly text attribute providing descriptive info for technical short code. Other examples are for immutable attributes guaranteed to
remain so in the lifetime of the entity they describe.

3. Tie ≈ Link

Tie is similar to the Link notion in DV whose role is to store the relationship between Anchor/Hubs entities . However Tie does not support over-hanging Attributes the same way Link have their own Satellites. This limitation is due to the fact that historisation is built-in for Tie as with Attribute. Switching-on historisation adds Valid-from time point to the Tie's Primary-key, and consequently any Attributes referring to the Tie would be left dangling (or else requires duplicating the Attribute row in violation of 6NF rule).

The drawback of AM follows that for any natural many-to-many associations having inherent attributes will involve the creation of a “tied anchor”. DV is more flexible here, as it allows attaching Satellites to Link. In DV time-period is not built-in inside the Link structure, so we are left with constructing the history of the relationship ourselves by adding time segment in external Satellite. This is required when we want to keep track of time-period validity of any relationship.

Tie may also have some form of attributes like when we deal with relationship having state/role information. Under this circumstance, one can add a Knot attribute to the Tie, becoming a "Knotted-tie" that can now support historical change of its state/role (the knot carries this state information). With this scheme, you should have at most one Knotted attribute outside the relationship identifier (or another un-identifying relation in extra inside the Tie) for keeping track of the state or role information of the relationship (or the extra role-key tie).

4. Knot ≈ Reference

Knot is similar to the Reference notion in DV. Both of them hold immutable data of lower cardinality. They normally represent a set of fixed categories, codes or static attributes and are normalised into separate table to avoid update anomalies and duplicating string values throughout the database. Main difference is that Reference in DV can store multiple strings whereas again in AM only one string code per table is allowed. It seems though that Knots are first class citizen, whereas in DV, Reference seems more like an ad-hoc optional structure. Knots also enhance the semantic of its reference, for ex, an Attribute referring a knot becomes a knotted Attribute, a Tie becomes a knotted Tie, both being either static or historised.

A small digression on Time

Temporal aspect of AM:

The notion of Time in AM is well framed semantically with clear definition. It provides support of 3 notions of Time:

Happening time. Corresponds to when some events/transactions occurred: as represented by specific Attribute(s) attached on Anchor
Changing Time. Corresponds to when some Attribute/Tie (when historised) are valid: as represented by the built-in “Valid-From” Time point in Attribute/Tie.
1. Note that no redundant “Valid-To” is added to close the time-interval in order to comply with the 6NF requirement. This avoids the nasty “update” during data loading process, and the potential risk of recording invalid time period. But this comes at the price of additional complexity for reporting valid temporal attribute at specific time.
Recording time. Corresponds to when the data was recorded into our platform (i.e. loading metadata): as represented by FK’s referencing a particular log entry in metadata Audit tables.

Externalising the “Recording time” has the advantage of separating data from metadata, where all timestamp found in the data structures correspond to functional dates whereas metadata is located outside these data structures.

This contrasts with DV where metadata-info sits alongside the data. Although the metadata timestamp “Load_Start” is not supposed to carry any functional meaning, it is confusing that most Query/Report examples given in DV documentation use this metadata timestamp for returning the valid data at given point in time! Time when data is loaded are rarely correlated with Time of data validity. This can happen in some occasion, like we have no other alternatives (see case 3 below). The way I see things, when we need to determine the right functional “Valid-from”, we can be faced with 3 different situations :

Dates do exist explicitly at source, so we load them as-is (typical of data having specified lifespan with effective/expiration date , etc..)
Dates do not exist explicitly at source but can be deducted from some technical dates t source, so we use these technical dates (typical example is source having LastStatusUpdate meta-info allowing us to know when the data has changed)
Dates do not exist explicitly at source neither can they be deducted from other technical dates, so that one is left with the only option of using the Loading time

Ideally, situation 3 is more the exception than rules in most EDW implementation.

Main advantages

Most advantages of the AM approach results from its highly normalised structure. This offers benefits such as:

Simpler Data loading involves cheap “insert” and no expensive “update/delete”
- Fault recovery is more easily done since all rows data have a unique batch load traceability and can be deleted on exceptional situation (when erroneous)
- Strict "No Null" policy : null values never appear, thus eliminating difficulties like interpretability, indexing, and other issues.
- Non-persistent elements never removed but flagged with Knot indicating their state of persistence
- Maintenance is overall improved and simplified

Data model evolution only involves extension of previous version
- New content always involve incremental addition of either Anchor / Tie / Attribute / Knot structures
- These New data structures may, sometimes, render some existing structure obsolete, then we can simply leave them as-is with no onward refresh.

Potential Higher Performance (here, it’s more tricky as the higher normalisation can also penalise performance, but the authors have highlighted a few valid points worth mentioning)
- Narrow Query (hitting small number of attributes) can have very good performance with the aid of Table Elimination
- Storage is highly efficient since practically no data redundancy exist, only a small number of indexes is actually needed due to the use of clustered table (aka index-organised table in different DB vendor) and also narrowness of table. On this last point, it'd be nice to see how column-store architecture could be leveraged to exploit the AM unique schema.
- Data loading practically free from locking and other concurrency issues as only “insert” are processed.

Reduce translation logic between different representation: the unique AM graphical representation is used for both conceptual and logical modeling
- Furthermore, the physical representation maps directly model entities onto tables, simplifying modeling abstraction levels or even eliminates the need for translation logic.

Martin

Wednesday, October 30, 2013

Working in Jordan

At work we have won a contract in Jordan recently. This gives me the opportunity to make longer stay in this country and enjoy more than just the few days typically spent at the hotel.

This is my first work experience in middle east, and there is a lot to learn being surrounded by a very different culture and habits than what you are used to… which is nice, I’m always keen in discovering other way of life.

I am working in Amman, capital of Jordan, with roughly 3 million people or about half the population of the whole country. What strikes any visitor about this city is its high density! You can appreciate this in various places as it is located in a hilly surrounding:

Depending on your location and elevation, you see buildings with very similar look and color as far as your line of sight can reach:

Shot taken from my hotel top floor

Picture taken from the archeological site Citadel (Jabal al-Qal'a).

The streets are very animated ... and noisy as lots of car are honking for no apparent reason. I later realised that a lot were directed toward me!! Yes I look pretty much foreigner and I certainly do not act like a local (I haven't seen many walking long distance on the street of the city, so all taxis saw me as a potential client).

You can really get a sense of the popular crowd scene during friday bazaar (or Souk to be more accurate):

The whole region's history traces back quite a long time in the past, and many historical sites can be visited within the city. For ex, the site Citadel is located on a “belveder” at the heart of the city. This site has witnessed a number of different civilisation dating back from era as far back as neolithic. It was also an important place during the Ottoman period and the Nabataeans, a very ancient arab civilisation that constructed Petra.

Various civilisation that were established at the Citadel

Apparently the lack of water in this region is not new as proved by the presence of these vast reservoirs. The old civilisation quickly had to find ways to capture the precious rain water so scarce in this region of the world (there is less than 20mm of precipitation during 7 months in the summer with some month with literally no rain!).

There is also the presence of what is believed to be Hercule's hand... one of the remains of probably the largest statue of the roman empire:

Citadelle-templeRomain-avec-restedeStatue

Although there are not many Hammams found in the city as you'd expect from a middle-east city, there are ruins of this centuries-old tradition in the Citadel.

The artefacts found on sites is astonishingly old…. some cave dates back from Bronze era!

There is also just below the Citadel a roman theatre which I found very well preserved… and they actually still host event there:

The Mosque Abdallah is one of the very few that non-muslim can visit. It was recently constructed in the memory of the very first king Abdallah of Jordan kingdom.

Although quite “meridional” I found the city's climate rarely unbearable at least until late spring. I doubt that during mid-summer we can enjoy cool night that I did. But still the air is quite dry and the fact that it is located on a high plateau helps mitigate the heat.

Further toward the east, we can go to the lowest point on earth: the Dead sea located below the -400m (the exact elevation is constantly decreasing due to the accelerated evaporation!). Going to this sea from Amman is quite an adventure, first the temperature will typically soar well over 10 degrees and the air pressure will also increase quite a bit! The night we went there, it was chilly in Amman about 18C but at the sea it was still around in the range 0f 28-30C degree! You quickly feel this place is special even before going to the sea. It must be related to its high pressure atmosphere, low allergen air content and low UV radiation!

The sea water is also highly mineralised that it seems we swin in a visqueous soup with smell that was closer to metalic than water ...very strange! Don't even mention the floating aspect which makes swimming a bit dangerous as your body has the tendency to pivot along its center of gravity, and point your head under the water line! And you certainly don't want your eyes to even contact this highly corrosive water.

My first "Selfie" taken by my front camera... the only functioning lens I had.

Unfortunately I just had a half-broken camera over there. I could not have my picture taken while comfortably reading a newspaper and lying in a gravity-free position (this is a must for any well respected tourist at Dead Sea). So instead I ended up taking my very first "selfie" (my daughter later made me realise this) as a tentative to keep a trace of the incredible -400meter mark!

In a different weekend, I also went to visit Petra! This was amazing and would deserve a full dedicated post.

Martin

Saturday, October 05, 2013

Beautiful Creta

Il m'arrive parfois d'écrire sur un voyage quelques temps après l'avoir terminé. Ici, je pousse l'expérience encore plus loin, car presqu'un an s'est écoulé depuis notre retour de la Crète. Ceci s'avère un excellent exercice de mémoire et en même temps me fait réfléchir sur les raisons qui m'incitent à écrire sur ces expériences familiales.

Avant toute chose, pourquoi ces écrits? Il y a bien-sûr l'aspect d'exposition ou d'exhibition au monde extérieur, si recherché par la génération facebook. Mais je ne crois que ça soit ma motivation, première, d'ailleurs je n'ai ni compte facebook, ni instagram, ni twitter, ... et je n'ai jamais fait d'effort pour médiatiser ou faire connaître ce site. Donc, plus vraisemblable, je crois que c'est un besoin fondamental de mémoire, mémoire pour moi et ma famille rapprochée. Je me réjouis à l'idée que mes enfants et leurs futurs enfants pourront consulter ces souvenirs de famille dans cinquante ans!

C'est vrai que je pourrais procéder de manière complètement privé, mais au rythme des changement de logiciels, d'ordinateurs ou encore de support de mémoire, le risque de tout perdre m'apparaît assez élevé. Et admettons que j'y mette toute l'effort pour conserver ces souvenirs, qui assurera cette conservation parmi mes générations futures... ok, ok pas sûr que j'aurais plus de chance avec ce "host".

J'ai donc choisi la facilité en évitant ces efforts et en pariant sur la survie d'un géant du web comme Google ... au prix d'une partie de ma vie privée. Quoiqu'ici, je choisi seul, sans contrainte et de façon délibéré, tout le contenu exposé, ce qui est loin d'être le cas du modèle facebook qui est souvent incite les gens à exposer inconsciemment et de façon insidieuse une bonne partie de leur vie privée (ex. lors d'échange et de commentaires extérieurs).

Après ce long aparté, revenons au voyage. Nous avons donc été en Crète à l'automne 2012 et ce du côté nord-ouest de l'ile tout près de Chania (La Canée). Mes intérêts premiers, pour les voyages, sont plus tournés vers l'observations de phénomènes naturels que celui des traces laissées par l'homme. Les transformations géologiques aux échelles de temps non accessibles aux humains restent pour moi, sans commune mesure aux constructions humaines!

J'avais pour objectif de voir deux points d'intérêts: les gorges de Samaria et les oliviers de Vouves, les premières étants les plus longues de toute l'Europe et les seconds étant apparemment, les plus vieux (et peut-être un des 10 plus vieux du monde)! Mes enfants avaient plutôt émis le souhait de voir le temple supposé du roi légendaire Minos (et le fameux monstre Minotaure), situé à Knossos (près d'Heraklion), et ce suite à leur cours sur la mythologie grecque. Sinon, je ne connaissais rien d'autres de cette île, sauf sa réputation d'offrir une des cuisines le plus santé de monde: le régime crétois!

Donc, voici un résumé de ce voyage, en photo montages, devenus en quelque sorte l'extension de notre mémoire d'humain ;-) !

Voici le fameux olivier à l'âge vénérable de 3000ans
... qui n'est pas touché devant un être vivant de cet âge?

Et oui il se visite même de l'intérieur comme ont pu constater mes enfants!

ceci est le dernier escarpement rocheux toute à l'ouest de l'île,
de l'autre côté se trouve un trésor naturel nommé le Lagon de Balos

Les chèvres de montagnes (souvent nommé Kri-Kri) arrivent miraculeusement à survivre sur ces rochers arides...

La particularité de l'ile est sans aucun doute son eau crystaline...

..ce qui a permis aux enfants de découvrir les joies du snorkeling

et de profiter de la température encore clémente de l'eau
malgré un mois d'octobre bien avancé

Dans le vieux port de La Canée se trouve des vestiges de la conquête islamique turc (église convertie en mosquée)

..le climat aride donne des teintes particulières aux nombreuses falaises plongeants dans la mer

une vue du port de La Cana situé au sud de l'ile, partie difficile d'accès!

... comme on peut bien le voir ici, côté sud de l'ile est plus déchiqueté et beaucoup plus
sauvage que la partie nord

D'ailleurs c'est là qu'on y retrouve les gorges Samaria...

Je me suis rendu seul à cette gorge, mais la montée en valait le coup...
au point le plus serré, seule quelques mètres séparent les deux parois

les traces d'érosion sur le roc laissées par l'eau se retrouvent tout au long de la gorge

Nous nous sommes rendus à ce rocher à la demande de la mamie ...
qui, est en fait, le lieu de tournage d'un film marquant de l'époque des années 60!

.. qui a reconnu le film ... et oui "Zorba Le Grec"

Curieux arbre avec des fruits énormes aux formes étranges de courges?

Pour finir, voici LA photo du voyage prise ma mère!
le coucher du soleil à rendre jaloux les meilleurs photographes

Pour la petite histoire, nous avons malheureusement raté notre visite du temple de Minos.., avec les difficultés financières du pays, les heures d'ouvertures des sites historiques ont été réduites. Nous avons raté de quelques minutes l'ouverture du site, désolé les enfants!! Par contre, ce n'est que parti remise, car on devra revisiter cette île, surtout suite au visionnement d'un documentaire qui met en lumière la disparition de la civilisation Minoenne! Selon de récentes découvertes, cette disparition serait due à un tsunami... à regarder!

Martin

Tuesday, August 06, 2013

BI “ideal” platform

When dealing with BI projects, you are much more likely to work under the constraint of existing platform than building one from scratch. This means having to deal with existing idiosyncrasies, sub-optimal architecture, complex and big data models, tangled ETL loading dependencies and scheduling, confusing or mixing different approaches and architecture, etc.

This post presents my “ideal” platform blueprint I would use when no existing platform is in place which compromises good design principles. This ideal BI platform leverages the Multilayered architecture's concept borrowed from the more general Design Pattern in software design engineering. The advantage of layering is obvious enough:

Visibility: any layer can only depend on the immediate preceding layer
Functional role: each layer focus on clear and none contradicting roles
Volatility:
- Delivery (highest) layer are made up of components that are dependent only on end-user BR changes (i.e. new “business” rule should only affect Delivery layer)
- Raw Consolidated (lowest) layer are made up of components that change according to technical/system changes happening in operational world
- Business (middle) layer isolate lowest and highest layers from each other and from their life changing cycle (here the Business-oriented data of the Consolidation layer is the best place to design this in-between layer)

All layers control and isolate data dependencies : each layer is only sourced (or dependent) from its direct layer underneath, and layer only expose data to layer above.

The following diagram depicts an high-level view of this idealistic BI platform:

Obviously, this diagram’s focus is on simplification, detail of less important are not represented; ex. possible staging areas in between each layer, metadata layer (for ETL business rules, KPI and Report definition, ..), delivery layer usually has an eclectic choice of architecture based on technologies and tools.

Source Layer

Layer Definition

This layer is simply all systems collecting and generating data. Not much can be said here as this layer does not fall under the BI group’s responsibility. However one design criteria note worthy concerns its physical implementation and is meant to avoid disrupting the on-going operational process :

We should never target the real “Production” system, but Leverage whenever possible some data replication/copy mechanisms (hardware or software -based)
Persisted Staging Area or ODS are different ways for BI to get their own our copy of source data with minimal impact on source

Integration/Consolidated Layer

Layer Definition
This layer is the one closest to the original Inmon’s def: “subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions”. Some may refer to it using different name like Integrated Layer, but the fundamental idea remains the same: consolidate and store multi-source data in one database repository.

All data generated by the organisation’s activity and business processes (automatic, semi or even manual) are conveniently centralised here. We can further distinguish two sub-layers :

“Raw” layer which captures all data as-is to ensure full data lineage and traceability
“Business” oriented layer which adds business-provided data (reference, grouping, hierarchical structures..), and provides some early data transformation for reducing downstream transformation complexity for the Delivery layer.

Functional Role of Raw Integration-Layer:

Provide the first integration and consolidation point of multi-source data
Capture and archive all data as-is (no modification nor transformation) : Load 100% of data 100% of the time
Provide the full historical view of the detailed data: “System of records”
Identify and load master data source for important and recognized business entity
Attach “platform-neutral” Surrogate keys to all natural keys of business entity
Align entity across different source
Standardise through an easy and repeated data modelling patterns all data

Functional Role of Business Integration-Layer:

Data structure designed to apply business rules transformation
Bridge and link non-conformed entities across data sources
Provide Grouping, Hierarchies, and other business-provided data
Create structure to decrease the Delivery layer transformations complexity
Fragment complex transformations rules into smaller, more manageable and auditable steps
Recognise and flag conformed Entity across different data source
Flag “erroneous” or “wrong” data as such (rejecting data to be done in Delivery layer)

Layer-Design Criteria (raw and business)

Use a uniform Data modelling technique
- Do not merely copy and add timestamp/audit data to source OLTP Data models
- Use a standardized approach and stick to it for every data source integrated
- Avoid data model diversities and idiosyncrasies typically found in source layer (harder to learn and more costly to absorb)

Use a Highly Normalised data structure*
- Methodology like Data Vault or other similar approaches Anchor-modeling
- Discourage usage of large data structure (these structures result in important downstream dependency, rigid relationships, and are associated with unstable and hard to maintain ETL logic)

Favor Easy Data Loading over Easy Data Reading (compromise to pay)
- Must be Easy and Quick to load new data structure
- May be Hard and Messy to query existing data structure

Allow for growing organically/incrementally
- Start small with one source and a few data components
- Build small incremental step gradually as you ingest more data source

Accept gracefully system source changes
- At best, every change should simply “add” on the existing (no dependencies downstream impacted)
- At worst, every change should only affect a small number of “localized” data structures (very limited dependencies impacted)

Accept gracefully addition of new source
- Each new source is integrated independently
- “Bridge” these independent “universe” (data source) through easy flexible model structure, like the concept of Link in Data Vault.

Support full traceability and auditability of source data
- Don’t “fix” and don’t do any data cleansing
- Leave “erroneous” or “wrong” data AS-IS (these come for a reason)

Favor understandability through simple and repeatable data model patterns
- Both raw and business layer must leverage identical data modeling techniques and standard
- Data model easily recognized and understood and by “unaware” (new) eyes

*Highly normalized structure :

To keep track of a temporal database using database normalization, it is well accepted that a normalization less than 6NF has shortcomings (see 6NF Conceptual Models and DW 2.0). The 6NF form ensures that every non-key gets stored into its dedicated Relation. In this purest form, we’ll end up with all Relations having one Key plus one non-key attribute, hence making the Relation irreducible (keeping track of history is simply managed by adding time semantics).

Practical approaches normally accept some kind of compromise to this purest form (eg. DV would combine multiple attributes within a single Satellite). But they still benefit from major advantages of higher normalization:

+Model much more Resilient to changes in the outside world of the EDW (able to non-destructively evolve the model)
+Model are Historization-Ready making all attributes potentially temporal (possible by design)
+Model limits data redundancy even for temporal and time-variant data
+Model can support more performant Zero Update strategy
+Model results in the elimination of NULL and support early arriving facts (entity loaded only depend on an Identity)

Delivery or Presentation Layer

Layer Definition
This layer provides the Business view and/or Departmental view (different version of the truth) of the data. It is accompanied with Reporting tool and any Data Analytics platform used by business people inside the organisation.

Delivery-Layer –Functional Role

Support complex business rules implementation
Provide flexibility to change data transformation rules
Offer a sensible view to the business community of all organization’s operations and processes
Offer a cleansed and standardized view of the often messy operational data
Provide support decision-making tools and platforms
Provides data to allow for more advanced analytical and mining activities

Delivery Layer -Design Criteria

Favor simple and easily communicable data model
- Employ recognized Methodology such as dimensional and star-schema
Favor easy Data Reading over hard Data Loading (compromise to pay)
- Must be Fast and Easy to query data
- May be Hard and Messy to load data
Accept gracefully new Hierarchies, Grouping or other Business provided-info
Fully support multiple “version” of the data (departmental or organizational view )
Accept gracefully business rules changes (can be frequent)
- At best, every business change should simply mean to re-construct the Virtualized layer (no re-loading necessary)
- At worst, every business change should only affect very “localized” data loading jobs
Favor Communication of data model tied to Business terms and meaning
- Data model easily understood and accepted and by “new” business eyes
Implementation more dependent on Vendor tools: follows platform features, choice and characteristics

Martin

Wednesday, May 08, 2013

NoSQL: how disruptive for BI?

It seems the "NoSQL" term is now seen and used everywhere. This post discusses this new technology, along with its impact in relation to BI world.

Definition

Trying to find a clear and consensual definition is always a challenge with over-hyped and buzz words more used as a marketing advantage. Recently, well respected authors Martin Folwer and Pramod Sadalage wrote a clear and informative book on the full spectrum of noSQL technologies called "NoSQL Distilled: a brief guide to the emerging world of polyglot persistence".

According to the book author, no well accepted definition exists for NoSQL, the original name from the NoSQL Meetup was "open source distributed nonrelational databases". Rather than focusing on definition, the authors rather focus on a few common points all shared by NoSQL databases:

Often run off commodity cluster computing
Support of large or extremely large data volumes on clusters
Query language is not necessarily SQL (usually a close derivative language is used)
Usually Open-source and prominent in Web-based applications ecosystems
consistency is not done relying on relational ACID principles (an effect of running off independent cluster nodes)
Use Schema-less, i.e. no static data model structure mandatory

What about Cluster-based Solution proposed by Relational vendor
The commodity cluster models lead to a much cheaper and more scalable solution (scaling-out with much small nodes) than the scale-up alternative (ever increasing size of big monolithic and expansive server). The cluster version of relational databases (ex. Oracle RAC or Microsoft SQL-server) is not designed to be run off multiple independent nodes, as it actually relies on a common and share disk subsystem. Their model leads to fast and performant throughput with cluster-aware filesystem writing to a single highly available disk subsystem. This single point of failure is avoided in robust commodity cluster where a server node may go down (both server hardware internals & disk)) without impacting the cluster on-going execution (full failover and high availability is guaranteed).

NoSQL database: Metamodel classification

The author proposed a neat and convenient way to classify current NoSQL database according to how they store data, i.e. the metamodel used. The relational metamodel is based on relations (tables) and tupple (rows), current NoSQL uses rather these forms:

Key-value: Data stored at Aggregation* level using an "opaque" form

Each aggregate has a key or ID to get at the data
Each aggregate is opaque as data is stored using unstructured blob
(+) use to store whatever we like with just a constraint on size
(-) less flexible in querying & updating (cannot be done on part of a value)

Document: Data stored at Aggregation level using a visible and explicit semi-structured form (typical of JSON)

Each aggregate may have a key or ID to get at the data, but not necessarily as query can be done on any fields within the aggregate
Each aggregate is stored within a defined structure imposing a constraint on what's data is allowed
(+) more flexible in querying (part of a value can be queried or updated)

Columnar: Data is stored using a two-level map with grouping of columns including all rows.

Data is identified with a row-ID followed by a column-name normally part of a column-family
Query sequence pattern: Row-key --> Column-key --> "values"
(+) optimal with read-oriented app (ex. BI)
(-) not ideal for writing-oriented app

Graph: Special database using graph metamodel (i.e. nodes & edges) to store records with complex interconnections

Each entity or nodes can be queried in relation to interconnected nodes typical of social networks, product preferences..
(+) querying is optimal compared to the relational model which require FK relationship and expansive join operation
(-) focus on querying results in more complex & less performant data insertion
(-) less suited with storage across distributed cluster nodes

*The term aggregation is used to refer to data unit more complex than just a list of values or tuples. Typically, data unit is consisting of a complex set of records with associations and nested structure, all of which must be manipulated atomically (ex. think about one customer with all related addresses, or Order with all item details, payment info and customer attached to it).

Technologies

Looking at impressive number of open source implementations available (the list just names a few out of the 150 databases currently listed here), we have an indication of the potential growth of this technology, but also an indication of its maturity level... only a few important implementations could likely survive!

A lot of NoSQL database stores may not have clear-cut storage metamodel fitting exactly the definition of M. Fowler et al. metamodel taxonomy. So depending on interpretation, we may find some stores in various categories, please refer to NoSQL online reference for other interpretation and up-to-date data on this active field.

Key-value Store:

Redis
Voldemort
Oracle BDB
Amazon SimpleDB
Riak

Document Store:

Apache CouchDB
MongoDB
BaseX

Columnar or Column store:

Cassandra
HBase (Hadoop)
HyperTable
Apache Accumulo
MonetDB

Graph Database:

Neo4J
InfoGrid
Infinite Graph

BI perspective

I consider Graph database and Column-store to have the greatest impact in BI and to have the potential of being disruptive to current BI market.

For Column-store, this is easily understandable as BI is lot about scanning and other read-only data access where often just a few fields may be fetched out of much wider and flat record. As a testimony of this, we can look at the number of commercial vendors purchasing existing solution and/or integrating column-store into their own platform :

Sybase IQ now part of SAP, probably one of the original pioneer of columnstore in commercial scene
Vertica (purchased by HP), the 5-year old vendor specialized and fully dedicated to column-store
Greenplum (purchased by EMC) with their "polymorphic" data storage
Paraccel (very recently purchased by Actian) combine column-store with an in-memory engine
Oracle with a twist that they refer to as Hybrid columnar (rows still being used)
Microsoft SQL-Server proposes a compromised approach called "columnstore index" where data update is no longer permitted!

As for Graph database, I think great opportunity is possible by mining the relationship and interactions contained explicitly or implicitly within these graph content. Lots of research are currently done on work aiming at providing BI-style analysis against Graph-based data. Here're two great examples: http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/beweb2012/a3-bleco.pdf or else http://perso.ecp.fr/~aufaurema/Graphs%20for%20BI%20(printable).pdf

Martin

Friday, March 01, 2013

Data Vault model: Mobile Telecom example (part-3)

Here's my last note concerning "Data Vault model: Mobile telecom example" where some points and issues are discussed.

Open Points and Discussion

→Business Key inconsistencies
Among all OLTPs used as source, it is very unlikely that they will all share exact same Key(s) to represent same business entity ! If we are lucky, keys will vary in format only, but more frequently, keys can be totally different.

This challenge is referred in the DV community as the "Business Key Alignment", where raw keys coming from OLTPs must somehow be integrated (and thus aligned) toward the universal EWBK.
Next point discusses these points for the most important Key of all Telco: Subscriber Business Key.

→Subscriber identification
In telco, any subscriber is uniquely identifiable through its phone number (MSISDN) but only at given point in time (subscribers can change phone number!). This makes MSISDN only a valid temporal-based Key. We are normally more interested in tracking people than number, then we certainly need an alternative business key. Billing systems already accounted for these changes of numbers, so these should be used as the Master Data Source for our Subscriber Hub’s Business key. Other transactional systems in telco, will often make use of temporal-based keys, that is fine in for transaction-based context that always refer to a point in time. Another example of this is the IMSI which is the unique subscriber number soft-wired at the SIM card level, and somewhat more stable than MSISDN (still an issue when subscriber loses his SIM).

These points suggest important properties needed for Hub Business Key:

Business Key should be immutable
Business Key should come from one Master Data Source
Business Key may not always be well recognized by business users, especially not at the sacrifice of immutability.

So here, we decide to choose as Business Key, the Subscriber-Id having these properties, and maintained in our Telco Billing platform (Master data source).

Now, how should we model the other mutable Keys? They will likely be needed while we integrate other OLTPs platforms which rely on them. To be able to map transactions, links coming from these others platforms, we need to record the state of the subscriber mutable keys (ex. MSISDN) according to the chosen Hub key. As such, we propose to hook them up through the use of dedicated Satellite giving their key values at any point in time (given a Business Key).

Fig Data model around the Subscriber Hub

→Satellite Design Criteria
When it is time to model Satellites around each Hub we have quite a lot of freedom. We can think of a spectrum modelling style where we have at one extreme One Satellite per Hub, and at the other extreme One Satellite per Attribute (from completely denormalized to highly normalized).

We can consider these criteria for modelling guidance :

Rate of Change (group attributes of similar life-cycle)
Types of Data (group attributes of similar nature)
Record source system (group attributes by their source) &nbsp

However, I would add another important criteria: Clarity and Importance. It is well accepted that some attributes play a much more crucial role than others. Take for examples, Subscriber which are commonly used to derive important downstream semantics, like business logic, segmentation, grouping, etc (compare these with Attributes such as Address-data like Street Door-no).

So for these crucial Attributes, we model them in their own separate Table so that we can clearly spot them while browsing our DB schema metadata (see ex. SAT_SUB_STATUS in figure 2). And more importantly, mixing attributes will lead to additional records (due to historical change tracking). Our satellite will time slice against attributes of interested, and thus optimized for faster query.

→Call or Network Event
How should we model CDR’s data? Two issues arise when one wants to model these :

Business keys that can possibly identify each record not straightforward (complex and also prone to changes in time)
High data volumetry and burst rate may impose additional constraint on modelling

Can we classify a Call (Network event) as being a recognized business entity? It has clearly some characteristics, however it cannot easily be identified through limited business-keys. One could imagine that the following key combinations: 1) Calling MSISDN, 2) Called MSISDN and 3) Event_start_date.

These seem to satisfy the uniqueness criteria (at least for scenarios such as receiving voice mail at same time as making another phone call, etc...), but things get much more complicated now with the introduction of smart phone and their multitude protocol and access mode (HTTP, email, VoIP, VPN, ftp, Peer-to-Peer, TV streaming...). So to me, the Network Event cannot be modelled as a Hub, but rather as a pure Link.

Now, concerning data volumetry & burst rate. If we model these as Link and normalize all contextual data into numerous Satellites, we end up with many join operating on extremely large tables ! Furthermore, we can’t even take advantage of star join structure (with much smaller dimension tables having bitmap indexes surrounding the big fact table), as all satellites have the exact same size as the Links tables. In CDR’s context data are merely static, so there should not be any need to design Satellite to keep track of change in time.

SO here I'd argue we can loosen Link modeling principle and design what is called a Transaction Link, I think it makes perfect sense for our CDR’s scenario, as illustrated below:

Fig Network Events are modelled as a Transaction Link

Also worth mentioning is that most relationships are optionals, this is typically true when integrating multiple source systems (rated calls from billing, unrated from mediation, etc.). Depending on the event context, and there may or may not be a Rating_plan (unrated calls like incoming have none), Cell-id, Called/Callee are both required only for more traditional calls/sms, Cell-Id (caller may be roaming outside telco network), etc.    I’m not sure we should model these missing relationships as done with dimensional modelling (specific exception records exist in dimensions to avoid the missing Null FK’s).    Here, I’d recommend to simply leave the Null FK which reflect the way things are recorded at source system.

→Relationship with lifespan (temporal-variant)

DV models Link as (timeless) existence of a relationship, hence they record any association as seen from source, similarly as we do with Hub key.    However, some relationships are meant to have limited lifespan scope.   It then becomes critical to keep track of this lifetime as it validates the time period the relationship holds true. Not much is said about these.... could however find the following (from this cheat sheet online resource):
"We can end-date links using an ‘exists’-attribute in a satellite"

Therefore, we propose to create a Validity lifespan Satellite data around the Link entity to keep track of time period.   This is similar to contextual data around Hub which are always “ time-based”, except now these time-based is no longer metadata but rather functional, i.e. correspond to contextual data of the Links (in fact for this particular Link, no info is conveyed except the relationship itself and its time period validity).   In the Validity Satellite, the attribute Valid-from and Valid-to must correspond to time period, however this rule is not enforced here in case the Billing source-system its business rule in future (subs only has one Rating Plan at any time):

Fig Link between Subs and RatingPlan with limited timespan

→New Hub or Satellite attribute

What justify promoting a single Attribute into a recognized and separate Hub entity? In theory, we learnt that Hub must be an entity recognized in the organization with clear business key(s). However, there are quite a few attributes that have these characteristics: Retail plan, Account type, Promotion, Campaign, etc.... identifiable code associated to them). So I believe we need clearer Design criteria providing guidance to determine what can remain simple Attribute vs what needs to be promoted as a new Hub. I propose these simple criteria:

Pervasiness: Does the Attribute appear in many different source systems/contexts. If yes, then consider the Hub so that we minimize the negative impact of having non-standardized and non-conformed codes spread throughout the EDW data model.
Relationship: Can the Attribute be used in other relationship than its unique associated Hub. If yes, then consider the Hub so that we can capture these relationship and association elsewhere through the standard way : Links.
Evolution: Maybe, a data source newly integrated includes an important business concept still only used, for now, in isolation. If yes, consider the hub as newly fashionable concepts tend to proliferate quite rapidly these days...

Regarding these new design criteria, it is easy to see why an attribute like Retail-Plan (the identifiable code), was modelled as a Hub, since it is frequently used in many contexts (relate to Subscriber, relate to Campaign, can be found in other system, etc.). However, one could state that Address have same characteristics, so that it could as well be a Hub.... yes but unfortunately it rarely meets the 1st Hub design criteria: unique and easily identifiable Business key (unless your a shop specializing with addresses, do you have available in your system a clear and unique Business key for Addresses, most likely not?).

Martin

Thursday, February 28, 2013

Data Vault model: Mobile Telecom example (part-2)

This is the next part from a previous notes where I introduced "Data Vault model: Mobile Telecom example". In this note, I discuss the model following each step of the methodology.

Business Entities

Let’s start first with the identification of some of our key Business entities:

1. Customer

Definition: party that gets billed for the service/product offered by the Telecom
Characteristics

Business Key: BAN (billing account number)
History Evolution: quite stable as one BAN cannot change for the same account however, one subscriber can have different BAN when migrating account type (e.g. prepay to postpay).
Complexity: recursive relationship between BANs often needed to capture BAN hierarchies (BAN → parent BANs → root BAN) for large B2B customers.

2. Subscriber

Definition: individual attached to a particular subscription, i.e. MSISDN (mobile number that provides telecomunnication services).
Characteristics

Business Key: MSISDN, the telephone number given to the individual.
Key alternative: this Business key is not stable, as same subscriber typically changes MSISDN, so Billing systems usually provides an alternative key. This alternate subscriber-ID must be used instead.
Complexity: Not all OLTPs have same identifiable key, any systems not so concerned with history usually tracks the subscriber with its MSISDN at given Timestamp, whereas others come up with more stable key like Subscriber-ID or technical key attached to the SIM card like IMSI.

3. Product

Definition: In telco, the product does not correspond to your typical and tangible goods but rather associated to some networking features leveraged by your handset (e.g. SMS, MMS, call, voice mail, highspeed data, etc..).
Characteristics

Business Key: ProductCode (usually short text not so well standardized throughout the organization).
Complexity: Conformance and data integrity problems as each system may use its own, and usually involves some form of hierarchical representation.

4. Service

Definition: Service is usually tightly coupled with the notion of product. Subscribers purchase service (either inherited through some rating-plan or purchased individually) that include specific Product along with pricing and detail features. (e.g. SMS option, Voice mail international, Local calls special rating, unlimited data volume, free calls number, etc..).
Characteristics

Business Key: ServiceCode (again short text not so well standardized but mostly used by the Billing system).
Complexity: Usually involved some grouping or combination of more detail features related to network and Intelligent network layer (IN), and also linked to rating information used for billing purposes.

5. Handset

Definition: Device through which the subscriber access network product.
Characteristics

Business Key: TAC (standard code describing handset model manufacturer etc.). Not to be confused with IMEI which uniquely identifies each physical device (handset id).

6. Invoice

Definition: Bill statement sent regularly to the Customer for related subscription.
Characteristics

Business Key: Invoice number (used mostly by billing system).
Complexity: typically applicable for a subset of the entire customer population (postpay), as prepay is billed in realtime according to usage

There are obviously many more entities that quickly come to mind (e.g. Sales, Call-Center, Agent, Contact, Campaign, Port-in/out, Visits, etc..), but let’s stop here for the sake of simplicity.

Business Relationship Entities

Again, let’s give out a few examples of relationships that should cover some different modeling scenario. Telco are very complex organization with a large number of heterogenous systems, so these example are clearly over-simplified:

1. Customer hierarchy

Definition: Large B2B customers have usually organizational hierarchy struture for managing employee payment responsibilities and complexity
Characteristics

Data Model name: LNK_CUSTOMER_HIER
Natural recursive relationship: Each Account is associated to a direct parent Account (a single node) until we reach the top account or Root BAN. (BAN → parent BANs → .. → root BAN).
Links modelling: these recursive relationship are best modelled with a parent_key FK’s pointing back on same table. However in DV, linkage structure stands on its own, so these Links must be stored separately. The Child_key acts like the driving key of the relationship (is used to determine the end-dating of the link).

2. Subscriber RatingPlan

Definition: The subscription is associated to some form of rating plan.
Characteristics

Data Model name: LNK_SUBS_RPLAN
Natural relationship: Any subscriber must have one and only one RatingPlan at any point in time. This plan specifies the exact rating of all billable communication items.
Links modelling: this is modelled using a typical Link structure between Subscriber & RatingPlan accompanied with lifespan From/To Date in Satellite (see discussion “Relationship of limited lifespan” ).

3. Billing Invoice statement & Charge item

Definition: Billing statement is invoiced to each customer account at each end of bill cycle.
Characteristics

Data Model name: LNK_INVOICE & LNK_INVOICE_RECHARGE
Natural relationship: The Link relates Bill Statement with the customer, and also with the 1-to-many Subscribers attached to this Customer account. There is also a Link which itemizes each charge by Product (and depending on RatingPlan) associated to one Invoice (LNK_INVOICE). Things may vary here, but let’s keep it simple..
Links modelling: The relationship are regular Link between Customer, Bill-Statement and Subscriber for the Invoice. And for Invoice-charge, we tie it to the Product (item being billed).

4. Recharge Activity

Definition: Prepay subscriber must recharge (top-up) their balance as they go.
Characteristics

Data Model name: LNK_RECHARGE
Natural relationship: This recharge activity is associated to various channel and payment method.
Links modelling: A simple Link structure with FKs referring to Subscriber and Dealers (i.e. the channel through which the subscriber purchased new credit).

5. Network Event (calls, sms, voicemail, internet)

Definition: Transactions corresponding to network events which are consumed by the telco subscriber.
Characteristics

Data Model name: LNK_NETWORK_EVENT
Natural relationship: Depending on the communication product (voice calls, sms, mms, gprs data, umts data, etc..), various entities are involved and can be linked.

Contextual and Descriptive Data

These are meant to convey the time-variant and historical part of each attribute captured by the EDW. Modelling these as Satellites, we allow for timeline slicing and taking snapshot of how things (i.e. business state) were at that time. The main advantage of Satellite, is we usually don't know in advance all features/attributes of interest needed... and even if you knew, things evolve at such pace that most likely you'll need to revisit them soon. With Satellites, you can build them incrementally, bit by bit according to business needs and scope changes without affecting existing structure, as long as you keep them normalized and fine-grained. No longer needed Satellite can simply stop being loaded, while new one get created.

As my goal is not to capture a complete Mobile EDW, I've only included a few Satellite here for
illustration purposes. These would be typical features/attributes associated to Hub/Link entities. They are not meant to be exhaustive but rather convey the meaning of the different entities.

The rest of the note is presented in the part-3, where some details and highlights are presented related to design decision and open issues that got raised during the modelling.

See last part here: Data Vault model: Mobile Telecom example (part-3)

Martin