Showing posts with label Data Warehouse. Show all posts
Showing posts with label Data Warehouse. Show all posts

Monday, March 14, 2016

Cloud vs data warehousing


Things are going quickly when software moves to the Cloud. For the software vendor, there is no more need to maintain x number of code versions to fit every one's preference in hardware. And also the adoption rate can go very fast as the entry cost is pretty much nothing for any new client interested in trying out the offsite platform.

I'm currently spending effort to get up to speed with Cloud-provided DW solution. They could be disruptive to a conservative industry like DW. Here are a few factors I see as impacting the adoption rate.

Cloud computing trend

In recent years, the trend of cloud computing has increased both in scope and popularity to reach nearly ubiquity. It started off with simple needs like storage space, evolved into more advanced offers like computing resource letting user run their own software (PaaS) and has recently moved up the abstraction ladder with complete solution and service stacks being offered (SaaS).

In the realm of Business Intelligence and data warehousing, this trend did not pick up so rapidly (political and security worries) but things are quickly catching up with recent offering like RedShift (Amazon AWS), SQL Data Warehouse (Azure Microsoft) and Elastic Data Warehouse Service (Snowflake). There are many reasons we believe this will grow a lot more, but for the simplest and quickest answer: cost! Doing Business Intelligence and data warehousing projects with on-premise platform is an expensive adventure! (link)

Data Deluge and new Tools

Business intelligence domain has also increased in complexity recently following the advent of what has been called the data deluge. The appearance of a multitude of new data sources from social media, connected device, Internet of Thing, has challenged the relevance and adaptability of traditional BI solutions. BI tools and platforms were designed mostly around data generated from operational system, where data type (structured) and volume were still manageable.

Besides having created its own set of new buzzwords like Big Data, NoSQL, Data Science and Predictive Analytics, current trend has clearly been disruptive in terms of data type (semi-structured and unstructured) and data volumetry/velocity.

The industry and open community were quick to devise new tools borrowing ideas from Google’s scalability architecture (commodity hardware and map reduce paradigm) that focused mostly in handling the sheer data volume and velocity or burst rate. These tools have been loosely defined and grouped under the category NoSQL. No matter how good these tools are at handling the 3V of big data, they all fall short in meeting the most important aspect of any BI analytics solution: the Data integration bit!

The introduction of these new these tools increases significantly BI environment complexity and thus requires more than ever formal and proper architecture blueprint and principles. Otherwise your entire BI platform integrity is jeopardised and will cause on-premise TCO to increase even more. The solution to that is to rely on Cloud solution provider to provide with solid architecture expertise and to offer new cost model with no initial cost investment.

Data Integration & architecture (more relevant than ever)

Anyone today is seeking to extract useful information from raw data in order to gain knowledge and make better and informed decision (i.e. data-driven decision). No matter how valuable your data asset is, there will be many shortcomings if it is kept in its raw format.

Raw data is inconsistent, incomplete, plagued with error, unconstrained, unformatted, outdated, etc. Modeling data imply applying some rules/checks, standardization and harmonization throughout different sources. Data integration implies doing quality assessment and data correction. Hence only data integration and modeling can lead us to “desired” qualities of data, i.e. the five C’s of Data (see Rick Sherman BI guidebook):
  1. Clean (dirty, missing data treated) 
  2. Consistent (knowing which data version is right) 
  3. Conformed (enterprise-wide common and comparable data) 
  4. Current (data latency adapted to context) 
  5. Comprehensive (breaking departmental or section sillos) 

It is real hard work, time-consuming and requires a lot of thinking… but there is no alternative. Don’t believe in any magic solution or silver-bullet tool that promise to transform raw data into actionable information!

We are faced with a multi-source data integration problem, and the new source of data deluge should be integrated and complementing the existing and traditional BI data layers in some form or another (consolidated/aggregated), and not merely bury us with massive and unwieldy data, i.e. the deluge!


Transforming your raw data into knowledge (the "golden" path: raw data -> information -> knowledge) is conditioned on preliminary work that focus on analysing and modeling data. This is especially relevant with semi-structured data found in BigData where no explicit model and constraint exist. There are usually only implicit rules (schema-less) making raw data much harder to work with.

How can Cloud-based solution help you on that...along with your Cloud solution provider, you can find a highly skilled and specialised team dedicated in data integration and the particular Cloud technology. Then you can avoid having to find and train your own team for that purpose. Everyone should focus on its own core competency and avoid building/duplicating specialised skills ..

Martin

Saturday, March 08, 2014

Anchor modeling (part-2)


In previous post, I describe the Anchor modeling approach.  Here I will attempt to comment on its merit and providing a very small example of a data model .

My take on it:

Although, I never implemented AM on a large scale project myself, I can certainly see a few points that are very attractive at first glance.  Here are some that quickly come to mind :
  • The model prescription really goes up to the 6NF (except when we decide to include multiple non-identifying Anchor FK or knotted attribute into a single Tie).  This 6NF implies that no other attributes are stored along with the Anchor surrogate-key, not even the natural-key(s).  Potential benefits are :  
    • Improved maintenance when dealing with evolutive or unstable natural-key.  This follows the principle of model extension where we can accommodate new or change of natural keys in the future.  
  • The modelling prescription details specific and strict rules .
    • Advantage: This reduces the likelihood of our model to go “wrong”.   As a data modeller you are offered much smaller degrees of freedom, at least after you have defined Anchors and Ties.  Modelling Tie may be more flexible, as it seems, you are free to add more or less non-identifying keys in the Relationship. 
  • The model guideline recognises the importance of State often describing relationship (these sort of state are always mutually exclusive and exhaustive).  
    • Advantage: Tie can be Knotted for storing this data in a straightforward way with no need to come up with external structure besides the Knot.   In DV, you’d need external Satellites for storing the attributes and their time-segment.  This feature is convenient as in real-world project we often see this pattern where relationship goes through some form of lifecycle changing state.  
    • Disadvantage:  The flip side of this is that once your relationship is built like that, then it becomes less flexible and can hardly accommodate new stuff...In this respect, DV Links modelled as immutable intersection of Keys offer better longterm flexibility.
  • The model guideline also recognise the importance of having at most one key, called Role, in the relationship outside the Tie key identifiers.  
    • Advantage: For Historical Tie, this makes explicit what changes have triggered an update in the relationship.
  • The model guideline recommends keeping Tie’s width small (#of Roles).   In practice, we should try breaking down larger Tie into smaller constituents.  This helps in regard to : 
    • handling of late arrival facts (or asynchronous) which would delay the capture of the relationship as recording can only happen once all roles are known;
    • decreased data redundancy generated from historised relationships with frequent changing state/property.  In this scenario, lots of nearly identical rows arise since all roles, except the non-identifying state/property role, stay identical.
    • increased stability:  larger relationship are more likely to become deprecated rendering your AM model less evolutive with more frequent replacement of existing Tie with new ones.
  • The Knot is a primitive modeling construct holding important referential data:
    • Advantage: static set text values are easily recognised.  We can avoid disseminating these text values, typical of lesser normalised structures, and thus reducing the data redundancy throughout our data model.
  • The model guideline proposes a clear separation between metadata and data.  Auditing info is stored and maintained externally from data into their own metadata structures.  Metadata is referred to by FK pointing to a global Audit table in a standard and global way.   
    • Maintaining data separate from metadata leads to cleaner data structures and less chance of misinterpreting timestamp data fields.

However an AM implementation has its drawback, the most important one being that we probably need a complete toolset and a separate access layer to help us manage the explosion of data structures (table, views..).  No one would want to interact directly at the physical level any real size AM implementation.

Other important issues could come from DB vendor's limitation which may not support some functions required (ex. Table Elimination is almost mandatory for query optimisation).   Bottom line, you certainly need a lot of experiment before leveraging this technique, and see if the additional complexity can be managed and mitigated in a large scale implementation.


A small Example:

I conclude this post by presenting a small example.   Using the available modeling tool, you can quickly start creating your own model.  So let’s imagine a DB used to collect any info, statistics, indicators/ metrics that are produced about countries and cities in the world.   As these data would need to be refreshed periodically it is important to capture the history as well as sourcing, format, and other meta-info.

The diagram model (each entity type has dedicated appearance) presented next, gives the overview of the data model.  Red squares are used for representing Anchor, circle for Attribute, rounded square for Knot while grey diamond-shaped represents Tie.   Optionally, we may choose to keep Historisation for Tie or Attribute, in which case we have outline represented as double-line.

General view of a simple AM model



The naming and mnemonic convention is used to automatically provide physical tables names (as well as views and functions).    We can toggle between mnemonic or real name in the diagram.

We see below the two main Anchor: City and Country naturally linked through a Tie, as well as with a few other Anchors tied together.
Model zoom-in around Hubs Country and City


Let’s suppose that during the lifetime of our Data repository, we had decided to keep as natural-key the ISO ALPHA-2 (2-letter code).  This was used as the country look-up with the help of Natural-key View.  However later on, we realised that a  lot of data sources actually used the ISO ALPHA-3 (3-letter code) instead as country identifier.   No problem here, our AM model supports transparently the addition of any sort of Attributes.  The only impact is that we now have two Natural-key View to choose from for the look-up.

We later discovered with disbelief, that our natural-keys are NOT immutable… quite annoying for a database key.   As always, this seemed easy to know in hindsight, but at the time who could have guessed that our ISO standard key will betray us.   Our data model must then allow for natural-keys evolution in time!  Again no problem, our AM model has “Historisation by Design” built-in, and the only impact is the addition of a new field for temporal validity inside the natural Attribute tables.  We would also need a slight change to the "Natural-key View"  to accommodate for the time point entry necessary to manage the new look-up logic.

Important note: had we kept our natural-key(s) along the surrogate, these evolutions would have involved breaking existing elements of our model!  This illustrates the benefit of higher normalisation with AM.

Conversely, from our small business case, we clearly see where the 6th NF could cause us real harm… If we start adding new each indicator as one Attribute around the Country and City (and historized as each can be updated in future), there will quickly be an explosion of tables making the entire data model unsustainable!!

To avoid this pitfall we make use of data model generalization principles and create more abstract entities.  So in place of adding new tables to store values for each new indicator, we turned these tables into rows.   In our simple scenario, this becomes possible by defining new abstract entities :  Indicator and the Indicator-instance.      Each Indicator has a clear identity (Anchor: Indicator) and is described by its name, its definition, its source (i.e. its surrounding Attributes).   The Indicator-value is represented as an anchor (IndInstance), whose identity correspond to the Indicator-value given for a particular country (or city) and at a given year.   One  Attribute holds the exact value while a Tie is needed to tie everything together (the indicator, the country and the year-period).

Model entities for Indicator and their instance values


More meta-info could be used in order to locate the information sourcing on the web, and other info related to format, document type.


Martin


Sunday, March 02, 2014

ETL plugins for Data Vault

I recently finished a project whose goal was to provide an easy way to load data into a Database modelled using the Data Vault approaches.    At the time, I had chosen an open source ETL tools coming called PDI from Pentaho's suites.

Unfortunately, due to some issues there was a shift of strategy and this tool would no longer be needed.  In the sake of sharing code, I decided to upload this work to Git-Hub, so anyone willing to use/share/improve could do it freely.  Please note, that the code works as specified however a lot more realistic performance test and improvement is expected.  I know for a fact that some level of caching is necessary for the plugin to work in a realistic data warehousing environment.  Anyone feeling up to it, feel free to fork it!

Update note:
The whole project called "DV-loaders" with code source is now moved to github ( https://github.com/mart2010/pdi-datavault-plugin ).


The rest of this post gives more detail on the tools.


DV-loaders provide custom Transformation steps for loading Data Vault compliant objects: Hub, Link and Satellite. These are developed using plugin extension capability offered by Pentaho Data Integration tool (PDI, aka Kettle).

With these custom Steps, you can:
  1. Load quickly and easily HubLink and Satellite
  2. Define as many Steps as needed inside the same Transformation
  3. Load mandatory attributes compliant with logical Data Vault rules
  4. Load non-mandatory fields as-is (ex. batch-Id for auditing)
  5. Adjust buffer size to fine-tune performance
As an example, consider this simple DV data model:
Then one can load it with this simple Transformation:


INTRODUCTION


Data Vault is a methodology suited for building an integration data layer of the Enterprise Data Warehouse that relies on a simple and repetitive pattern-based approach.
ETL tool gives you access to any data input format and any backend output out of the box, without re-inventing the wheel.
PDI is an open source ETL tool that can be extended by custom plugin
Data Vault + ETL + PDI = DV-Loader plugin
DV-Loader plugin makes it even easier and more performant to load any Data Vault model from data stream processed by PDI.

Details


Features

The PDI DV-Loader plugin offers :
  1. Simplicity
    • easy configuration setting using a consistent ETL design-model
    • same process usable for full AND incremental loading
    • Load Hub and Link "expose" technical key (PK) to downstream steps (usable for dependent Satellites)
    • ETL data flow structure closely follows the DV data model
  2. Robustness
    • fully re-entrant and re-startable ETL process
    • support duplicated record in input stream
    • support unsorted satellite historical records
    • support "a posteriori" satellite historical catch-up
  3. Perfomance
    • leverage JDBC batch processing
    • query lookup done not on individual key but rather using a set of keys to minimize the number of sluggish DB round-trip
    • batch and key lookup size is defined by the parameter Buffer size
  4. Compliancy
    • load mandatory DV fields using DV logic:
      • technical keys
      • business keys
      • satellite fields
      • temporal fields setting satellite record validity
      • audit mandatory fields
    • load other none-mandatory field as-is (pass-through attribute like batch-Id for meta-audit)

Assumptions

A very small number of assumptions is assumed:
  • Hub and Link must have primary key defined as unique sequence integer (support any DB data type used for storing integer)
  • Temporal validity of the Satellite rows are determined through a single temporal field (Date type) available in the incoming record stream (entrant hop)
  • Business keys must all be Not Nullable

Rules

The ETL flow design must respect these logical rules:
  • Hub can be loaded once its business key(s) are available in input record stream
  • Link can be loaded once all referred Hub primary keys are available in input record stream
  • Satellite can be loaded once referred hub (or Link) primary keys are available in input record stream (normally appended upstream by Load-Hub step)


USER GUIDE

ETL data flow

Data model is loaded by attaching Load Steps through hops :
  1. Load Hub
    • Connect Step through a hop containing all business/natural key(s) in input stream
    • Step will look-up business/natural key(s) and append the associated tech-key (Hub's PK)
      • When found: append the tech-key returned by DB
      • When not found: generate new key (using sequence method defined) and append it
    • Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
  2. Load Link
    • Connect Step through hop containing all tech keys of the relationship
    • Step will look-up Hub tech keys and append the associated Link tech-key (its PK)
      • When found: append the tech-key returned by DB
      • When not found: generate new key (using sequence method defined) and append it
    • Step will "expose" the tech-key in the output stream as <Table-Name.Techkey-Name>
  3. Load Satellite
    • Connect Step through a hop containing the Hub or Link's PK tech-key, the attributes and the "From-Date" temporal attribute controlling satellite record's lifecycle
    • Step will load new satellite record based on different use-case:
      • Temporal Satellite with Idempotent=true: sat record is loaded unless there is an identical consecutive records (default)
      • Temporal Satellite with Idempotent=false: sat record is loaded unless irrespective of consecutive records
      • Static Satellite (no "From-Date" temporal attribute defined): sat record is loaded unless one already exist for the Hub

Setting common to all Step

General Properties:
PropertyDefinition
Step nameName of the Step must be unique within the Transformation
ConnectionDatabase connection to use
Hub/Link/Sat tableTarget table name to load
Buffer sizeThis determines the number of input rows processed at once. Too large value may generate Query lookup or batch insert causing JDBC driver error (typical values are of range 500 or more)
Audit-related Properties:
PropertyDefinition
Sys-creation Date/TimeThe column holding the Timestamp indicating when record was loaded into DB (leave empty when not used
Load Record SourceColumn holding the Audit record source information (leave empty if not used)
Record source valueValue to store in the column "Load Record source" (may be set by a variable substitution)

Setting specific to Step "Load Hub"

PropertyDefinition
Attribute MappingMapping between input stream field and database column
Attribute of type Business/Natural Keyfield(s) corresponding to the business key(s)
Other typepass-through field simply loaded as-is (useful for attribute like batch-id, etc.
PK Sequence-Id settingDefine the technical PK column and which sequence generation method to use

Setting specific to Step "Load Link"

PropertyDefinition
Attribute MappingMapping between input stream field and database column
Attribute of type "Relationship Key"field corresponding to the Relationship keys (i.e. each Hub's Primary key)
Other typepass-through field simply loaded as-is (useful for attribute like batch-id, etc.)
PK Sequence-Id settingDefine the technical PK column of the Link and which sequence generation method to use

Setting specific to Step "Load Sat"

PropertyDefinition
Idempotent transformationIdempotent ignores records having identical state at two consecutive time (all attributes are equal). Data Vault standard is Idempotent, but you may have different requirements
Attribute MappingMapping between input stream field and database column
Attribute of type "Foreign-Key to Hub"field representing the Hub's PK (typically provided by an upstream "Load Hub" step)
Attribute of type "From-Date Temporal"field representing the temporal attribute controlling the timeline of each Satellite record. Using an input field instead of a fixed adds more flexibility: if is appropriate, then we simply append it upstream, but other attribute could also be used (ex. when using file input, when using table input, etc..). You use the temporal attribute most adapted for your use-case
Attribute of type Normalfields recorded inside the Satellite. All these control sat record timeline (or lifecycle). Pass-through fields may only be added if their changing values do not impact Satellite lifecycle with regard to the Hub record (the ETL batch-id attribute is a valid example of this)


INSTALLATION

Pre-requisite

 PDI version 5.x

Download and Install

* Download latest archive/package
* Unzip it inside folder: ${PDI_HOME}/plugins/steps

Check installation

* new folder 'DV-loader' should now exist in: ${PDI_HOME}/plugins/steps
* Re-start PDI Spoon UI (${PDI_HOME}/spoon.sh or ${PDI_HOME}/spoon.bat)
* Create a PDI Transformation and add DV-loader steps found under category Experimental:
  • Load Hub
  • Load Link
  • Load Sat

Martin

Friday, January 10, 2014

What is Anchor Modeling? (part-1)



Anchor Modeling (AM) is another approach well suited for modeling the integrated/consolidated data layer within an enterprise data warehouse.  AM was originally created by Olle Regardt with formalisation done by Lars Rönnbäck.  It offers an open source database modeling technique accompanied by an online modeling tool available through an MIT license.  Most info is maintained around AM home.

The approach has some theoretical background with ideas borrowed from the information modeling world.  More specifically it is inspired by a method called Object Role Modeling (ORM) developed by Dr. Terry Halpin.   ORM, not to be confused with the Object Relation Mapping, is a modeling notation designed to help non-expert doing conceptual modeling of database schema.

The theoretical formalisation has also helped provide a very strict and prescriptive methodology compared to some more ad-hoc approaches used in the world of data warehouse (often heuristic-based developed with time and experience).  

It also offers some implementation ecamples, but at this time, the only complete implementation is ported onto the Microsoft SQL-Server engine.


Comparison to Data Vault

I will not go into details as to what is Anchor Modeling here.   Instead I will describe the similarities it shares with Data Vault (DV),  since both aims to decompose any data model into smaller and more basic constituents.    For those interested in detail description, please refer to this excellent article Anchor modeling - agile information modeling in evolving data environments written by its creators.  


1. Anchor ≈ Hub

  • Both Anchor and Hub store Entities having strong identity.   There is an important difference:  in AM, an Anchor stands completely on its own, i.e. it only stores the surrogate-key (aka technical key) so keep natural-key (aka business-key) in separate Attribute table.   
IMO, this offers additional flexibility especially when natural-key(s) are chosen based on operational systems. These tend to have shorter lifecycle than the EDW whose lifetime should be, at least what EDW team hope for,  infinite!    By keeping natural-key(s) outside the Hub, we avoid making any assumption as to what are the natural-key(s) and their data type(s) in future.    AM propose to use a Natural-key View for handling the loading and look-up logic, and any change in natural keys would involve updating this View and adding the new Attribute table only.

    2. Attribute ≈ Satellite

    • Attribute is similar to the Satellite notion in DV.    Both adds contextual and descriptive data which are inherently time-dependent.   However to preserve the 6NF temporal integrity, AM enforces that each Attribute lands in dedicated table.   DV is more relaxed here and accepts any number of attributes in the same Satellite.  
    AM supports either temporal Attribute (referred to as historised in AM modeling front-end tool), or static and immutable Attribute where no temporal validity period is required.   DV modeling guideline assume historisation by default, and requires to store the full time-segment with two distinct time points: Valid-from and Valid-to.  Again the higher-normalised nature of AM restricts us from storing the redundant Valid-to time point (= subsequent Valid-from time point).

    The multi-attributes in DV imply that we must proceed with row comparison to know which one has changed between active record and previous one.  However, I'd argue that AM rule 1-Attribute = 1-table is too strict, and could be relaxed for some exceptional cases.   One example are attributes having strict functional dependency on a master attribute; think of user-friendly text attribute providing descriptive info for technical short code.  Other examples are for immutable attributes guaranteed to
    remain so in the lifetime of the entity they describe.

    3. Tie ≈ Link

    • Tie is similar to the Link notion in DV whose role is to store the relationship between Anchor/Hubs entities .   However Tie does not support over-hanging Attributes the same way Link have their own Satellites.  This limitation is due to the fact that historisation is built-in for Tie as with Attribute.   Switching-on historisation adds Valid-from time point to the Tie's Primary-key, and consequently any Attributes referring to the Tie would be left dangling (or else requires duplicating the Attribute row in violation of 6NF rule).   
    The drawback of AM follows that for any natural many-to-many associations having inherent attributes will involve the creation of a “tied anchor”.   DV is more flexible here, as it allows attaching Satellites to Link.   In DV time-period is not built-in inside the Link structure, so we are left with constructing the history of the relationship ourselves by adding time segment in external Satellite.  This is required when we want to keep track of time-period validity of any relationship.

    Tie may also have some form of attributes like when we deal with relationship having state/role information. Under this circumstance, one can add a Knot attribute to the Tie, becoming a "Knotted-tie" that can now support historical change of its state/role (the knot carries this state information).   With this scheme, you should have at most one Knotted attribute outside the relationship identifier (or another un-identifying relation in extra inside the Tie) for keeping track of the state or role information of the relationship (or the extra role-key tie). 

    4. Knot ≈ Reference 

    • Knot is similar to the Reference notion in DV.    Both of them hold immutable data of lower cardinality.  They normally represent a set of fixed categories, codes or static attributes and are normalised into separate table to avoid update anomalies and duplicating string values throughout the database.   Main difference is that Reference in DV can store multiple strings whereas again in AM only one string code per table is allowed.    It seems though that Knots are first class citizen, whereas in DV, Reference seems more like an ad-hoc optional structure.   Knots also enhance the semantic of its reference, for ex, an Attribute referring a knot becomes a knotted Attribute, a Tie becomes a knotted Tie, both being either static or historised.  



    A small digression on Time


    Temporal aspect of AM:

    The notion of Time in AM is well framed semantically with clear definition.   It provides support of 3 notions of Time:
    1. Happening time.   Corresponds to when some events/transactions occurred: as represented by specific Attribute(s) attached on Anchor
    2. Changing Time.   Corresponds to when some Attribute/Tie (when historised) are valid: as represented by the built-in “Valid-From” Time point in Attribute/Tie.
      1. Note that no redundant “Valid-To” is added to close the time-interval in order to comply with the 6NF requirement.  This avoids the nasty “update” during data loading process, and the potential risk of recording invalid time period.  But this comes at the price of additional complexity for reporting valid temporal attribute at specific time. 
    3. Recording time.  Corresponds to when the data was recorded into our platform (i.e. loading metadata):  as represented by FK’s  referencing a particular log entry in metadata Audit tables.
    Externalising the “Recording time” has the advantage of separating data from metadata, where all timestamp found in the data structures correspond to functional dates whereas metadata is located outside these data structures. 

    This contrasts with DV where metadata-info sits alongside the data.  Although the metadata timestamp “Load_Start” is not supposed to carry any functional meaning, it is confusing that most Query/Report examples given in DV documentation use this metadata timestamp for returning the valid data at given point in time!   Time when data is loaded are rarely correlated with Time of data validity.  This can happen in some occasion, like we have no other alternatives (see case 3 below).   The way I see things, when we need to determine the right functional “Valid-from”, we can be faced with 3 different situations :
    1. Dates do exist explicitly at source, so we load them as-is (typical of data having specified lifespan with effective/expiration date , etc..)
    2. Dates do not exist explicitly at source but can be deducted from some technical dates t source, so we use these technical dates (typical example is source having LastStatusUpdate meta-info allowing us to know when the data has changed)
    3. Dates do not exist explicitly at source neither can they be deducted from other technical dates, so that one is left with the only option of using the Loading time
    Ideally, situation 3 is more the exception than rules in most EDW implementation.


     

    Main advantages

    Most advantages of the AM approach results from its highly normalised structure.  This offers benefits such as:
      • Simpler Data loading involves cheap “insert” and no expensive “update/delete”
        • Fault recovery is more easily done since all rows data have a unique batch load traceability and can be deleted on exceptional situation (when erroneous)
        • Strict "No Null" policy :   null values never appear, thus eliminating difficulties like interpretability, indexing, and other issues.
        • Non-persistent elements never removed but flagged with Knot indicating their state of persistence
        • Maintenance is overall improved and simplified
      • Data model evolution only involves extension of previous version
        • New content always involve incremental addition of either Anchor / Tie / Attribute / Knot structures 
        • These New data structures may, sometimes, render some existing structure obsolete, then we can simply leave them as-is with no onward refresh.
      • Potential Higher Performance (here, it’s more tricky as the higher normalisation can also penalise performance, but the authors have highlighted a few valid points worth mentioning)
        • Narrow Query (hitting small number of attributes) can have very good performance with the aid of Table Elimination
        • Storage is highly efficient since practically no data redundancy exist, only a small number of indexes is actually needed due to the use of clustered table (aka index-organised table in different DB vendor) and also narrowness of table.   On this last point, it'd be nice to see how column-store architecture could be leveraged to exploit the AM unique schema.
        • Data loading practically free from locking and other concurrency issues as only “insert” are processed.
      • Reduce translation logic between different representation:  the unique AM graphical representation is used for both conceptual and logical modeling
        • Furthermore, the physical representation maps directly model entities onto tables, simplifying modeling abstraction levels or even eliminates the need for translation logic. 


    Martin 

    Tuesday, August 06, 2013

    BI “ideal” platform


    When dealing with BI projects, you are much more likely to work under the constraint of existing platform than building one from scratch.   This means having to deal with existing idiosyncrasies, sub-optimal architecture, complex and big data models, tangled ETL loading dependencies and scheduling, confusing or mixing different approaches and architecture, etc.

    This post presents my “ideal” platform blueprint I would use when no existing platform is in place which compromises good design principles.  This ideal BI platform leverages the Multilayered architecture's concept borrowed from the more general Design Pattern in software design engineering.  The advantage of layering is obvious enough:
      1. Visibility:  any layer can only depend on the immediate preceding layer
      2. Functional role:  each layer focus on clear and none contradicting roles
      3. Volatility:
        • Delivery (highest) layer are made up of components that are dependent only on end-user BR changes (i.e. new “business” rule should only affect Delivery layer)
        • Raw Consolidated (lowest) layer are made up of components that change according to technical/system changes happening in operational world
        • Business (middle) layer isolate lowest and highest layers from each other and from their life changing cycle (here the Business-oriented data of the Consolidation layer is the best place to design this in-between layer)
    All layers control and isolate data dependencies :  each layer is only sourced (or dependent) from its direct layer underneath, and layer only expose data to layer above.

    The following diagram depicts an high-level view of this idealistic  BI platform:
     Archi_BI_mod


     Obviously, this diagram’s focus is on simplification, detail of less important are not represented; ex. possible staging areas in between each layer, metadata layer (for ETL business rules, KPI and Report definition, ..),  delivery layer usually has an eclectic choice of architecture based on technologies and tools.


    Source Layer

    Layer Definition
    This layer is simply all systems collecting and generating data.   Not much can be said here as this layer does not fall under the BI group’s responsibility.   However one design criteria note worthy concerns its physical implementation and is meant to avoid disrupting the on-going operational process :
      • We should never target the real “Production” system, but Leverage whenever possible some data replication/copy mechanisms (hardware or software -based)
      • Persisted Staging Area or ODS are different ways for BI to get their own our copy of source data with minimal impact on source


    Integration/Consolidated Layer

    Layer Definition
    This layer is the one closest to the original Inmon’s def: “subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions”.  Some may refer to it using different name like Integrated Layer,  but the fundamental idea remains the same: consolidate and store multi-source data in one database repository.  

    All data generated by the organisation’s activity and business processes (automatic, semi or even manual) are conveniently centralised here.  We can further distinguish two sub-layers :
    1. “Raw” layer which captures all data as-is to ensure full data lineage and traceability
    2. “Business” oriented layer which adds business-provided data (reference,  grouping, hierarchical structures..), and provides some early data transformation for reducing downstream transformation complexity for the Delivery layer.

    Functional Role of Raw Integration-Layer:

    • Provide the first integration and consolidation point of multi-source data
    • Capture and archive all data as-is (no modification nor transformation) : Load 100% of data 100% of the time
    • Provide the full historical view of the detailed data:  “System of records”
    • Identify and load master data source for important and recognized business entity
    • Attach “platform-neutral” Surrogate keys to all natural keys of business entity
    • Align entity across different source
    • Standardise through an easy and repeated data modelling patterns all data

    Functional Role of Business Integration-Layer:

    • Data structure designed to apply business rules transformation
    • Bridge and link non-conformed entities across data sources
    • Provide Grouping, Hierarchies, and other business-provided data
    • Create structure to decrease the Delivery layer transformations complexity
    • Fragment complex transformations rules into smaller, more manageable  and auditable steps
    • Recognise and flag conformed Entity across different data source
    • Flag “erroneous” or “wrong” data  as such (rejecting data to be done in Delivery layer)

     

    Layer-Design Criteria (raw and business)

    • Use a uniform Data modelling technique
      • Do not merely copy and add timestamp/audit data to source OLTP Data models
      • Use a standardized approach and stick to it for every data source integrated
      • Avoid data model diversities and idiosyncrasies typically found in source layer (harder to learn and more costly to absorb)
    • Use a Highly Normalised data structure*
      • Methodology like Data Vault or other similar approaches Anchor-modeling
      • Discourage usage of large data structure (these structures result in important downstream dependency, rigid relationships, and are associated with unstable and hard to maintain ETL logic)
    • Favor Easy Data Loading over Easy Data Reading (compromise to pay)
      • Must be Easy and Quick to load new data structure
      • May be Hard and Messy to query existing data structure
    • Allow for growing organically/incrementally
      • Start small with one source and a few data components
      • Build small incremental step gradually as you ingest more data source
    • Accept gracefully system source changes
      • At best, every change should simply “add” on the existing (no dependencies downstream  impacted)
      • At worst, every change should only affect a small number of “localized” data structures (very limited dependencies impacted)
    • Accept gracefully addition of new source
      • Each new source is integrated independently
      • “Bridge” these independent “universe” (data source) through easy flexible model structure, like the concept of Link in Data Vault.
    • Support full traceability and auditability of source data
      • Don’t “fix” and don’t do any data cleansing
      • Leave “erroneous” or “wrong” data AS-IS  (these come for a reason)
    • Favor understandability through simple and repeatable data model patterns
      • Both raw and business layer must leverage identical data modeling techniques and standard
      • Data model easily recognized and understood and by “unaware” (new) eyes


    *Highly normalized structure :
    To keep track of a temporal database using database normalization, it is well accepted that a normalization less than 6NF has shortcomings (see 6NF Conceptual Models and DW 2.0).  The 6NF form ensures that every non-key gets stored into its dedicated Relation.   In this purest form, we’ll end up with all Relations having one Key plus one non-key attribute, hence making the Relation irreducible  (keeping track of history is simply managed by adding time semantics).  

    Practical approaches normally accept some kind of compromise to this purest form (eg. DV would combine multiple attributes within a single Satellite).   But they still benefit from major advantages of higher normalization:

    +Model much more Resilient to changes in the outside world of the EDW (able to non-destructively evolve the model)
    +Model are Historization-Ready making all attributes potentially temporal (possible by design)
    +Model limits data redundancy even for temporal and time-variant data
    +Model can support more performant Zero Update strategy
    +Model results in the elimination of NULL and support early arriving facts (entity loaded only depend on an Identity

     


    Delivery or Presentation Layer

    Layer Definition
    This layer provides the Business view and/or Departmental view (different version of the truth) of the data.   It is accompanied with Reporting tool and any Data Analytics platform used by business people inside the organisation.

    Delivery-Layer –Functional Role

    • Support complex business rules implementation
    • Provide flexibility to change data transformation rules
    • Offer a sensible view to the business community of all organization’s operations and processes
    • Offer a cleansed and standardized view of the often messy operational data
    • Provide support decision-making tools and platforms
    • Provides data to allow for more advanced analytical and mining activities

    Delivery Layer -Design Criteria

    • Favor simple and easily communicable data model
      • Employ recognized Methodology such as dimensional and star-schema
    • Favor easy Data Reading over hard Data Loading (compromise to pay)
      • Must be Fast and Easy to query data
      • May be Hard and Messy to load data
    • Accept gracefully new Hierarchies, Grouping or other Business provided-info
    • Fully support multiple “version” of the data (departmental or organizational view )
    • Accept gracefully business rules changes (can be frequent)
      • At best, every business change should simply mean to re-construct the Virtualized layer (no re-loading necessary)
      • At worst, every business change should only affect very “localized” data loading jobs
    • Favor Communication of data model tied to Business terms and meaning
      • Data model easily understood and accepted and by “new” business eyes
    • Implementation more dependent on Vendor tools:  follows platform features, choice and characteristics


    Martin

    Friday, March 01, 2013

    Data Vault model: Mobile Telecom example (part-3)

    Here's my last note concerning "Data Vault model: Mobile telecom example" where some points and issues are discussed.


    Open Points and Discussion

    →Business Key inconsistencies
    Among all OLTPs used as source, it is very unlikely that they will all share exact same Key(s) to represent same business entity !   If we are lucky, keys will vary in format only, but more frequently, keys can be totally different.

    This challenge is referred in the DV community as the "Business Key Alignment", where raw keys coming from OLTPs must somehow be integrated (and thus aligned) toward the universal EWBK.
    Next point discusses these points for the most important Key of all Telco:  Subscriber Business Key.

    →Subscriber identification
    In telco, any subscriber is uniquely identifiable through its phone number (MSISDN) but only at given point in time (subscribers can change phone number!).  This makes MSISDN only a valid temporal-based Key.  We are normally more interested in tracking people than number,  then we certainly need an alternative business key.   Billing systems already accounted for these changes of numbers, so these should be used as the Master Data Source for our Subscriber Hub’s Business key.   Other transactional systems in telco, will often make use of temporal-based keys, that is fine in for transaction-based context that always refer to a point in time.   Another example of this is the IMSI which is the unique subscriber number soft-wired at the SIM card level, and somewhat more stable than MSISDN (still an issue when subscriber loses his SIM).

    These points suggest important properties needed for Hub Business Key:
    • Business Key should be immutable
    • Business Key should come from one Master Data Source
    • Business Key may not always be well recognized by business users, especially not at the sacrifice of immutability.
    So here, we decide to choose as Business Key, the Subscriber-Id having these properties, and maintained in our Telco Billing platform (Master data source).

    Now, how should we model the other mutable Keys?  They will likely be needed while we integrate other OLTPs platforms which rely on them.   To be able to map transactions, links coming from these others platforms, we need to record the state of the subscriber mutable keys (ex. MSISDN) according to the chosen Hub key.   As such, we propose to hook them up through the use of dedicated Satellite giving their key values at any point in time (given a Business Key).
    Fig Data model around the Subscriber Hub


    →Satellite Design Criteria
    When it is time to model Satellites around each Hub we have quite a lot of freedom.   We can think of a spectrum modelling style where we have at one extreme One Satellite per Hub, and at the other extreme One Satellite per Attribute (from completely denormalized to highly normalized).

    We can consider these criteria for modelling guidance :
    • Rate of Change (group attributes of similar life-cycle)
    • Types of Data (group attributes of similar nature)
    • Record source system (group attributes by their source) &nbsp

    However, I would add another important criteria: Clarity and Importance.  It is well accepted that some attributes play a much more crucial role than others.  Take for examples, Subscriber which are commonly used to derive important downstream semantics, like business logic, segmentation, grouping, etc (compare these with Attributes such as Address-data like Street Door-no).

    So for these crucial Attributes, we model them in their own separate Table so that we can clearly spot them while browsing our DB schema metadata (see ex. SAT_SUB_STATUS in figure 2).  And more importantly, mixing attributes will lead to additional records (due to historical change tracking).   Our satellite will time slice against attributes of interested, and thus optimized for faster query.


    →Call or Network Event
    How should we model CDR’s data?   Two issues arise when one wants to model these :

    1. Business keys that can possibly identify each record not straightforward (complex and also prone to changes in time)
    2. High data volumetry and burst rate may impose additional constraint on modelling

    Can we classify a Call (Network event) as being a recognized business entity?  It has clearly some characteristics, however it cannot easily be identified through limited business-keys.   One could imagine that the following key combinations:  1) Calling MSISDN, 2) Called MSISDN and 3) Event_start_date.

    These seem to satisfy the uniqueness criteria (at least for scenarios such as receiving voice mail at same time as making another phone call, etc...), but things get much more complicated now with the introduction of smart phone and their multitude protocol and access mode (HTTP, email, VoIP, VPN, ftp, Peer-to-Peer, TV streaming...).   So to me, the Network Event cannot be modelled as a Hub, but rather as a pure Link.

    Now, concerning data volumetry & burst rate.  If we model these as Link and normalize all contextual data into numerous Satellites, we end up with many join operating on extremely large tables !  Furthermore, we can’t even take advantage of star join structure (with much smaller dimension tables having bitmap indexes surrounding the big fact table), as all satellites have the exact same size as the Links tables.  In CDR’s context data are merely static, so there should not be any need to design Satellite to keep track of change in time.

     SO here I'd argue we can loosen Link modeling principle and design what is called a Transaction Link, I think it makes perfect sense for our CDR’s scenario, as illustrated below:

    Fig Network Events are modelled as a Transaction Link 
    Also worth mentioning is that most relationships are optionals, this is typically true when integrating multiple source systems (rated calls from billing, unrated from mediation, etc.).  Depending on the event context, and there may or may not be a Rating_plan (unrated calls like incoming have none), Cell-id, Called/Callee are both required only for more traditional calls/sms, Cell-Id (caller may be roaming outside telco network), etc.    I’m not sure we should model these missing relationships as done with dimensional modelling (specific exception records exist in dimensions to avoid the missing Null FK’s).    Here, I’d recommend to simply leave the Null FK which reflect the way things are recorded at source system.


    →Relationship with lifespan (temporal-variant)

    DV models Link as (timeless) existence of a relationship, hence they record any association as seen from source, similarly as we do with Hub key.    However, some relationships are meant to have limited lifespan scope.   It then becomes critical to keep track of this lifetime as it validates the time period the relationship holds true.  Not much is said about these.... could however find the following (from this cheat sheet online resource):
    "We can end-date links using an ‘exists’-attribute in a satellite"

    Therefore, we propose to create a Validity lifespan Satellite data around the Link entity to keep track of time period.   This is similar to contextual data around Hub which are always “ time-based”, except now these time-based is no longer metadata but rather functional, i.e. correspond to contextual data of the Links (in fact for this particular Link, no info is conveyed except the relationship itself and its time period validity).   In the Validity Satellite, the attribute Valid-from and Valid-to must correspond to time period, however this rule is not enforced here in case the Billing source-system its business rule in future (subs only has one Rating Plan at any time):

    Fig Link between Subs and RatingPlan with limited timespan

    →New Hub or Satellite attribute 

    What justify promoting a single Attribute into a recognized and separate Hub entity?  In theory, we learnt that Hub must be an entity recognized in the organization with clear business key(s).  However, there are quite a few attributes that have these characteristics: Retail plan, Account type, Promotion, Campaign, etc.... identifiable code associated to them).  So I believe we need clearer Design criteria providing guidance to determine what can remain simple Attribute vs what needs to be promoted as a new Hub.  I propose these simple criteria:

    1. Pervasiness: Does the Attribute appear in many different source systems/contexts. If yes, then consider the Hub so that we minimize the negative impact of having non-standardized and non-conformed codes spread throughout the EDW data model. 
    2. Relationship: Can the Attribute be used in other relationship than its unique associated Hub. If yes, then consider the Hub so that we can capture these relationship and association elsewhere through the standard way : Links. 
    3. Evolution: Maybe, a data source newly integrated includes an important business concept still only used, for now, in isolation. If yes, consider the hub as newly fashionable concepts tend to proliferate quite rapidly these days...


    Regarding these new design criteria, it is easy to see why an attribute like Retail-Plan (the identifiable code), was modelled as a Hub, since it is frequently used in many contexts (relate to Subscriber, relate to Campaign, can be found in other system, etc.).  However, one could state that Address have same characteristics, so that it could as well be a Hub.... yes but unfortunately it rarely meets the 1st Hub design criteria: unique and easily identifiable Business key (unless your a shop specializing with addresses, do you have available in your system a clear and unique Business key for Addresses, most likely not?).

    Martin

    Thursday, February 28, 2013

    Data Vault model: Mobile Telecom example (part-2)

    This is the next part from a previous notes where I introduced "Data Vault model: Mobile Telecom example".  In this note, I discuss the model following each step of the methodology.  

    Business Entities

    Let’s start first with the identification of some of our key Business entities:

    1. Customer

    • Definition: party that gets billed for the service/product offered by the Telecom
    • Characteristics
      • Business Key: BAN (billing account number)
      • History Evolution: quite stable as one BAN cannot change for the same account however, one subscriber can have different BAN when migrating account type (e.g. prepay to postpay).
      • Complexity: recursive relationship between BANs often needed to capture BAN hierarchies (BAN → parent BANs → root BAN) for large B2B customers.

    2. Subscriber

    • Definition: individual attached to a particular subscription, i.e. MSISDN (mobile number that provides telecomunnication services).
    • Characteristics
      • Business Key: MSISDN, the telephone number given to the individual.
      • Key alternative: this Business key is not stable, as same subscriber typically changes MSISDN, so Billing systems usually provides an alternative key.  This alternate subscriber-ID must be used instead.
      • Complexity: Not all OLTPs have same identifiable key, any systems not so concerned with history usually tracks the subscriber with its MSISDN at given Timestamp, whereas others come up with more stable key like Subscriber-ID or technical key attached to the SIM card like IMSI.

    3. Product

    • Definition:  In telco, the product does not correspond to your typical and tangible goods but rather associated to some networking features leveraged by your handset (e.g. SMS, MMS, call, voice mail, highspeed data, etc..).
    • Characteristics
      • Business Key: ProductCode (usually short text not so well standardized throughout the organization).
      • Complexity: Conformance and data integrity problems as each system may use its own, and usually involves some form of hierarchical representation.

    4. Service

    • Definition:  Service is usually tightly coupled with the notion of product.  Subscribers purchase service (either inherited through some rating-plan or purchased individually) that include specific Product along with pricing and detail features.  (e.g. SMS option, Voice mail international, Local calls special rating, unlimited data volume, free calls number, etc..).
    • Characteristics
      • Business Key ServiceCode (again short text not so well standardized but mostly used by the Billing system).
      • Complexity: Usually involved some grouping or combination of more detail features related to network and Intelligent network layer (IN), and also linked to rating information used for billing purposes.

    5. Handset

    • Definition:  Device through which the subscriber access network product.
    • Characteristics
      • Business Key TAC (standard code describing handset model manufacturer etc.).  Not to be confused with IMEI which uniquely identifies each physical device (handset id).

    6. Invoice

    • Definition:  Bill statement sent regularly to the Customer for related subscription.
    • Characteristics
      • Business Key Invoice number (used mostly by billing system).
      • Complexity: typically applicable for a subset of the entire customer population (postpay), as prepay is billed in realtime according to usage
    There are obviously many more entities that quickly come to mind (e.g. Sales, Call-Center, Agent, Contact, Campaign, Port-in/out, Visits, etc..), but let’s stop here for the sake of simplicity.

      Business Relationship Entities

      Again, let’s give out a few examples of relationships that should cover some different modeling scenario.  Telco are very complex organization with a large number of heterogenous systems, so these example are clearly over-simplified:

      1. Customer hierarchy

      • Definition: Large B2B customers have usually organizational hierarchy struture for managing employee payment responsibilities and complexity
      • Characteristics
        • Data Model name: LNK_CUSTOMER_HIER
        • Natural recursive relationship: Each Account is associated to a direct parent Account (a single node) until we reach the top account or Root BAN.  (BAN → parent BANs → .. → root BAN).
        • Links modelling: these recursive relationship are best modelled with a parent_key FK’s pointing back on same table.  However in DV, linkage structure stands on its own, so these Links must be stored separately.  The Child_key acts like the driving key of the relationship (is used to determine the end-dating of the link).

      2. Subscriber RatingPlan

      • Definition: The subscription is associated to some form of rating plan.
      • Characteristics
        • Data Model name: LNK_SUBS_RPLAN
        • Natural relationship: Any subscriber must have one and only one RatingPlan at any point in time.  This plan specifies the exact rating of all billable communication items.
        • Links modelling:  this is modelled using a typical Link structure between Subscriber & RatingPlan accompanied with lifespan From/To Date in Satellite (see discussion “Relationship of limited lifespan” ).

      3. Billing Invoice statement & Charge item

      • Definition:   Billing statement is invoiced to each customer account at each end of bill cycle.
      • Characteristics
        • Data Model name: LNK_INVOICE & LNK_INVOICE_RECHARGE
        • Natural relationship:  The Link relates Bill Statement with the customer, and also with the 1-to-many Subscribers attached to this Customer account.    There is also a Link which itemizes each charge by Product (and depending on RatingPlan) associated to one Invoice (LNK_INVOICE).   Things may vary here, but let’s keep it simple..
        • Links modelling:  The relationship are regular Link between Customer, Bill-Statement and Subscriber for the Invoice.  And for Invoice-charge, we tie it to the Product (item being billed).

      4. Recharge Activity

      • Definition:  Prepay subscriber must recharge (top-up) their balance as they go.
      • Characteristics
        • Data Model name: LNK_RECHARGE
        • Natural relationship:  This recharge activity is associated to various channel and payment method.
        • Links modelling:  A simple Link structure with FKs referring to Subscriber and Dealers (i.e. the channel through which the subscriber purchased new credit).

      5. Network Event (calls, sms, voicemail, internet)

      • Definition:  Transactions corresponding to network events which are consumed by the telco subscriber.
      • Characteristics
        • Data Model name: LNK_NETWORK_EVENT
        • Natural relationship Depending on the communication product (voice calls, sms, mms, gprs data, umts data, etc..), various entities are involved and can be linked.

        Contextual and Descriptive Data

        These are meant to convey the time-variant and historical part of each attribute captured by the EDW.   Modelling these as Satellites, we allow for timeline slicing and taking snapshot of how things (i.e. business state) were at that time.   The main advantage of Satellite, is we usually don't know in advance all features/attributes of interest needed... and even if you knew, things evolve at such pace that most likely you'll need to revisit them soon.    With Satellites, you can build them incrementally, bit by bit according to business needs and scope changes without affecting existing structure, as long as you keep them normalized and fine-grained.   No longer needed Satellite can simply stop being loaded, while new one get created.

        As my goal is not to capture a complete Mobile EDW,  I've only included a few Satellite here for
        illustration purposes.    These would be typical features/attributes associated to Hub/Link entities.  They are not meant to be exhaustive but rather convey the meaning of the different entities.

        The rest of the note is presented in the part-3, where some details and highlights are presented related to design decision and open issues that got raised during the modelling.

        See last part here: Data Vault model: Mobile Telecom example (part-3)


        Martin