Tuesday, August 06, 2013

BI “ideal” platform


When dealing with BI projects, you are much more likely to work under the constraints of an existing platform than to build one from scratch.   This means having to deal with existing idiosyncrasies, sub-optimal architecture, complex and big data models, tangled ETL loading dependencies and scheduling, confusing mixes of different approaches and architectures, etc.

This post presents the “ideal” platform blueprint I would use when no existing platform is in place to compromise good design principles.  This ideal BI platform leverages the multilayered architecture concept borrowed from the more general design patterns of software engineering.  The advantages of layering are clear enough:
    1. Visibility:  any layer can only depend on the layer immediately preceding it
    2. Functional role:  each layer focuses on clear, non-contradicting roles
    3. Volatility:
      • The Delivery (highest) layer is made up of components that depend only on end-user business-rule changes (i.e. a new “business” rule should only affect the Delivery layer)
      • The Raw Consolidated (lowest) layer is made up of components that change according to technical/system changes happening in the operational world
      • The Business (middle) layer isolates the lowest and highest layers from each other and from each other's change cycles (the business-oriented data of the Consolidation layer is the best place to design this in-between layer)
All layers control and isolate data dependencies:  each layer is sourced only from the layer directly underneath it, and exposes data only to the layer above, as the sketch below illustrates.
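
To make the visibility rule concrete, here is a minimal sketch (in Python, with hypothetical component and layer names) of how one could verify that every ETL component depends only on the layer immediately beneath it:

```python
# Minimal sketch: enforcing the visibility rule that each layer may only
# depend on the layer immediately beneath it. All names are hypothetical.
LAYER_ORDER = ["source", "raw_consolidated", "business", "delivery"]

# Hypothetical ETL components mapped to (their layer, the layers they read from)
components = {
    "load_raw_orders":   ("raw_consolidated", {"source"}),
    "conform_customers": ("business",         {"raw_consolidated"}),
    "build_sales_mart":  ("delivery",         {"business"}),
    "bad_shortcut_job":  ("delivery",         {"source"}),  # violates the rule
}

def check_visibility(components):
    """Report every component that reads from anything other than
    the layer immediately preceding its own."""
    violations = []
    for name, (layer, reads_from) in components.items():
        allowed = {LAYER_ORDER[LAYER_ORDER.index(layer) - 1]}
        if not reads_from <= allowed:
            violations.append((name, reads_from - allowed))
    return violations

for name, bad in check_visibility(components):
    print(f"{name} illegally depends on: {bad}")
# -> bad_shortcut_job illegally depends on: {'source'}
```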

The following diagram depicts a high-level view of this idealized BI platform:
 [Diagram: Archi_BI_mod — high-level view of the layered BI platform]


 Obviously, this diagram focuses on simplification; less important details are not represented, e.g. possible staging areas in between each layer, the metadata layer (for ETL business rules, KPI and report definitions, etc.), and the fact that the Delivery layer usually has an eclectic mix of architectures based on the technologies and tools in use.


Source Layer

Layer Definition
This layer is simply all the systems collecting and generating data.   Not much can be said here, as this layer does not fall under the BI group’s responsibility.   However, one noteworthy design criterion concerns its physical implementation and is meant to avoid disrupting ongoing operational processes:
    • We should never target the real “Production” system directly, but leverage whenever possible some data replication/copy mechanism (hardware- or software-based)
    • A Persisted Staging Area or an ODS are different ways for BI to get its own copy of the source data with minimal impact on the source, as sketched below
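
As an illustration of that last point, a minimal sketch, assuming a read replica and hypothetical table names, of an extract that lands source data as-is in a persisted staging area with basic audit columns:

```python
# Minimal sketch of a persisted staging load: read from a replica, never
# from production, and keep every extract with audit columns.
# Table names are hypothetical; stg_<table> is assumed to already exist
# with the source columns plus batch_id and loaded_at.
import sqlite3  # stand-in for any DB-API driver
from datetime import datetime, timezone

def stage_table(replica_conn, staging_conn, table, batch_id):
    """Copy a source table as-is from the read replica into the
    persisted staging area, tagging rows with load metadata."""
    cur = replica_conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    loaded_at = datetime.now(timezone.utc).isoformat()
    placeholders = ", ".join(["?"] * (len(cols) + 2))
    staging_conn.executemany(
        f"INSERT INTO stg_{table} ({', '.join(cols)}, batch_id, loaded_at) "
        f"VALUES ({placeholders})",
        [(*row, batch_id, loaded_at) for row in rows],
    )
    staging_conn.commit()
```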


Integration/Consolidated Layer

Layer Definition
This layer is the one closest to Inmon’s original definition: a “subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions”.  Some may refer to it using a different name, like Integrated Layer, but the fundamental idea remains the same: consolidate and store multi-source data in one database repository.

All data generated by the organisation’s activity and business processes (automatic, semi-automatic or even manual) is conveniently centralised here.  We can further distinguish two sub-layers:
  1. A “Raw” layer, which captures all data as-is to ensure full data lineage and traceability
  2. A “Business”-oriented layer, which adds business-provided data (reference data, groupings, hierarchical structures, etc.) and applies some early data transformations to reduce downstream transformation complexity in the Delivery layer.

Functional Role of the Raw Integration Layer:

  • Provide the first integration and consolidation point of multi-source data
  • Capture and archive all data as-is (no modification nor transformation): load 100% of the data 100% of the time
  • Provide the full historical view of the detailed data: the “system of record”
  • Identify and load master data sources for important and recognized business entities
  • Attach “platform-neutral” surrogate keys to all natural keys of business entities (see the sketch after this list)
  • Align entities across different sources
  • Standardise all data through easy and repeatable data modelling patterns
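
For the surrogate-key point above, a minimal sketch of one common approach: deterministic hash keys derived from the normalized natural key (the style popularized by Data Vault 2.0). Entity names are hypothetical:

```python
# Minimal sketch: deterministic, platform-neutral surrogate keys derived
# from natural business keys (the hash-key approach used by Data Vault 2.0).
import hashlib

def surrogate_key(*natural_key_parts: str) -> str:
    """Hash the normalized natural key so the same business entity gets
    the same key on any platform, with no central sequence generator."""
    normalized = "||".join(p.strip().upper() for p in natural_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same customer arriving from two sources aligns on one key:
print(surrogate_key("ACME Corp"))    # e.g. from the CRM
print(surrogate_key(" acme corp "))  # e.g. from billing; same hash
```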

Functional Role of the Business Integration Layer:

  • Data structures designed to apply business-rule transformations
  • Bridge and link non-conformed entities across data sources
  • Provide groupings, hierarchies, and other business-provided data
  • Create structures that decrease the transformation complexity of the Delivery layer
  • Fragment complex transformation rules into smaller, more manageable and auditable steps
  • Recognise and flag conformed entities across different data sources
  • Flag “erroneous” or “wrong” data as such (rejecting data is to be done in the Delivery layer), as sketched below
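
For that last point, a minimal sketch, with hypothetical field names and a hypothetical validation rule, of flagging suspect rows instead of rejecting them:

```python
# Minimal sketch: flag "erroneous" data in the Business layer instead of
# rejecting it; rejection (filtering) is deferred to the Delivery layer.
# Field names and the validation rule are hypothetical.

def flag_rows(rows):
    """Attach a quality flag to each row; never drop a row here."""
    for row in rows:
        amount = row.get("order_amount")
        row["quality_flag"] = (
            "OK" if amount is not None and amount >= 0
            else "SUSPECT_AMOUNT"
        )
    return rows

rows = flag_rows([
    {"order_id": 1, "order_amount": 100.0},
    {"order_id": 2, "order_amount": -5.0},  # kept, but flagged
    {"order_id": 3, "order_amount": None},  # kept, but flagged
])
# The Delivery layer can now decide to filter on quality_flag = 'OK'.
```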

 

Layer Design Criteria (raw and business)

  • Use a uniform data modelling technique
    • Do not merely copy source OLTP data models and add timestamp/audit data
    • Use a standardized approach and stick to it for every data source integrated
    • Avoid the data model diversity and idiosyncrasies typically found in the source layer (harder to learn and more costly to absorb)
  • Use a highly normalised data structure*
    • Methodologies like Data Vault or similar approaches such as Anchor modeling
    • Discourage the usage of large data structures (these structures result in important downstream dependencies, rigid relationships, and unstable, hard-to-maintain ETL logic)
  • Favor easy data loading over easy data reading (a compromise to accept)
    • Must be easy and quick to load new data structures
    • May be hard and messy to query existing data structures
  • Allow for growing organically/incrementally
    • Start small with one source and a few data components
    • Build in small incremental steps as you ingest more data sources
  • Accept source system changes gracefully
    • At best, every change should simply “add” onto the existing model (no downstream dependencies impacted)
    • At worst, every change should only affect a small number of “localized” data structures (very limited dependencies impacted)
  • Accept the addition of new sources gracefully
    • Each new source is integrated independently
    • “Bridge” these independent “universes” (data sources) through easy, flexible model structures, like the concept of a Link in Data Vault (sketched after this list)
  • Support full traceability and auditability of source data
    • Don’t “fix” anything, and don’t do any data cleansing
    • Leave “erroneous” or “wrong” data as-is (it comes for a reason)
  • Favor understandability through simple and repeatable data model patterns
    • Both raw and business layers must leverage identical data modeling techniques and standards
    • The data model should be easily recognized and understood by “unaware” (new) eyes
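
To illustrate the repeatable modelling pattern these criteria call for, here is a minimal Data Vault-style sketch (Hub, Satellite and Link); the DDL is generic SQL held in Python strings, and all table and column names are illustrative:

```python
# Minimal sketch of the repeatable Data Vault pattern: one Hub per business
# entity, Satellites for descriptive history, Links to bridge sources.
HUB_CUSTOMER = """
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,   -- surrogate hash key
    customer_bk   VARCHAR(50) NOT NULL,   -- natural/business key
    record_source VARCHAR(50) NOT NULL,
    load_dts      TIMESTAMP   NOT NULL
);"""

SAT_CUSTOMER = """
CREATE TABLE sat_customer_crm (
    customer_hk   CHAR(32)  NOT NULL REFERENCES hub_customer,
    load_dts      TIMESTAMP NOT NULL,     -- time semantics for history
    name          VARCHAR(100),
    segment       VARCHAR(20),
    PRIMARY KEY (customer_hk, load_dts)
);"""

LINK_CUSTOMER_ORDER = """
CREATE TABLE link_customer_order (
    link_hk       CHAR(32) PRIMARY KEY,
    customer_hk   CHAR(32) NOT NULL REFERENCES hub_customer,
    order_hk      CHAR(32) NOT NULL,      -- references a hub_order
    record_source VARCHAR(50) NOT NULL,
    load_dts      TIMESTAMP  NOT NULL
);"""
# A new source only adds new Satellites (and Links); Hubs stay untouched,
# which is what lets each source be integrated independently.
```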


*Highly normalized structure:
To keep track of a temporal database using database normalization, it is well accepted that anything less than 6NF has shortcomings (see 6NF Conceptual Models and DW 2.0).  The 6NF form ensures that every non-key attribute gets stored in its own dedicated relation.   In this purest form, we end up with all relations having one key plus one non-key attribute, making each relation irreducible (keeping track of history is then simply managed by adding time semantics).

Practical approaches normally accept some compromise to this purest form (e.g. Data Vault combines multiple attributes within a single Satellite).   But they still retain the major advantages of higher normalization (a sketch follows the list below):

+ The model is much more resilient to changes in the world outside the EDW (it can evolve non-destructively)
+ The model is historization-ready, making all attributes potentially temporal (by design)
+ The model limits data redundancy even for temporal and time-variant data
+ The model can support a more performant zero-update strategy
+ The model eliminates NULLs and supports early-arriving facts (loading an entity depends only on its identity)
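
To make the contrast concrete, a minimal sketch, with illustrative names and data, of a wide relation decomposed into 6NF-style irreducible relations with time semantics:

```python
# Minimal sketch: a wide 3NF row decomposed into 6NF-style irreducible
# relations (one key + one attribute), each historized independently.
# Names and data are illustrative.

# 3NF: one relation, all attributes together -- updating "segment"
# forces a new version of the entire row.
customer_3nf = {("C42",): {"name": "ACME", "segment": "SMB", "country": "CA"}}

# 6NF: one relation per non-key attribute; history is just an extra time
# dimension on each key, so attributes evolve independently (zero update:
# a change appends a new interval instead of overwriting).
customer_name    = {("C42", "2013-01-01"): "ACME"}
customer_segment = {("C42", "2013-01-01"): "SMB",
                    ("C42", "2013-07-01"): "ENTERPRISE"}  # only this changed
customer_country = {("C42", "2013-01-01"): "CA"}

# An unknown country is simply an absent row -- no NULL needed.
```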

 


Delivery or Presentation Layer

Layer Definition
This layer provides the business view and/or departmental views (different versions of the truth) of the data.   It is accompanied by the reporting tools and any data analytics platforms used by business people inside the organisation.

Functional Role of the Delivery Layer:

  • Support complex business rules implementation
  • Provide flexibility to change data transformation rules
  • Offer the business community a sensible view of all the organization’s operations and processes
  • Offer a cleansed and standardized view of the often messy operational data
  • Provide support for decision-making tools and platforms
  • Provide data to allow for more advanced analytical and mining activities

Delivery Layer Design Criteria

  • Favor simple and easily communicable data models
    • Employ recognized methodologies such as dimensional modeling and star schemas
  • Favor easy data reading over easy data loading (a compromise to accept)
    • Must be fast and easy to query the data
    • May be hard and messy to load the data
  • Accept new hierarchies, groupings or other business-provided info gracefully
  • Fully support multiple “versions” of the data (departmental or organizational views)
  • Accept business-rule changes gracefully (they can be frequent)
    • At best, every business change should simply mean re-constructing the virtualized layer (no re-loading necessary)
    • At worst, every business change should only affect very “localized” data-loading jobs
  • Favor communication of a data model tied to business terms and meaning
    • The data model should be easily understood and accepted by “new” business eyes
  • Implementation is more dependent on vendor tools: it follows platform features, choices and characteristics (a star-schema sketch follows)
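
As a closing illustration, a minimal star-schema sketch in which the business rule lives in a view, so that a rule change means re-creating the view rather than re-loading data; all names and the threshold are illustrative:

```python
# Minimal sketch: a star schema for delivery, with the business rule kept
# in a view so rule changes only re-create the view, not the data.
DIM_CUSTOMER = """
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    segment       VARCHAR(20)
);"""

FACT_SALES = """
CREATE TABLE fact_sales (
    date_key      INTEGER NOT NULL,
    customer_key  INTEGER NOT NULL REFERENCES dim_customer,
    sales_amount  DECIMAL(12,2)
);"""

# The "big customer" rule lives in a virtualized layer: changing the
# 10000 threshold is a CREATE OR REPLACE, not a fact-table re-load.
V_BIG_CUSTOMER_SALES = """
CREATE OR REPLACE VIEW v_big_customer_sales AS
SELECT d.customer_name, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_customer d ON d.customer_key = f.customer_key
GROUP  BY d.customer_name
HAVING SUM(f.sales_amount) > 10000;"""
```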


Martin

6 comments:

Unknown said...

Great Post.

I personally try to distinguish between *logistical* layers (staging, EDW, Dimensional Store) and *logical abstraction* layers (Central EDW, Access/Data Mart layers).

IMO The Rule (Vault) is not a layer in either sense, more of an *area* inside the central EDW.

Realize that all data entities can store both base and derived data (both over time and at the same time). This does not have to interfere with aspects like auditability/traceability and flexibility.

Martin Ouellet said...

Thanks Martijn.

I agree that "The Rule" is not a layer per se...

However, favoring a highly normalized model in the central EDW would lead to highly decomposed entities, so I'm not sure it should hold both base and derived data?

Unknown said...

It is a good thing to store derived data/calculations as atomically as possible; this makes them more manageable.

IMO the central layer sees any business rule/derivation just as another (very small) data source that happens to get its data from the raw vault and process it first, instead of being external.

Tobias Maasland said...

Hey Martin,

I think your post is very helpful from an architectural point of view. Especially regarding those layers, this article is still "fresh" today. Also, it helps to read this to get a general abstract idea of modelling a DWH.

When I wrote about some misconceptions about Star Schemas being perceived as Data Warehouses, I referred to this post for clarification: https://www.tobiasmaasland.de/2016/08/24/why-your-data-warehouse-is-not-a-data-warehouse/

Martin Ouellet said...

Thanks Tobias,
Yes, I wrote this post out of frustration after seeing too many DW implementations having a "missing" integration layer (or persistent ... or whatever else you want to call it), and all the (avoidable) difficulties it generates!