Martin Ouellet Notes

Tuesday, August 06, 2013

BI “ideal” platform

When dealing with BI projects, you are much more likely to work under the constraint of existing platform than building one from scratch. This means having to deal with existing idiosyncrasies, sub-optimal architecture, complex and big data models, tangled ETL loading dependencies and scheduling, confusing or mixing different approaches and architecture, etc.

This post presents my “ideal” platform blueprint I would use when no existing platform is in place which compromises good design principles. This ideal BI platform leverages the Multilayered architecture's concept borrowed from the more general Design Pattern in software design engineering. The advantage of layering is obvious enough:

Visibility: any layer can only depend on the immediate preceding layer
Functional role: each layer focus on clear and none contradicting roles
Volatility:
- Delivery (highest) layer are made up of components that are dependent only on end-user BR changes (i.e. new “business” rule should only affect Delivery layer)
- Raw Consolidated (lowest) layer are made up of components that change according to technical/system changes happening in operational world
- Business (middle) layer isolate lowest and highest layers from each other and from their life changing cycle (here the Business-oriented data of the Consolidation layer is the best place to design this in-between layer)

All layers control and isolate data dependencies : each layer is only sourced (or dependent) from its direct layer underneath, and layer only expose data to layer above.

The following diagram depicts an high-level view of this idealistic BI platform:

Obviously, this diagram’s focus is on simplification, detail of less important are not represented; ex. possible staging areas in between each layer, metadata layer (for ETL business rules, KPI and Report definition, ..), delivery layer usually has an eclectic choice of architecture based on technologies and tools.

Source Layer

Layer Definition

This layer is simply all systems collecting and generating data. Not much can be said here as this layer does not fall under the BI group’s responsibility. However one design criteria note worthy concerns its physical implementation and is meant to avoid disrupting the on-going operational process :

We should never target the real “Production” system, but Leverage whenever possible some data replication/copy mechanisms (hardware or software -based)
Persisted Staging Area or ODS are different ways for BI to get their own our copy of source data with minimal impact on source

Integration/Consolidated Layer

Layer Definition
This layer is the one closest to the original Inmon’s def: “subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions”. Some may refer to it using different name like Integrated Layer, but the fundamental idea remains the same: consolidate and store multi-source data in one database repository.

All data generated by the organisation’s activity and business processes (automatic, semi or even manual) are conveniently centralised here. We can further distinguish two sub-layers :

“Raw” layer which captures all data as-is to ensure full data lineage and traceability
“Business” oriented layer which adds business-provided data (reference, grouping, hierarchical structures..), and provides some early data transformation for reducing downstream transformation complexity for the Delivery layer.

Functional Role of Raw Integration-Layer:

Provide the first integration and consolidation point of multi-source data
Capture and archive all data as-is (no modification nor transformation) : Load 100% of data 100% of the time
Provide the full historical view of the detailed data: “System of records”
Identify and load master data source for important and recognized business entity
Attach “platform-neutral” Surrogate keys to all natural keys of business entity
Align entity across different source
Standardise through an easy and repeated data modelling patterns all data

Functional Role of Business Integration-Layer:

Data structure designed to apply business rules transformation
Bridge and link non-conformed entities across data sources
Provide Grouping, Hierarchies, and other business-provided data
Create structure to decrease the Delivery layer transformations complexity
Fragment complex transformations rules into smaller, more manageable and auditable steps
Recognise and flag conformed Entity across different data source
Flag “erroneous” or “wrong” data as such (rejecting data to be done in Delivery layer)

Layer-Design Criteria (raw and business)

Use a uniform Data modelling technique
- Do not merely copy and add timestamp/audit data to source OLTP Data models
- Use a standardized approach and stick to it for every data source integrated
- Avoid data model diversities and idiosyncrasies typically found in source layer (harder to learn and more costly to absorb)

Use a Highly Normalised data structure*
- Methodology like Data Vault or other similar approaches Anchor-modeling
- Discourage usage of large data structure (these structures result in important downstream dependency, rigid relationships, and are associated with unstable and hard to maintain ETL logic)

Favor Easy Data Loading over Easy Data Reading (compromise to pay)
- Must be Easy and Quick to load new data structure
- May be Hard and Messy to query existing data structure

Allow for growing organically/incrementally
- Start small with one source and a few data components
- Build small incremental step gradually as you ingest more data source

Accept gracefully system source changes
- At best, every change should simply “add” on the existing (no dependencies downstream impacted)
- At worst, every change should only affect a small number of “localized” data structures (very limited dependencies impacted)

Accept gracefully addition of new source
- Each new source is integrated independently
- “Bridge” these independent “universe” (data source) through easy flexible model structure, like the concept of Link in Data Vault.

Support full traceability and auditability of source data
- Don’t “fix” and don’t do any data cleansing
- Leave “erroneous” or “wrong” data AS-IS (these come for a reason)

Favor understandability through simple and repeatable data model patterns
- Both raw and business layer must leverage identical data modeling techniques and standard
- Data model easily recognized and understood and by “unaware” (new) eyes

*Highly normalized structure :

To keep track of a temporal database using database normalization, it is well accepted that a normalization less than 6NF has shortcomings (see 6NF Conceptual Models and DW 2.0). The 6NF form ensures that every non-key gets stored into its dedicated Relation. In this purest form, we’ll end up with all Relations having one Key plus one non-key attribute, hence making the Relation irreducible (keeping track of history is simply managed by adding time semantics).

Practical approaches normally accept some kind of compromise to this purest form (eg. DV would combine multiple attributes within a single Satellite). But they still benefit from major advantages of higher normalization:

+Model much more Resilient to changes in the outside world of the EDW (able to non-destructively evolve the model)
+Model are Historization-Ready making all attributes potentially temporal (possible by design)
+Model limits data redundancy even for temporal and time-variant data
+Model can support more performant Zero Update strategy
+Model results in the elimination of NULL and support early arriving facts (entity loaded only depend on an Identity)

Delivery or Presentation Layer

Layer Definition
This layer provides the Business view and/or Departmental view (different version of the truth) of the data. It is accompanied with Reporting tool and any Data Analytics platform used by business people inside the organisation.

Delivery-Layer –Functional Role

Support complex business rules implementation
Provide flexibility to change data transformation rules
Offer a sensible view to the business community of all organization’s operations and processes
Offer a cleansed and standardized view of the often messy operational data
Provide support decision-making tools and platforms
Provides data to allow for more advanced analytical and mining activities

Delivery Layer -Design Criteria

Favor simple and easily communicable data model
- Employ recognized Methodology such as dimensional and star-schema
Favor easy Data Reading over hard Data Loading (compromise to pay)
- Must be Fast and Easy to query data
- May be Hard and Messy to load data
Accept gracefully new Hierarchies, Grouping or other Business provided-info
Fully support multiple “version” of the data (departmental or organizational view )
Accept gracefully business rules changes (can be frequent)
- At best, every business change should simply mean to re-construct the Virtualized layer (no re-loading necessary)
- At worst, every business change should only affect very “localized” data loading jobs
Favor Communication of data model tied to Business terms and meaning
- Data model easily understood and accepted and by “new” business eyes
Implementation more dependent on Vendor tools: follows platform features, choice and characteristics

Martin

Wednesday, May 08, 2013

NoSQL: how disruptive for BI?

It seems the "NoSQL" term is now seen and used everywhere. This post discusses this new technology, along with its impact in relation to BI world.

Definition

Trying to find a clear and consensual definition is always a challenge with over-hyped and buzz words more used as a marketing advantage. Recently, well respected authors Martin Folwer and Pramod Sadalage wrote a clear and informative book on the full spectrum of noSQL technologies called "NoSQL Distilled: a brief guide to the emerging world of polyglot persistence".

According to the book author, no well accepted definition exists for NoSQL, the original name from the NoSQL Meetup was "open source distributed nonrelational databases". Rather than focusing on definition, the authors rather focus on a few common points all shared by NoSQL databases:

Often run off commodity cluster computing
Support of large or extremely large data volumes on clusters
Query language is not necessarily SQL (usually a close derivative language is used)
Usually Open-source and prominent in Web-based applications ecosystems
consistency is not done relying on relational ACID principles (an effect of running off independent cluster nodes)
Use Schema-less, i.e. no static data model structure mandatory

What about Cluster-based Solution proposed by Relational vendor
The commodity cluster models lead to a much cheaper and more scalable solution (scaling-out with much small nodes) than the scale-up alternative (ever increasing size of big monolithic and expansive server). The cluster version of relational databases (ex. Oracle RAC or Microsoft SQL-server) is not designed to be run off multiple independent nodes, as it actually relies on a common and share disk subsystem. Their model leads to fast and performant throughput with cluster-aware filesystem writing to a single highly available disk subsystem. This single point of failure is avoided in robust commodity cluster where a server node may go down (both server hardware internals & disk)) without impacting the cluster on-going execution (full failover and high availability is guaranteed).

NoSQL database: Metamodel classification

The author proposed a neat and convenient way to classify current NoSQL database according to how they store data, i.e. the metamodel used. The relational metamodel is based on relations (tables) and tupple (rows), current NoSQL uses rather these forms:

Key-value: Data stored at Aggregation* level using an "opaque" form

Each aggregate has a key or ID to get at the data
Each aggregate is opaque as data is stored using unstructured blob
(+) use to store whatever we like with just a constraint on size
(-) less flexible in querying & updating (cannot be done on part of a value)

Document: Data stored at Aggregation level using a visible and explicit semi-structured form (typical of JSON)

Each aggregate may have a key or ID to get at the data, but not necessarily as query can be done on any fields within the aggregate
Each aggregate is stored within a defined structure imposing a constraint on what's data is allowed
(+) more flexible in querying (part of a value can be queried or updated)

Columnar: Data is stored using a two-level map with grouping of columns including all rows.

Data is identified with a row-ID followed by a column-name normally part of a column-family
Query sequence pattern: Row-key --> Column-key --> "values"
(+) optimal with read-oriented app (ex. BI)
(-) not ideal for writing-oriented app

Graph: Special database using graph metamodel (i.e. nodes & edges) to store records with complex interconnections

Each entity or nodes can be queried in relation to interconnected nodes typical of social networks, product preferences..
(+) querying is optimal compared to the relational model which require FK relationship and expansive join operation
(-) focus on querying results in more complex & less performant data insertion
(-) less suited with storage across distributed cluster nodes

*The term aggregation is used to refer to data unit more complex than just a list of values or tuples. Typically, data unit is consisting of a complex set of records with associations and nested structure, all of which must be manipulated atomically (ex. think about one customer with all related addresses, or Order with all item details, payment info and customer attached to it).

Technologies

Looking at impressive number of open source implementations available (the list just names a few out of the 150 databases currently listed here), we have an indication of the potential growth of this technology, but also an indication of its maturity level... only a few important implementations could likely survive!

A lot of NoSQL database stores may not have clear-cut storage metamodel fitting exactly the definition of M. Fowler et al. metamodel taxonomy. So depending on interpretation, we may find some stores in various categories, please refer to NoSQL online reference for other interpretation and up-to-date data on this active field.

Key-value Store:

Redis
Voldemort
Oracle BDB
Amazon SimpleDB
Riak

Document Store:

Apache CouchDB
MongoDB
BaseX

Columnar or Column store:

Cassandra
HBase (Hadoop)
HyperTable
Apache Accumulo
MonetDB

Graph Database:

Neo4J
InfoGrid
Infinite Graph

BI perspective

I consider Graph database and Column-store to have the greatest impact in BI and to have the potential of being disruptive to current BI market.

For Column-store, this is easily understandable as BI is lot about scanning and other read-only data access where often just a few fields may be fetched out of much wider and flat record. As a testimony of this, we can look at the number of commercial vendors purchasing existing solution and/or integrating column-store into their own platform :

Sybase IQ now part of SAP, probably one of the original pioneer of columnstore in commercial scene
Vertica (purchased by HP), the 5-year old vendor specialized and fully dedicated to column-store
Greenplum (purchased by EMC) with their "polymorphic" data storage
Paraccel (very recently purchased by Actian) combine column-store with an in-memory engine
Oracle with a twist that they refer to as Hybrid columnar (rows still being used)
Microsoft SQL-Server proposes a compromised approach called "columnstore index" where data update is no longer permitted!

As for Graph database, I think great opportunity is possible by mining the relationship and interactions contained explicitly or implicitly within these graph content. Lots of research are currently done on work aiming at providing BI-style analysis against Graph-based data. Here're two great examples: http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/beweb2012/a3-bleco.pdf or else http://perso.ecp.fr/~aufaurema/Graphs%20for%20BI%20(printable).pdf

Martin

Friday, March 01, 2013

Data Vault model: Mobile Telecom example (part-3)

Here's my last note concerning "Data Vault model: Mobile telecom example" where some points and issues are discussed.

Open Points and Discussion

→Business Key inconsistencies
Among all OLTPs used as source, it is very unlikely that they will all share exact same Key(s) to represent same business entity ! If we are lucky, keys will vary in format only, but more frequently, keys can be totally different.

This challenge is referred in the DV community as the "Business Key Alignment", where raw keys coming from OLTPs must somehow be integrated (and thus aligned) toward the universal EWBK.
Next point discusses these points for the most important Key of all Telco: Subscriber Business Key.

→Subscriber identification
In telco, any subscriber is uniquely identifiable through its phone number (MSISDN) but only at given point in time (subscribers can change phone number!). This makes MSISDN only a valid temporal-based Key. We are normally more interested in tracking people than number, then we certainly need an alternative business key. Billing systems already accounted for these changes of numbers, so these should be used as the Master Data Source for our Subscriber Hub’s Business key. Other transactional systems in telco, will often make use of temporal-based keys, that is fine in for transaction-based context that always refer to a point in time. Another example of this is the IMSI which is the unique subscriber number soft-wired at the SIM card level, and somewhat more stable than MSISDN (still an issue when subscriber loses his SIM).

These points suggest important properties needed for Hub Business Key:

Business Key should be immutable
Business Key should come from one Master Data Source
Business Key may not always be well recognized by business users, especially not at the sacrifice of immutability.

So here, we decide to choose as Business Key, the Subscriber-Id having these properties, and maintained in our Telco Billing platform (Master data source).

Now, how should we model the other mutable Keys? They will likely be needed while we integrate other OLTPs platforms which rely on them. To be able to map transactions, links coming from these others platforms, we need to record the state of the subscriber mutable keys (ex. MSISDN) according to the chosen Hub key. As such, we propose to hook them up through the use of dedicated Satellite giving their key values at any point in time (given a Business Key).

Fig Data model around the Subscriber Hub

→Satellite Design Criteria
When it is time to model Satellites around each Hub we have quite a lot of freedom. We can think of a spectrum modelling style where we have at one extreme One Satellite per Hub, and at the other extreme One Satellite per Attribute (from completely denormalized to highly normalized).

We can consider these criteria for modelling guidance :

Rate of Change (group attributes of similar life-cycle)
Types of Data (group attributes of similar nature)
Record source system (group attributes by their source) &nbsp

However, I would add another important criteria: Clarity and Importance. It is well accepted that some attributes play a much more crucial role than others. Take for examples, Subscriber which are commonly used to derive important downstream semantics, like business logic, segmentation, grouping, etc (compare these with Attributes such as Address-data like Street Door-no).

So for these crucial Attributes, we model them in their own separate Table so that we can clearly spot them while browsing our DB schema metadata (see ex. SAT_SUB_STATUS in figure 2). And more importantly, mixing attributes will lead to additional records (due to historical change tracking). Our satellite will time slice against attributes of interested, and thus optimized for faster query.

→Call or Network Event
How should we model CDR’s data? Two issues arise when one wants to model these :

Business keys that can possibly identify each record not straightforward (complex and also prone to changes in time)
High data volumetry and burst rate may impose additional constraint on modelling

Can we classify a Call (Network event) as being a recognized business entity? It has clearly some characteristics, however it cannot easily be identified through limited business-keys. One could imagine that the following key combinations: 1) Calling MSISDN, 2) Called MSISDN and 3) Event_start_date.

These seem to satisfy the uniqueness criteria (at least for scenarios such as receiving voice mail at same time as making another phone call, etc...), but things get much more complicated now with the introduction of smart phone and their multitude protocol and access mode (HTTP, email, VoIP, VPN, ftp, Peer-to-Peer, TV streaming...). So to me, the Network Event cannot be modelled as a Hub, but rather as a pure Link.

Now, concerning data volumetry & burst rate. If we model these as Link and normalize all contextual data into numerous Satellites, we end up with many join operating on extremely large tables ! Furthermore, we can’t even take advantage of star join structure (with much smaller dimension tables having bitmap indexes surrounding the big fact table), as all satellites have the exact same size as the Links tables. In CDR’s context data are merely static, so there should not be any need to design Satellite to keep track of change in time.

SO here I'd argue we can loosen Link modeling principle and design what is called a Transaction Link, I think it makes perfect sense for our CDR’s scenario, as illustrated below:

Fig Network Events are modelled as a Transaction Link

Also worth mentioning is that most relationships are optionals, this is typically true when integrating multiple source systems (rated calls from billing, unrated from mediation, etc.). Depending on the event context, and there may or may not be a Rating_plan (unrated calls like incoming have none), Cell-id, Called/Callee are both required only for more traditional calls/sms, Cell-Id (caller may be roaming outside telco network), etc.    I’m not sure we should model these missing relationships as done with dimensional modelling (specific exception records exist in dimensions to avoid the missing Null FK’s).    Here, I’d recommend to simply leave the Null FK which reflect the way things are recorded at source system.

→Relationship with lifespan (temporal-variant)

DV models Link as (timeless) existence of a relationship, hence they record any association as seen from source, similarly as we do with Hub key.    However, some relationships are meant to have limited lifespan scope.   It then becomes critical to keep track of this lifetime as it validates the time period the relationship holds true. Not much is said about these.... could however find the following (from this cheat sheet online resource):
"We can end-date links using an ‘exists’-attribute in a satellite"

Therefore, we propose to create a Validity lifespan Satellite data around the Link entity to keep track of time period.   This is similar to contextual data around Hub which are always “ time-based”, except now these time-based is no longer metadata but rather functional, i.e. correspond to contextual data of the Links (in fact for this particular Link, no info is conveyed except the relationship itself and its time period validity).   In the Validity Satellite, the attribute Valid-from and Valid-to must correspond to time period, however this rule is not enforced here in case the Billing source-system its business rule in future (subs only has one Rating Plan at any time):

Fig Link between Subs and RatingPlan with limited timespan

→New Hub or Satellite attribute

What justify promoting a single Attribute into a recognized and separate Hub entity? In theory, we learnt that Hub must be an entity recognized in the organization with clear business key(s). However, there are quite a few attributes that have these characteristics: Retail plan, Account type, Promotion, Campaign, etc.... identifiable code associated to them). So I believe we need clearer Design criteria providing guidance to determine what can remain simple Attribute vs what needs to be promoted as a new Hub. I propose these simple criteria:

Pervasiness: Does the Attribute appear in many different source systems/contexts. If yes, then consider the Hub so that we minimize the negative impact of having non-standardized and non-conformed codes spread throughout the EDW data model.
Relationship: Can the Attribute be used in other relationship than its unique associated Hub. If yes, then consider the Hub so that we can capture these relationship and association elsewhere through the standard way : Links.
Evolution: Maybe, a data source newly integrated includes an important business concept still only used, for now, in isolation. If yes, consider the hub as newly fashionable concepts tend to proliferate quite rapidly these days...

Regarding these new design criteria, it is easy to see why an attribute like Retail-Plan (the identifiable code), was modelled as a Hub, since it is frequently used in many contexts (relate to Subscriber, relate to Campaign, can be found in other system, etc.). However, one could state that Address have same characteristics, so that it could as well be a Hub.... yes but unfortunately it rarely meets the 1st Hub design criteria: unique and easily identifiable Business key (unless your a shop specializing with addresses, do you have available in your system a clear and unique Business key for Addresses, most likely not?).

Martin

Thursday, February 28, 2013

Data Vault model: Mobile Telecom example (part-2)

This is the next part from a previous notes where I introduced "Data Vault model: Mobile Telecom example". In this note, I discuss the model following each step of the methodology.

Business Entities

Let’s start first with the identification of some of our key Business entities:

1. Customer

Definition: party that gets billed for the service/product offered by the Telecom
Characteristics

Business Key: BAN (billing account number)
History Evolution: quite stable as one BAN cannot change for the same account however, one subscriber can have different BAN when migrating account type (e.g. prepay to postpay).
Complexity: recursive relationship between BANs often needed to capture BAN hierarchies (BAN → parent BANs → root BAN) for large B2B customers.

2. Subscriber

Definition: individual attached to a particular subscription, i.e. MSISDN (mobile number that provides telecomunnication services).
Characteristics

Business Key: MSISDN, the telephone number given to the individual.
Key alternative: this Business key is not stable, as same subscriber typically changes MSISDN, so Billing systems usually provides an alternative key. This alternate subscriber-ID must be used instead.
Complexity: Not all OLTPs have same identifiable key, any systems not so concerned with history usually tracks the subscriber with its MSISDN at given Timestamp, whereas others come up with more stable key like Subscriber-ID or technical key attached to the SIM card like IMSI.

3. Product

Definition: In telco, the product does not correspond to your typical and tangible goods but rather associated to some networking features leveraged by your handset (e.g. SMS, MMS, call, voice mail, highspeed data, etc..).
Characteristics

Business Key: ProductCode (usually short text not so well standardized throughout the organization).
Complexity: Conformance and data integrity problems as each system may use its own, and usually involves some form of hierarchical representation.

4. Service

Definition: Service is usually tightly coupled with the notion of product. Subscribers purchase service (either inherited through some rating-plan or purchased individually) that include specific Product along with pricing and detail features. (e.g. SMS option, Voice mail international, Local calls special rating, unlimited data volume, free calls number, etc..).
Characteristics

Business Key: ServiceCode (again short text not so well standardized but mostly used by the Billing system).
Complexity: Usually involved some grouping or combination of more detail features related to network and Intelligent network layer (IN), and also linked to rating information used for billing purposes.

5. Handset

Definition: Device through which the subscriber access network product.
Characteristics

Business Key: TAC (standard code describing handset model manufacturer etc.). Not to be confused with IMEI which uniquely identifies each physical device (handset id).

6. Invoice

Definition: Bill statement sent regularly to the Customer for related subscription.
Characteristics

Business Key: Invoice number (used mostly by billing system).
Complexity: typically applicable for a subset of the entire customer population (postpay), as prepay is billed in realtime according to usage

There are obviously many more entities that quickly come to mind (e.g. Sales, Call-Center, Agent, Contact, Campaign, Port-in/out, Visits, etc..), but let’s stop here for the sake of simplicity.

Business Relationship Entities

Again, let’s give out a few examples of relationships that should cover some different modeling scenario. Telco are very complex organization with a large number of heterogenous systems, so these example are clearly over-simplified:

1. Customer hierarchy

Definition: Large B2B customers have usually organizational hierarchy struture for managing employee payment responsibilities and complexity
Characteristics

Data Model name: LNK_CUSTOMER_HIER
Natural recursive relationship: Each Account is associated to a direct parent Account (a single node) until we reach the top account or Root BAN. (BAN → parent BANs → .. → root BAN).
Links modelling: these recursive relationship are best modelled with a parent_key FK’s pointing back on same table. However in DV, linkage structure stands on its own, so these Links must be stored separately. The Child_key acts like the driving key of the relationship (is used to determine the end-dating of the link).

2. Subscriber RatingPlan

Definition: The subscription is associated to some form of rating plan.
Characteristics

Data Model name: LNK_SUBS_RPLAN
Natural relationship: Any subscriber must have one and only one RatingPlan at any point in time. This plan specifies the exact rating of all billable communication items.
Links modelling: this is modelled using a typical Link structure between Subscriber & RatingPlan accompanied with lifespan From/To Date in Satellite (see discussion “Relationship of limited lifespan” ).

3. Billing Invoice statement & Charge item

Definition: Billing statement is invoiced to each customer account at each end of bill cycle.
Characteristics

Data Model name: LNK_INVOICE & LNK_INVOICE_RECHARGE
Natural relationship: The Link relates Bill Statement with the customer, and also with the 1-to-many Subscribers attached to this Customer account. There is also a Link which itemizes each charge by Product (and depending on RatingPlan) associated to one Invoice (LNK_INVOICE). Things may vary here, but let’s keep it simple..
Links modelling: The relationship are regular Link between Customer, Bill-Statement and Subscriber for the Invoice. And for Invoice-charge, we tie it to the Product (item being billed).

4. Recharge Activity

Definition: Prepay subscriber must recharge (top-up) their balance as they go.
Characteristics

Data Model name: LNK_RECHARGE
Natural relationship: This recharge activity is associated to various channel and payment method.
Links modelling: A simple Link structure with FKs referring to Subscriber and Dealers (i.e. the channel through which the subscriber purchased new credit).

5. Network Event (calls, sms, voicemail, internet)

Definition: Transactions corresponding to network events which are consumed by the telco subscriber.
Characteristics

Data Model name: LNK_NETWORK_EVENT
Natural relationship: Depending on the communication product (voice calls, sms, mms, gprs data, umts data, etc..), various entities are involved and can be linked.

Contextual and Descriptive Data

These are meant to convey the time-variant and historical part of each attribute captured by the EDW. Modelling these as Satellites, we allow for timeline slicing and taking snapshot of how things (i.e. business state) were at that time. The main advantage of Satellite, is we usually don't know in advance all features/attributes of interest needed... and even if you knew, things evolve at such pace that most likely you'll need to revisit them soon. With Satellites, you can build them incrementally, bit by bit according to business needs and scope changes without affecting existing structure, as long as you keep them normalized and fine-grained. No longer needed Satellite can simply stop being loaded, while new one get created.

As my goal is not to capture a complete Mobile EDW, I've only included a few Satellite here for
illustration purposes. These would be typical features/attributes associated to Hub/Link entities. They are not meant to be exhaustive but rather convey the meaning of the different entities.

The rest of the note is presented in the part-3, where some details and highlights are presented related to design decision and open issues that got raised during the modelling.

See last part here: Data Vault model: Mobile Telecom example (part-3)

Martin

Data Vault model: Mobile Telecom example

Context

This is my tentative to model an EDW Integrated layer leveraging the Data Vault approach. This is done for the industry I know best, i.e. the Mobile Communication industry. But first, here are a few remarks about the context of this doc:

DV modelling is often applied for the auditability of OLTP source systems, where one wishes to produce a system of record to guarantee full traceability, completeness and historical tracking of the data. I’m not interested in these goals for which there already exist tools/methods tp automatically populate what is often called Operational Datastores.

Most transactional systems in big organisation (e.g. Billing, CRM, ERP, etc.) are overly complex- This is partly due to satisfying contradicting goals : being generic while providing detail completeness. And typically in BI, only a subset of their data contents is worthwhile for building the EDW.

My main interest is Data Integration: how various data elements generated by business processes can be integrated into one common EDW layer using DV modelling approach independently of the OLTP data sources. This is in agreeing that EDW Integrated Layer should not contain any form of data transformation, cleansing, calculation and derivation- We leave these business rules transformation to downstream Layers closer to business user (this is not to say that EDW layer is simply storing data as-is without some form of integration).

The focus is on the Business Entities stored as Hub along their natural keys, the Transaction/Association forming the Link. There is not much effort spent on the descriptive data content (i.e. the Satellite).

The data model is much simplified without much concerns in terms of technical implementation details (more functional oriented than technical). Satellites data is just provided as examples to help better understanding the semantic of each model entities.

The data model serves more as a way to provide validation of the DV modelling approach when used in the context of building an Integrated layer of EDW.

So the presented model is not based on reverse engineering of existing OLTPs, but rather aligned on functional analysis of business processes typical of Mobile communication organization. I was arguing, in previous post, that not much integration is possible when DV modelling is simply used to store the OLTPs source data (100% of data 100% of the time). This is a recognized issue in the context of the overall DW architecture, as people advocates the use of multi-layers Data Vault. For the first level, we would then talk in terms of the Raw Vault, and for the second layer we would talk in terms of Business Data Vault. This is broadly speaking and there is still confusion and misalignment about the exact terms and meaning (as found here).

In this respect, I favourably adopt the position articulated by DataVault academy as shown in this training video or this high-level methodology guide. In summary, we should not only look at source system data/metadata for EDW modelling, but also look at the business side with aspects like :

Business processes
Enterprise initiatives (logical/semantic models, utilised datamart, ..)
Business interviews and session
Industry models
Taxonomies

Modelling Methodology

1. Business Entity Identification and Modelling

What are the business entities (core concepts like master data, events, or even transactions)
Establish the EWBK -Enterprise Wide Business Keys- recognized by all business stakeholders. These are critical in integrating multiple data sources into a common EDW. Integration often requires the use of higher-level keys and not simply raw key found in the different OLTPs.
Model the Hub around these keys

2. Business Relation Identification and Modelling

What are the natural relationships between entities
Analyze relationships involved during application of business process. These may even involve multiple atomic transactions (long transaction) often referred to as Unit of Work with multiple data sources (as opposed to a single-source transactional UoW).
Model the Link around entities involved in these relationship

3. Context identification and Modelling

What are the descriptive and context data associated to each Entity and Relationship
Model the Satellite around these descriptive data using your choice of Satellite design criteria

In summary, whatever perspective you adopt, DV model should aim for this clear and simple objective :

"Separate business keys from relationships and from contextual data"

I don't remember where I've read that statement but I find it elegant and quite coherent to what the goal of DV modelling is in comparison to other modelling approaches like Dimensional modelling.

Before explaining the end-result model, here's how it looks :

Fig 1. View using an organic layout of the overall DV model example
(blue, yellow and red correspond to Hubs, Links and Satellite respectively )

You can also refer to the original PNG-file for all details.

The diagram has been formatted organically to better visualize the role of each entity type. Satellites are surrounding their Hubs/Links while Links tie these clusters together. Exception to that is the hanging cluster with missing Links. These corresponds to Hub_service and Hub_contract, and analysing additional business processes would be necessary to correct this situation.

Modification
My notes are often written for archival purposes without focusing too much on readability. As more people gets redirected here, it seems I should do a bit of effort in putting these long notes into smaller and more digest form.

So you can follow the next part here: Data Vault model: Mobile Telecom example part-2.

Martin

Monday, September 03, 2012

Understanding and Using SVD with large dataset

When confronted with large and complex dataset, very useful information can be obtained by applying some form of Matrix decomposition. One example is the Singular Value Decomposition (SVD) whose principles yielded the derivation of a number of very useful application in today's digitized world:

Lossly Data Compression algorithm based on SVD as a form of change of basis: JPEG image compression using FFT basis, or yet the Wavelet-based image compression (wavelet basis). A good ref: http://cklixx.people.wm.edu/teaching/m2999-3f.pdf
Information Retrieval: large document repository often make use of SVD to improve its searching capability (usually named LSI, or Latent Semantic Indexing)
Noise removal in large dataset (see this point later)
Collaborative Filtering: recommended system used by commercial website (books, movies...) uses SVD to derive information about purchase patterns of customers to give recommendation to other similar customer. Netflix contest probably gave the greatest boost to SVD popularity!
For Spatial data analysis. In fields such atmospheric ocean and meteorology, SVD is the “most widely-used multivariate statistical” tools (ref. http://iridl.ldeo.columbia.edu/dochelp/StatTutorial/SVD/)

In this post, I discuss the underlying theory of SVD, its various interpretation and properties. I conclude by mentioning briefly two usages in analytics or data mining field. For those like me needing a refresher on linear algebra essential for understanding SVD, I highly recommend the lectures given by Prof Strang and freely available through MIT Open courseware (video link). For those interested in its application in relation to Data mining, go check David Skillicorn's very insighful and practical book named "Understanding Complex Datasets".

The Math

Consider the matrix A of size nXp (rowXcol) having n objects (or instances) characterized by p measured attributes (or variables). [Note in Lin. Alg. we normally use nXm (Amxn) but in more applied field we usually have quite a lot more rows than attributes and it is custom to use n for indicating #of observations and p the variables (so that p << n)].

Now, SVD's goal is to find some some vectors v in Rp spanning the rowspace of A and all orthogonal to each other (i.e. v Є R(A)). These vectors v, when multiplied by the matrix A, results to a vector u in Rn which are all orthogonal to each other By convention, v’s are taken ortho-normal. Hence the vectors v form an orthogonal basis of the Rowspace of A (aka domain of A) while the vectors u form an orthogonal basis of the Colspace of A (aka range of A).

In the most general form, A is a real matrix, so we could find at most p orthogonal vector v's (dimension of the basis in Rp) as well as n orthogonal vector u's (dimension of the basis in Rn). Next diagram illustrates this general form :

However, not all of them are relevant as the matrix A may have a rank r lower than either p or n. The rank of a matrix rank(A) corresponds to the number of independent row vectors or column vectors (both are the same). Because there can only be r independent components, the SVD decomposition can only give r relevant SV's, so there will be (p-r) useless SV's all equal to 0. This reduced form can be represented as a vector form like:

A = ( u1 u2 ... ur | ur+1 .. un)nXn ( sv1 sv2 ... svr | 0r+1 0r+1 ... 0p )pXp (vt1 vt2 ... vtr | vtr+1 ... vtp )

Or similarly using the same schematic diagram, we can show this reduced form as:

Now in practice, we have deep Matrix (n>>p many more instances than attributes), so that r=p. The decomposition can then only yield p orthogonal v's leading to p relevant u's.

Conversely, in situation with much wider Matrix (n<<p, as with text mining with documents represent n instances and words being p attributes). These now give r = n, and only n vectors v's are relevant (the rest will be in the N(A) with 0 SV's).

Using Matrix form to express the decomposition gives the formula:

A V = U Σ

where Σ is a diagonal matrix containing factors which, in effect, normalizes all u’s. These factors called Singular Values (SV) are found in the diagonal entries of matrix Σ. Rearranging a bit, and noticing V-1 = Vt as V is composed of orthonormal vectors (the inverse of an orthogonal matrix is equal to its transpose), gives the following :

A = U Σ Vt

This is SVD decomposition where the original Matrix A is now factorized into U (holding n instance of objects) and weighted by SV (Σ being diagonal have all zeros entries except on diagonal), and the pXp matrix Vt (holding the m orthogonal components).

SVD re-expresses the original data set (our Matrix A of rank r) in compact and orthogonal form by using:

r orthonormal instance vector rows taken in rowspace of A, i.e. the vector v's (aka right singular vector)

r orthonormal attribute vector columns taken in columspace of A, i.e. the vector u's (aka the left singular vector)

In other word, SVD produces a new orthogonal expression form which takes into account both correlation of variables and correlation of instances. This re-expression has also a notion of importance (the SVs value) so that one can easily drop less important dimensions (smaller SVs) and end up working in a smaller, more compressed and more palatable data set form.

Let's focus on the first scenario as most typically found in data mining app. With the p vector v, we have the first r coming from the dimension of rowspace of A and the rest (p-r) would be taken from Nullspace of A (N(A)) with corresponding SV=0. However this is not needed in usual Data mining setting, as we are in front of a very long and narrow A, so r=p.

Consequently it’s nearly impossible to have perfectly correlated attributes (linearly dependent) so that not a single attribute could be replaced by a linear combination of the others. In other words, the very long column vectors contained in col of A are all independent and each attributes contribute to bringing new information (in practice we would obviously ignore an attribute that is purely a combination of the other).

Relationship with PCA

Working out a bit the equation by multiplying both side by At and its equivalent (U Σ Vt)t will give either the correlation Matrix of attributes (the dot product of long attributes vector) or the correlation Matrix of objects (the dot product of short objects vector) and will eliminate either U or V depending whether we multiply on the left:

At A = V2 Vt

and we multiply on the right:

A At = U Σ2 Ut

So U and V matrix are actually eigen vectors matrices of both correlation Matrix. Any symmetric matrix B produces Eigenvalues that are real, and Eigenvector matrix Q that are always orthogonal : B = Q ⋀ Qt with ⋀ being the Eigenvalues diagonal matrix. In our case, V corresponds to Eigenvectors of the correlation Matrix of attributes (At A) while U to Eigenvectors of the correlation Matrix of objects (A At).

This important point relates SVD with the PCA technique (Principal Component Analysis) often used to reduce dimensionality in data set.

Actually, this dimension reduction happens when taking the first v1 yielding the largest singular values or equivalently the longest vector u1 before its normalization. This is easily seen when we notice that each entries of u1 corresponds to the projected length value between the vectors ai (rows of A) and the vector v1 (simple dot-product). Hence v1 gives the new axis along which as much variance is preserved: v1 is the PCA-1! All others vs ranked by SV’s magnitude correspond to the other ordered PCA’s.

PCA considers the first few Eigenvectors of the covariance matrix (in order to work on normalized attribute) according to magnitude of Eigenvalues. When done this way PCA analyses along the attributes only (not along instances) yielding a decomposition with a smaller number of transformed attributes (orthogonal to each other) which preserve as much variance among original attributes. Although PCA can operate on correlation Matrix of objects (A At) it must be done in a different decomposition, as such SVD is more general as it analyses the dataset simultaneously in terms of inter-variables and inter-instances correlation!

Dimension Compression/Truncation (lowering the Rank!)

One of the goal when applying SVD is to get a more compact view of the data which is as close as possible to the original set. We know that choosing r relevant SV’s (rank of A) will result in a perfect decomposition where U Σ Vt matches exactly A. However, can can also rank SV’s by their magnitudes and decide to just pick the first k ones, so that the other r-k are considered much too small to represent any important data structure. So when we select and retain only some parts of the decomposition (dimensions), we actually produce a truncated version of A with a simplified data structure that concentrate the most variation.

Proceeding with this truncation can be done to discard what can be considered as nose, or else because the discarded parts are considered to come from processes we do not intend to keep. Refer to the Component interpretation below which is well suited to view the SVD as a way to decompose the data set into its underlying independent generation processes.

Hence, the SVD is usually performed to reduce the dimensions of the original dataset while providing the best reconstitution of the original matrix by a matrix with a lower rank.

The Interpretation

There are a number of ways one can view or interpret the SVD decomposition. These arise from the various way we can proceed with Matrices multiplication. It follows that according to how these multiplications are viewed, we can extract from SVD some insights and particular findings.

Given the general matrix multiplication form, where only k SV’s are considered out of r possibles dimensions. The associative rule of Matrix allows us to either multiply the S V Σ diagonal matrix first with the matrix U, or alternatively with the transpose of V (V’ in the diagram corresponds to Vt).

Now considering this general view, we can consider different interpretations.

The Geometric interpretation:

Considering each instance (row of A) as a vector in m-dimensional space, the decomposition will transform the original coordinates (the normal basis in Rm) into a new set of coordinates given in each row of Vt. Each instance is now expressed in this new set of coordinates as given by the entries in U.

The result is that the matrix Vt operates as a rotation on the original instance vectors while the matrix Σ operates as a stretch/shrinkage on the axes.

Illustrating this with a picture, we can simply imagine a unit sphere in the new transformed coordinate (the v’s) and its shape prior to the SVD transormation :(ref. http://www.aiaccess.net )

The Factor interpretation:

In this interpretation, the factor are held in the rows of V’, so there are k factors with each being weighted by the SV’s contained in Σ. This weighting gives level of importance of each factor based on SV’s difference in values. The U matrix list all instances of objects (i.e. the rows) with each instance showing specific level for each factors (i.e. the entries or attribute value in u representing the level).

The SVD can gives us some indication of the latent factors and how each instance show different level of expressiveness in these latent factors. This is often utilized in the social sciences where, data collected in survey, record a number of “superficial properties or actions” while the goal of the survey is to discover “deeper drivers or motivators”. It is expected that the Factor will reveal these drivers that cause the recorded superficial properties to be observed (this interpretation works best small sample with correlated attribute, see Skillicorn’s book for more details and the ranking wine example).

The component interpretation:

Matrix multiplication can also operate col by row piecewise to produce intermediate matrices Ai. This is called the outer product expansion and each Ai is a matrix of Rank 1. For a matrix A with rank=r, we can have at most r of these intermediate matrices, but we’ll choose to only keep the first k most important ones. Each correspond to a component.

Within this interpretation, we assume that the data generation producing A, is actually the result of the superposition of a number of processes, which only a few are considered of interest. The SVD will discover these underlying processes as independent components, although there may not be necessarily a true correspondence between these components and the exact underlying process. In practice, such correspondence can be found especially for the larger SV’s components.

Constraint / Assumption on data

Under linear algebra theory, SVD imposes to have real Eignevalues and orthogonal Eigenvectors. This is always satisfied for real symetric Matrix, such as these matrices AtA or AAt (ref. spectral decomposition prerequisite for Eigenvalues & vectors).

However, for data analysis purposes, there is additional constraint that each variables should follow approximately a Gaussian distribution. If we have a dataset which presents a set of different clusters or none-normal shape then, from the geometry interpretation, we can see that the new axis will not be so meaningful. The new axis will go along, again, the greatest variance, which in turn may be completely missing the set of clusters present in the dataset.

It is also custom to normalize the data before proceeding with its decomposition. Why? Because the SV magnitude will give an indication as to the importance of each SVs in explaining the variance structure of the dataset. This assumes we choose normalized vectors in V's and U's (vector of unit length), but also that we first normalize the attributes in A .

When attribute within the A dataset are not centered around their mean, the first singular vector v will just point to the longest direction toward the centroid of our data set. This will also affect all remaining singular vector which must be orthogonal to this first one. So to avoid this issue of wrong representation, we normalize all attributes so that their centroid gravitate around the origin.

Also, If we had some attributes with significant difference in range, then during numerical computation these attribute would have greater influence in the factorization process. This normalization can be done in numerous way:

Taking the z-score of each attribute values (subtracting the mean and dividing by their standard deviation) so that the majority of transformed values will be in the range of -1 to 1. However, this cannot be applied when attribute's distribution are far from Gaussian

Taking rank where new values take the rank order of their original values.

other more complex forms.

Note on Categorical attribute:

Similarly we must be careful when converting categorical value to number in order not to convey artificial ordering or spurious correlation when no natural ordering exist for the category. Ways to avoid that:

Map each category to two variable located in at equal distance along a circle, or yet three variables along a sphere (when number of categories is larger)

Similar but more complex Mapping into a corner of a generalized tetrahedron (ex. http://arxiv.org/pdf/0711.4452.pdf)

The Application in DM : Data reduction and explanatory

One popular application of SVD in data analysis is to reduce complexity and size of any dataset. The method will allow us to discover a smaller number of factors that can be related to the larger set of original attributes. This is a useful step first toward a better understanding of the inter-correlation structure between the attribute and observations.

Similarly, we can choose voluntarily to ignore some less important components, as these may be considered redundant and offer no important structure in our data analysis. Or yet, we may want to visualize our dataset in a 2 or 3-D view in the best possible way.

A word of caution: in data mining setting involving very large dataset, this method is not so useful as there will be many attributes partially correlated to each others in complex and subtle way. Hence, the discovered factors become hardly interpretable

The Application in DM : Noise reduction

Noise in dataset usually origin and combine from a number of multitude source, which results in adding Gaussian-like variability in our dataset, which has a net effect of increasing the apparent dimensionality of the dataset.

Let’s say we have a few variables which turn out to be highly correlated, adding noise to these observed variables will in fact obscure or the correlation pattern and making the dataset to appear at higher dimensionality than it really is.

The ordering nature of the SV’s value allow us to eliminate the later dimensions offering very little or no structure compared to the prior ones. All SV’s after a certain threshold can then be considered as noise, the challenge is to estimate this threshold (few methods exist, refer to Skillicorn book).