Martin Ouellet Notes

Friday, March 01, 2013

Data Vault model: Mobile Telecom example (part-3)

Here's my last note concerning "Data Vault model: Mobile telecom example" where some points and issues are discussed.

Open Points and Discussion

→Business Key inconsistencies
Among all OLTPs used as source, it is very unlikely that they will all share exact same Key(s) to represent same business entity ! If we are lucky, keys will vary in format only, but more frequently, keys can be totally different.

This challenge is referred in the DV community as the "Business Key Alignment", where raw keys coming from OLTPs must somehow be integrated (and thus aligned) toward the universal EWBK.
Next point discusses these points for the most important Key of all Telco: Subscriber Business Key.

→Subscriber identification
In telco, any subscriber is uniquely identifiable through its phone number (MSISDN) but only at given point in time (subscribers can change phone number!). This makes MSISDN only a valid temporal-based Key. We are normally more interested in tracking people than number, then we certainly need an alternative business key. Billing systems already accounted for these changes of numbers, so these should be used as the Master Data Source for our Subscriber Hub’s Business key. Other transactional systems in telco, will often make use of temporal-based keys, that is fine in for transaction-based context that always refer to a point in time. Another example of this is the IMSI which is the unique subscriber number soft-wired at the SIM card level, and somewhat more stable than MSISDN (still an issue when subscriber loses his SIM).

These points suggest important properties needed for Hub Business Key:

Business Key should be immutable
Business Key should come from one Master Data Source
Business Key may not always be well recognized by business users, especially not at the sacrifice of immutability.

So here, we decide to choose as Business Key, the Subscriber-Id having these properties, and maintained in our Telco Billing platform (Master data source).

Now, how should we model the other mutable Keys? They will likely be needed while we integrate other OLTPs platforms which rely on them. To be able to map transactions, links coming from these others platforms, we need to record the state of the subscriber mutable keys (ex. MSISDN) according to the chosen Hub key. As such, we propose to hook them up through the use of dedicated Satellite giving their key values at any point in time (given a Business Key).

Fig Data model around the Subscriber Hub

→Satellite Design Criteria
When it is time to model Satellites around each Hub we have quite a lot of freedom. We can think of a spectrum modelling style where we have at one extreme One Satellite per Hub, and at the other extreme One Satellite per Attribute (from completely denormalized to highly normalized).

We can consider these criteria for modelling guidance :

Rate of Change (group attributes of similar life-cycle)
Types of Data (group attributes of similar nature)
Record source system (group attributes by their source) &nbsp

However, I would add another important criteria: Clarity and Importance. It is well accepted that some attributes play a much more crucial role than others. Take for examples, Subscriber which are commonly used to derive important downstream semantics, like business logic, segmentation, grouping, etc (compare these with Attributes such as Address-data like Street Door-no).

So for these crucial Attributes, we model them in their own separate Table so that we can clearly spot them while browsing our DB schema metadata (see ex. SAT_SUB_STATUS in figure 2). And more importantly, mixing attributes will lead to additional records (due to historical change tracking). Our satellite will time slice against attributes of interested, and thus optimized for faster query.

→Call or Network Event
How should we model CDR’s data? Two issues arise when one wants to model these :

Business keys that can possibly identify each record not straightforward (complex and also prone to changes in time)
High data volumetry and burst rate may impose additional constraint on modelling

Can we classify a Call (Network event) as being a recognized business entity? It has clearly some characteristics, however it cannot easily be identified through limited business-keys. One could imagine that the following key combinations: 1) Calling MSISDN, 2) Called MSISDN and 3) Event_start_date.

These seem to satisfy the uniqueness criteria (at least for scenarios such as receiving voice mail at same time as making another phone call, etc...), but things get much more complicated now with the introduction of smart phone and their multitude protocol and access mode (HTTP, email, VoIP, VPN, ftp, Peer-to-Peer, TV streaming...). So to me, the Network Event cannot be modelled as a Hub, but rather as a pure Link.

Now, concerning data volumetry & burst rate. If we model these as Link and normalize all contextual data into numerous Satellites, we end up with many join operating on extremely large tables ! Furthermore, we can’t even take advantage of star join structure (with much smaller dimension tables having bitmap indexes surrounding the big fact table), as all satellites have the exact same size as the Links tables. In CDR’s context data are merely static, so there should not be any need to design Satellite to keep track of change in time.

SO here I'd argue we can loosen Link modeling principle and design what is called a Transaction Link, I think it makes perfect sense for our CDR’s scenario, as illustrated below:

Fig Network Events are modelled as a Transaction Link

Also worth mentioning is that most relationships are optionals, this is typically true when integrating multiple source systems (rated calls from billing, unrated from mediation, etc.). Depending on the event context, and there may or may not be a Rating_plan (unrated calls like incoming have none), Cell-id, Called/Callee are both required only for more traditional calls/sms, Cell-Id (caller may be roaming outside telco network), etc.    I’m not sure we should model these missing relationships as done with dimensional modelling (specific exception records exist in dimensions to avoid the missing Null FK’s).    Here, I’d recommend to simply leave the Null FK which reflect the way things are recorded at source system.

→Relationship with lifespan (temporal-variant)

DV models Link as (timeless) existence of a relationship, hence they record any association as seen from source, similarly as we do with Hub key.    However, some relationships are meant to have limited lifespan scope.   It then becomes critical to keep track of this lifetime as it validates the time period the relationship holds true. Not much is said about these.... could however find the following (from this cheat sheet online resource):
"We can end-date links using an ‘exists’-attribute in a satellite"

Therefore, we propose to create a Validity lifespan Satellite data around the Link entity to keep track of time period.   This is similar to contextual data around Hub which are always “ time-based”, except now these time-based is no longer metadata but rather functional, i.e. correspond to contextual data of the Links (in fact for this particular Link, no info is conveyed except the relationship itself and its time period validity).   In the Validity Satellite, the attribute Valid-from and Valid-to must correspond to time period, however this rule is not enforced here in case the Billing source-system its business rule in future (subs only has one Rating Plan at any time):

Fig Link between Subs and RatingPlan with limited timespan

→New Hub or Satellite attribute

What justify promoting a single Attribute into a recognized and separate Hub entity? In theory, we learnt that Hub must be an entity recognized in the organization with clear business key(s). However, there are quite a few attributes that have these characteristics: Retail plan, Account type, Promotion, Campaign, etc.... identifiable code associated to them). So I believe we need clearer Design criteria providing guidance to determine what can remain simple Attribute vs what needs to be promoted as a new Hub. I propose these simple criteria:

Pervasiness: Does the Attribute appear in many different source systems/contexts. If yes, then consider the Hub so that we minimize the negative impact of having non-standardized and non-conformed codes spread throughout the EDW data model.
Relationship: Can the Attribute be used in other relationship than its unique associated Hub. If yes, then consider the Hub so that we can capture these relationship and association elsewhere through the standard way : Links.
Evolution: Maybe, a data source newly integrated includes an important business concept still only used, for now, in isolation. If yes, consider the hub as newly fashionable concepts tend to proliferate quite rapidly these days...

Regarding these new design criteria, it is easy to see why an attribute like Retail-Plan (the identifiable code), was modelled as a Hub, since it is frequently used in many contexts (relate to Subscriber, relate to Campaign, can be found in other system, etc.). However, one could state that Address have same characteristics, so that it could as well be a Hub.... yes but unfortunately it rarely meets the 1st Hub design criteria: unique and easily identifiable Business key (unless your a shop specializing with addresses, do you have available in your system a clear and unique Business key for Addresses, most likely not?).

Martin

Thursday, February 28, 2013

Data Vault model: Mobile Telecom example (part-2)

This is the next part from a previous notes where I introduced "Data Vault model: Mobile Telecom example". In this note, I discuss the model following each step of the methodology.

Business Entities

Let’s start first with the identification of some of our key Business entities:

1. Customer

Definition: party that gets billed for the service/product offered by the Telecom
Characteristics

Business Key: BAN (billing account number)
History Evolution: quite stable as one BAN cannot change for the same account however, one subscriber can have different BAN when migrating account type (e.g. prepay to postpay).
Complexity: recursive relationship between BANs often needed to capture BAN hierarchies (BAN → parent BANs → root BAN) for large B2B customers.

2. Subscriber

Definition: individual attached to a particular subscription, i.e. MSISDN (mobile number that provides telecomunnication services).
Characteristics

Business Key: MSISDN, the telephone number given to the individual.
Key alternative: this Business key is not stable, as same subscriber typically changes MSISDN, so Billing systems usually provides an alternative key. This alternate subscriber-ID must be used instead.
Complexity: Not all OLTPs have same identifiable key, any systems not so concerned with history usually tracks the subscriber with its MSISDN at given Timestamp, whereas others come up with more stable key like Subscriber-ID or technical key attached to the SIM card like IMSI.

3. Product

Definition: In telco, the product does not correspond to your typical and tangible goods but rather associated to some networking features leveraged by your handset (e.g. SMS, MMS, call, voice mail, highspeed data, etc..).
Characteristics

Business Key: ProductCode (usually short text not so well standardized throughout the organization).
Complexity: Conformance and data integrity problems as each system may use its own, and usually involves some form of hierarchical representation.

4. Service

Definition: Service is usually tightly coupled with the notion of product. Subscribers purchase service (either inherited through some rating-plan or purchased individually) that include specific Product along with pricing and detail features. (e.g. SMS option, Voice mail international, Local calls special rating, unlimited data volume, free calls number, etc..).
Characteristics

Business Key: ServiceCode (again short text not so well standardized but mostly used by the Billing system).
Complexity: Usually involved some grouping or combination of more detail features related to network and Intelligent network layer (IN), and also linked to rating information used for billing purposes.

5. Handset

Definition: Device through which the subscriber access network product.
Characteristics

Business Key: TAC (standard code describing handset model manufacturer etc.). Not to be confused with IMEI which uniquely identifies each physical device (handset id).

6. Invoice

Definition: Bill statement sent regularly to the Customer for related subscription.
Characteristics

Business Key: Invoice number (used mostly by billing system).
Complexity: typically applicable for a subset of the entire customer population (postpay), as prepay is billed in realtime according to usage

There are obviously many more entities that quickly come to mind (e.g. Sales, Call-Center, Agent, Contact, Campaign, Port-in/out, Visits, etc..), but let’s stop here for the sake of simplicity.

Business Relationship Entities

Again, let’s give out a few examples of relationships that should cover some different modeling scenario. Telco are very complex organization with a large number of heterogenous systems, so these example are clearly over-simplified:

1. Customer hierarchy

Definition: Large B2B customers have usually organizational hierarchy struture for managing employee payment responsibilities and complexity
Characteristics

Data Model name: LNK_CUSTOMER_HIER
Natural recursive relationship: Each Account is associated to a direct parent Account (a single node) until we reach the top account or Root BAN. (BAN → parent BANs → .. → root BAN).
Links modelling: these recursive relationship are best modelled with a parent_key FK’s pointing back on same table. However in DV, linkage structure stands on its own, so these Links must be stored separately. The Child_key acts like the driving key of the relationship (is used to determine the end-dating of the link).

2. Subscriber RatingPlan

Definition: The subscription is associated to some form of rating plan.
Characteristics

Data Model name: LNK_SUBS_RPLAN
Natural relationship: Any subscriber must have one and only one RatingPlan at any point in time. This plan specifies the exact rating of all billable communication items.
Links modelling: this is modelled using a typical Link structure between Subscriber & RatingPlan accompanied with lifespan From/To Date in Satellite (see discussion “Relationship of limited lifespan” ).

3. Billing Invoice statement & Charge item

Definition: Billing statement is invoiced to each customer account at each end of bill cycle.
Characteristics

Data Model name: LNK_INVOICE & LNK_INVOICE_RECHARGE
Natural relationship: The Link relates Bill Statement with the customer, and also with the 1-to-many Subscribers attached to this Customer account. There is also a Link which itemizes each charge by Product (and depending on RatingPlan) associated to one Invoice (LNK_INVOICE). Things may vary here, but let’s keep it simple..
Links modelling: The relationship are regular Link between Customer, Bill-Statement and Subscriber for the Invoice. And for Invoice-charge, we tie it to the Product (item being billed).

4. Recharge Activity

Definition: Prepay subscriber must recharge (top-up) their balance as they go.
Characteristics

Data Model name: LNK_RECHARGE
Natural relationship: This recharge activity is associated to various channel and payment method.
Links modelling: A simple Link structure with FKs referring to Subscriber and Dealers (i.e. the channel through which the subscriber purchased new credit).

5. Network Event (calls, sms, voicemail, internet)

Definition: Transactions corresponding to network events which are consumed by the telco subscriber.
Characteristics

Data Model name: LNK_NETWORK_EVENT
Natural relationship: Depending on the communication product (voice calls, sms, mms, gprs data, umts data, etc..), various entities are involved and can be linked.

Contextual and Descriptive Data

These are meant to convey the time-variant and historical part of each attribute captured by the EDW. Modelling these as Satellites, we allow for timeline slicing and taking snapshot of how things (i.e. business state) were at that time. The main advantage of Satellite, is we usually don't know in advance all features/attributes of interest needed... and even if you knew, things evolve at such pace that most likely you'll need to revisit them soon. With Satellites, you can build them incrementally, bit by bit according to business needs and scope changes without affecting existing structure, as long as you keep them normalized and fine-grained. No longer needed Satellite can simply stop being loaded, while new one get created.

As my goal is not to capture a complete Mobile EDW, I've only included a few Satellite here for
illustration purposes. These would be typical features/attributes associated to Hub/Link entities. They are not meant to be exhaustive but rather convey the meaning of the different entities.

The rest of the note is presented in the part-3, where some details and highlights are presented related to design decision and open issues that got raised during the modelling.

See last part here: Data Vault model: Mobile Telecom example (part-3)

Martin

Data Vault model: Mobile Telecom example

Context

This is my tentative to model an EDW Integrated layer leveraging the Data Vault approach. This is done for the industry I know best, i.e. the Mobile Communication industry. But first, here are a few remarks about the context of this doc:

DV modelling is often applied for the auditability of OLTP source systems, where one wishes to produce a system of record to guarantee full traceability, completeness and historical tracking of the data. I’m not interested in these goals for which there already exist tools/methods tp automatically populate what is often called Operational Datastores.

Most transactional systems in big organisation (e.g. Billing, CRM, ERP, etc.) are overly complex- This is partly due to satisfying contradicting goals : being generic while providing detail completeness. And typically in BI, only a subset of their data contents is worthwhile for building the EDW.

My main interest is Data Integration: how various data elements generated by business processes can be integrated into one common EDW layer using DV modelling approach independently of the OLTP data sources. This is in agreeing that EDW Integrated Layer should not contain any form of data transformation, cleansing, calculation and derivation- We leave these business rules transformation to downstream Layers closer to business user (this is not to say that EDW layer is simply storing data as-is without some form of integration).

The focus is on the Business Entities stored as Hub along their natural keys, the Transaction/Association forming the Link. There is not much effort spent on the descriptive data content (i.e. the Satellite).

The data model is much simplified without much concerns in terms of technical implementation details (more functional oriented than technical). Satellites data is just provided as examples to help better understanding the semantic of each model entities.

The data model serves more as a way to provide validation of the DV modelling approach when used in the context of building an Integrated layer of EDW.

So the presented model is not based on reverse engineering of existing OLTPs, but rather aligned on functional analysis of business processes typical of Mobile communication organization. I was arguing, in previous post, that not much integration is possible when DV modelling is simply used to store the OLTPs source data (100% of data 100% of the time). This is a recognized issue in the context of the overall DW architecture, as people advocates the use of multi-layers Data Vault. For the first level, we would then talk in terms of the Raw Vault, and for the second layer we would talk in terms of Business Data Vault. This is broadly speaking and there is still confusion and misalignment about the exact terms and meaning (as found here).

In this respect, I favourably adopt the position articulated by DataVault academy as shown in this training video or this high-level methodology guide. In summary, we should not only look at source system data/metadata for EDW modelling, but also look at the business side with aspects like :

Business processes
Enterprise initiatives (logical/semantic models, utilised datamart, ..)
Business interviews and session
Industry models
Taxonomies

Modelling Methodology

1. Business Entity Identification and Modelling

What are the business entities (core concepts like master data, events, or even transactions)
Establish the EWBK -Enterprise Wide Business Keys- recognized by all business stakeholders. These are critical in integrating multiple data sources into a common EDW. Integration often requires the use of higher-level keys and not simply raw key found in the different OLTPs.
Model the Hub around these keys

2. Business Relation Identification and Modelling

What are the natural relationships between entities
Analyze relationships involved during application of business process. These may even involve multiple atomic transactions (long transaction) often referred to as Unit of Work with multiple data sources (as opposed to a single-source transactional UoW).
Model the Link around entities involved in these relationship

3. Context identification and Modelling

What are the descriptive and context data associated to each Entity and Relationship
Model the Satellite around these descriptive data using your choice of Satellite design criteria

In summary, whatever perspective you adopt, DV model should aim for this clear and simple objective :

"Separate business keys from relationships and from contextual data"

I don't remember where I've read that statement but I find it elegant and quite coherent to what the goal of DV modelling is in comparison to other modelling approaches like Dimensional modelling.

Before explaining the end-result model, here's how it looks :

Fig 1. View using an organic layout of the overall DV model example
(blue, yellow and red correspond to Hubs, Links and Satellite respectively )

You can also refer to the original PNG-file for all details.

The diagram has been formatted organically to better visualize the role of each entity type. Satellites are surrounding their Hubs/Links while Links tie these clusters together. Exception to that is the hanging cluster with missing Links. These corresponds to Hub_service and Hub_contract, and analysing additional business processes would be necessary to correct this situation.

Modification
My notes are often written for archival purposes without focusing too much on readability. As more people gets redirected here, it seems I should do a bit of effort in putting these long notes into smaller and more digest form.

So you can follow the next part here: Data Vault model: Mobile Telecom example part-2.

Martin

Monday, September 03, 2012

Understanding and Using SVD with large dataset

When confronted with large and complex dataset, very useful information can be obtained by applying some form of Matrix decomposition. One example is the Singular Value Decomposition (SVD) whose principles yielded the derivation of a number of very useful application in today's digitized world:

Lossly Data Compression algorithm based on SVD as a form of change of basis: JPEG image compression using FFT basis, or yet the Wavelet-based image compression (wavelet basis). A good ref: http://cklixx.people.wm.edu/teaching/m2999-3f.pdf
Information Retrieval: large document repository often make use of SVD to improve its searching capability (usually named LSI, or Latent Semantic Indexing)
Noise removal in large dataset (see this point later)
Collaborative Filtering: recommended system used by commercial website (books, movies...) uses SVD to derive information about purchase patterns of customers to give recommendation to other similar customer. Netflix contest probably gave the greatest boost to SVD popularity!
For Spatial data analysis. In fields such atmospheric ocean and meteorology, SVD is the “most widely-used multivariate statistical” tools (ref. http://iridl.ldeo.columbia.edu/dochelp/StatTutorial/SVD/)

In this post, I discuss the underlying theory of SVD, its various interpretation and properties. I conclude by mentioning briefly two usages in analytics or data mining field. For those like me needing a refresher on linear algebra essential for understanding SVD, I highly recommend the lectures given by Prof Strang and freely available through MIT Open courseware (video link). For those interested in its application in relation to Data mining, go check David Skillicorn's very insighful and practical book named "Understanding Complex Datasets".

The Math

Consider the matrix A of size nXp (rowXcol) having n objects (or instances) characterized by p measured attributes (or variables). [Note in Lin. Alg. we normally use nXm (Amxn) but in more applied field we usually have quite a lot more rows than attributes and it is custom to use n for indicating #of observations and p the variables (so that p << n)].

Now, SVD's goal is to find some some vectors v in Rp spanning the rowspace of A and all orthogonal to each other (i.e. v Є R(A)). These vectors v, when multiplied by the matrix A, results to a vector u in Rn which are all orthogonal to each other By convention, v’s are taken ortho-normal. Hence the vectors v form an orthogonal basis of the Rowspace of A (aka domain of A) while the vectors u form an orthogonal basis of the Colspace of A (aka range of A).

In the most general form, A is a real matrix, so we could find at most p orthogonal vector v's (dimension of the basis in Rp) as well as n orthogonal vector u's (dimension of the basis in Rn). Next diagram illustrates this general form :

However, not all of them are relevant as the matrix A may have a rank r lower than either p or n. The rank of a matrix rank(A) corresponds to the number of independent row vectors or column vectors (both are the same). Because there can only be r independent components, the SVD decomposition can only give r relevant SV's, so there will be (p-r) useless SV's all equal to 0. This reduced form can be represented as a vector form like:

A = ( u1 u2 ... ur | ur+1 .. un)nXn ( sv1 sv2 ... svr | 0r+1 0r+1 ... 0p )pXp (vt1 vt2 ... vtr | vtr+1 ... vtp )

Or similarly using the same schematic diagram, we can show this reduced form as:

Now in practice, we have deep Matrix (n>>p many more instances than attributes), so that r=p. The decomposition can then only yield p orthogonal v's leading to p relevant u's.

Conversely, in situation with much wider Matrix (n<<p, as with text mining with documents represent n instances and words being p attributes). These now give r = n, and only n vectors v's are relevant (the rest will be in the N(A) with 0 SV's).

Using Matrix form to express the decomposition gives the formula:

A V = U Σ

where Σ is a diagonal matrix containing factors which, in effect, normalizes all u’s. These factors called Singular Values (SV) are found in the diagonal entries of matrix Σ. Rearranging a bit, and noticing V-1 = Vt as V is composed of orthonormal vectors (the inverse of an orthogonal matrix is equal to its transpose), gives the following :

A = U Σ Vt

This is SVD decomposition where the original Matrix A is now factorized into U (holding n instance of objects) and weighted by SV (Σ being diagonal have all zeros entries except on diagonal), and the pXp matrix Vt (holding the m orthogonal components).

SVD re-expresses the original data set (our Matrix A of rank r) in compact and orthogonal form by using:

r orthonormal instance vector rows taken in rowspace of A, i.e. the vector v's (aka right singular vector)

r orthonormal attribute vector columns taken in columspace of A, i.e. the vector u's (aka the left singular vector)

In other word, SVD produces a new orthogonal expression form which takes into account both correlation of variables and correlation of instances. This re-expression has also a notion of importance (the SVs value) so that one can easily drop less important dimensions (smaller SVs) and end up working in a smaller, more compressed and more palatable data set form.

Let's focus on the first scenario as most typically found in data mining app. With the p vector v, we have the first r coming from the dimension of rowspace of A and the rest (p-r) would be taken from Nullspace of A (N(A)) with corresponding SV=0. However this is not needed in usual Data mining setting, as we are in front of a very long and narrow A, so r=p.

Consequently it’s nearly impossible to have perfectly correlated attributes (linearly dependent) so that not a single attribute could be replaced by a linear combination of the others. In other words, the very long column vectors contained in col of A are all independent and each attributes contribute to bringing new information (in practice we would obviously ignore an attribute that is purely a combination of the other).

Relationship with PCA

Working out a bit the equation by multiplying both side by At and its equivalent (U Σ Vt)t will give either the correlation Matrix of attributes (the dot product of long attributes vector) or the correlation Matrix of objects (the dot product of short objects vector) and will eliminate either U or V depending whether we multiply on the left:

At A = V2 Vt

and we multiply on the right:

A At = U Σ2 Ut

So U and V matrix are actually eigen vectors matrices of both correlation Matrix. Any symmetric matrix B produces Eigenvalues that are real, and Eigenvector matrix Q that are always orthogonal : B = Q ⋀ Qt with ⋀ being the Eigenvalues diagonal matrix. In our case, V corresponds to Eigenvectors of the correlation Matrix of attributes (At A) while U to Eigenvectors of the correlation Matrix of objects (A At).

This important point relates SVD with the PCA technique (Principal Component Analysis) often used to reduce dimensionality in data set.

Actually, this dimension reduction happens when taking the first v1 yielding the largest singular values or equivalently the longest vector u1 before its normalization. This is easily seen when we notice that each entries of u1 corresponds to the projected length value between the vectors ai (rows of A) and the vector v1 (simple dot-product). Hence v1 gives the new axis along which as much variance is preserved: v1 is the PCA-1! All others vs ranked by SV’s magnitude correspond to the other ordered PCA’s.

PCA considers the first few Eigenvectors of the covariance matrix (in order to work on normalized attribute) according to magnitude of Eigenvalues. When done this way PCA analyses along the attributes only (not along instances) yielding a decomposition with a smaller number of transformed attributes (orthogonal to each other) which preserve as much variance among original attributes. Although PCA can operate on correlation Matrix of objects (A At) it must be done in a different decomposition, as such SVD is more general as it analyses the dataset simultaneously in terms of inter-variables and inter-instances correlation!

Dimension Compression/Truncation (lowering the Rank!)

One of the goal when applying SVD is to get a more compact view of the data which is as close as possible to the original set. We know that choosing r relevant SV’s (rank of A) will result in a perfect decomposition where U Σ Vt matches exactly A. However, can can also rank SV’s by their magnitudes and decide to just pick the first k ones, so that the other r-k are considered much too small to represent any important data structure. So when we select and retain only some parts of the decomposition (dimensions), we actually produce a truncated version of A with a simplified data structure that concentrate the most variation.

Proceeding with this truncation can be done to discard what can be considered as nose, or else because the discarded parts are considered to come from processes we do not intend to keep. Refer to the Component interpretation below which is well suited to view the SVD as a way to decompose the data set into its underlying independent generation processes.

Hence, the SVD is usually performed to reduce the dimensions of the original dataset while providing the best reconstitution of the original matrix by a matrix with a lower rank.

The Interpretation

There are a number of ways one can view or interpret the SVD decomposition. These arise from the various way we can proceed with Matrices multiplication. It follows that according to how these multiplications are viewed, we can extract from SVD some insights and particular findings.

Given the general matrix multiplication form, where only k SV’s are considered out of r possibles dimensions. The associative rule of Matrix allows us to either multiply the S V Σ diagonal matrix first with the matrix U, or alternatively with the transpose of V (V’ in the diagram corresponds to Vt).

Now considering this general view, we can consider different interpretations.

The Geometric interpretation:

Considering each instance (row of A) as a vector in m-dimensional space, the decomposition will transform the original coordinates (the normal basis in Rm) into a new set of coordinates given in each row of Vt. Each instance is now expressed in this new set of coordinates as given by the entries in U.

The result is that the matrix Vt operates as a rotation on the original instance vectors while the matrix Σ operates as a stretch/shrinkage on the axes.

Illustrating this with a picture, we can simply imagine a unit sphere in the new transformed coordinate (the v’s) and its shape prior to the SVD transormation :(ref. http://www.aiaccess.net )

The Factor interpretation:

In this interpretation, the factor are held in the rows of V’, so there are k factors with each being weighted by the SV’s contained in Σ. This weighting gives level of importance of each factor based on SV’s difference in values. The U matrix list all instances of objects (i.e. the rows) with each instance showing specific level for each factors (i.e. the entries or attribute value in u representing the level).

The SVD can gives us some indication of the latent factors and how each instance show different level of expressiveness in these latent factors. This is often utilized in the social sciences where, data collected in survey, record a number of “superficial properties or actions” while the goal of the survey is to discover “deeper drivers or motivators”. It is expected that the Factor will reveal these drivers that cause the recorded superficial properties to be observed (this interpretation works best small sample with correlated attribute, see Skillicorn’s book for more details and the ranking wine example).

The component interpretation:

Matrix multiplication can also operate col by row piecewise to produce intermediate matrices Ai. This is called the outer product expansion and each Ai is a matrix of Rank 1. For a matrix A with rank=r, we can have at most r of these intermediate matrices, but we’ll choose to only keep the first k most important ones. Each correspond to a component.

Within this interpretation, we assume that the data generation producing A, is actually the result of the superposition of a number of processes, which only a few are considered of interest. The SVD will discover these underlying processes as independent components, although there may not be necessarily a true correspondence between these components and the exact underlying process. In practice, such correspondence can be found especially for the larger SV’s components.

Constraint / Assumption on data

Under linear algebra theory, SVD imposes to have real Eignevalues and orthogonal Eigenvectors. This is always satisfied for real symetric Matrix, such as these matrices AtA or AAt (ref. spectral decomposition prerequisite for Eigenvalues & vectors).

However, for data analysis purposes, there is additional constraint that each variables should follow approximately a Gaussian distribution. If we have a dataset which presents a set of different clusters or none-normal shape then, from the geometry interpretation, we can see that the new axis will not be so meaningful. The new axis will go along, again, the greatest variance, which in turn may be completely missing the set of clusters present in the dataset.

It is also custom to normalize the data before proceeding with its decomposition. Why? Because the SV magnitude will give an indication as to the importance of each SVs in explaining the variance structure of the dataset. This assumes we choose normalized vectors in V's and U's (vector of unit length), but also that we first normalize the attributes in A .

When attribute within the A dataset are not centered around their mean, the first singular vector v will just point to the longest direction toward the centroid of our data set. This will also affect all remaining singular vector which must be orthogonal to this first one. So to avoid this issue of wrong representation, we normalize all attributes so that their centroid gravitate around the origin.

Also, If we had some attributes with significant difference in range, then during numerical computation these attribute would have greater influence in the factorization process. This normalization can be done in numerous way:

Taking the z-score of each attribute values (subtracting the mean and dividing by their standard deviation) so that the majority of transformed values will be in the range of -1 to 1. However, this cannot be applied when attribute's distribution are far from Gaussian

Taking rank where new values take the rank order of their original values.

other more complex forms.

Note on Categorical attribute:

Similarly we must be careful when converting categorical value to number in order not to convey artificial ordering or spurious correlation when no natural ordering exist for the category. Ways to avoid that:

Map each category to two variable located in at equal distance along a circle, or yet three variables along a sphere (when number of categories is larger)

Similar but more complex Mapping into a corner of a generalized tetrahedron (ex. http://arxiv.org/pdf/0711.4452.pdf)

The Application in DM : Data reduction and explanatory

One popular application of SVD in data analysis is to reduce complexity and size of any dataset. The method will allow us to discover a smaller number of factors that can be related to the larger set of original attributes. This is a useful step first toward a better understanding of the inter-correlation structure between the attribute and observations.

Similarly, we can choose voluntarily to ignore some less important components, as these may be considered redundant and offer no important structure in our data analysis. Or yet, we may want to visualize our dataset in a 2 or 3-D view in the best possible way.

A word of caution: in data mining setting involving very large dataset, this method is not so useful as there will be many attributes partially correlated to each others in complex and subtle way. Hence, the discovered factors become hardly interpretable

The Application in DM : Noise reduction

Noise in dataset usually origin and combine from a number of multitude source, which results in adding Gaussian-like variability in our dataset, which has a net effect of increasing the apparent dimensionality of the dataset.

Let’s say we have a few variables which turn out to be highly correlated, adding noise to these observed variables will in fact obscure or the correlation pattern and making the dataset to appear at higher dimensionality than it really is.

The ordering nature of the SV’s value allow us to eliminate the later dimensions offering very little or no structure compared to the prior ones. All SV’s after a certain threshold can then be considered as noise, the challenge is to estimate this threshold (few methods exist, refer to Skillicorn book).

Martin