Monday, September 03, 2012

Understanding and Using SVD with large datasets

When confronted with a large and complex dataset, very useful information can be obtained by applying some form of matrix decomposition.  One example is the Singular Value Decomposition (SVD), whose principles underlie a number of very useful applications in today's digitized world:
  • Lossy data compression algorithms based on SVD as a form of change of basis: JPEG image compression (DCT basis), or wavelet-based image compression (wavelet basis).  A good ref: http://cklixx.people.wm.edu/teaching/m2999-3f.pdf
  • Information Retrieval: large document repositories often make use of SVD to improve their search capabilities (a technique usually called LSI, or Latent Semantic Indexing)
  • Noise removal in large datasets (discussed later in this post)
  • Collaborative Filtering: recommender systems used by commercial websites (books, movies...) use SVD to derive information about customers' purchase patterns and give recommendations to other, similar customers.  The Netflix contest probably gave the greatest boost to SVD's popularity!
  • Spatial data analysis: in fields such as atmospheric science, oceanography and meteorology, SVD is among the “most widely-used multivariate statistical” tools (ref. http://iridl.ldeo.columbia.edu/dochelp/StatTutorial/SVD/)


In this post, I discuss the underlying theory of SVD and its various interpretations and properties.  I conclude by briefly mentioning two usages in the analytics and data mining field.  For those who, like me, need a refresher on the linear algebra essentials for understanding SVD, I highly recommend the lectures given by Prof. Strang, freely available through MIT OpenCourseWare (video link).  For those interested in its application to data mining, go check David Skillicorn's very insightful and practical book, "Understanding Complex Datasets".


The Math

Consider the matrix A of size n×p (rows × columns), having n objects (or instances) characterized by p measured attributes (or variables).  [Note: in linear algebra we normally write the size as m×n (Amxn), but in more applied fields we usually have many more rows than attributes, and it is customary to use n for the number of observations and p for the variables (so that p << n).]
Now, SVD's goal is to find vectors v in Rp spanning the rowspace of A and all orthogonal to each other (i.e. v Є R(A)).  These vectors v, when multiplied by the matrix A, result in vectors u in Rn which are also all orthogonal to each other.  By convention, the v's are taken orthonormal.  Hence the vectors v form an orthogonal basis of the rowspace of A (aka the domain of A), while the vectors u form an orthogonal basis of the column space of A (aka the range of A).

In its most general form, A is a real matrix, so we can find at most p orthogonal vectors v (the dimension of the basis in Rp) as well as n orthogonal vectors u (the dimension of the basis in Rn).  The next diagram illustrates this general form:

However, not all of them are relevant, as the matrix A may have a rank r lower than either p or n.  The rank of a matrix, rank(A), corresponds to the number of independent row vectors or, equivalently, of independent column vectors (both counts are the same).  Because there can only be r independent components, the SVD decomposition can only give r relevant SV's, so there will be (p-r) useless SV's, all equal to 0.  This reduced form can be represented in vector form as:
A = ( u1 u2 ... ur | ur+1 ... un )n×n  ( sv1 sv2 ... svr | 0 ... 0 )n×p  ( vt1 vt2 ... vtr | vtr+1 ... vtp )p×p

Or similarly using the same schematic diagram, we can show this reduced form as:

Now, in practice we have a deep matrix (n >> p, many more instances than attributes), so that typically r = p.  The decomposition can then only yield p orthogonal v's, leading to p relevant u's.

Conversely, we may face a much wider matrix (n << p, as in text mining where documents are the n instances and words the p attributes).  This gives r = n, and only n vectors v are relevant (the rest lie in N(A) with 0 SV's).
Using Matrix form to express the decomposition gives the formula:

A V = U Σ
where Σ is a diagonal matrix containing factors which, in effect, normalize all the u's.  These factors, called Singular Values (SV), are found in the diagonal entries of the matrix Σ.  Rearranging a bit, and noticing that V⁻¹ = Vt since V is composed of orthonormal vectors (the inverse of an orthogonal matrix equals its transpose), gives the following:


A = U Σ Vt
This is the SVD decomposition, where the original matrix A is factorized into U (holding the n instances of objects), weighted by the SV's (Σ being diagonal, with all entries zero except on the diagonal), and the p×p matrix Vt (holding the p orthogonal components).
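As a quick sanity check, here is a minimal numpy sketch (the toy matrix and variable names are my own) that computes the decomposition and verifies A = U Σ Vt together with the orthonormality of U and V:

import numpy as np

# a tall data matrix: n = 6 instances, p = 3 attributes (in practice n >> p)
A = np.random.rand(6, 3)

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 6x6, s: the 3 SVs, Vt: 3x3

# rebuild the n x p diagonal matrix Sigma from the singular values
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))        # True: A = U Sigma Vt
print(np.allclose(U.T @ U, np.eye(6)))       # True: columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # True: rows of Vt are orthonormal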

SVD re-expresses the original data set (our Matrix A of rank r) in compact and orthogonal form by using:
  1. r orthonormal vectors spanning the rowspace of A, i.e. the vectors v (aka the right singular vectors)
  2. r orthonormal vectors spanning the column space of A, i.e. the vectors u (aka the left singular vectors)

In other words, SVD produces a new orthogonal expression of the data which takes into account both the correlation of variables and the correlation of instances.  This re-expression also carries a notion of importance (the SV values), so that one can easily drop the less important dimensions (smaller SVs) and end up working with a smaller, more compressed and more palatable form of the data set.
Let's focus on the first scenario, as it is the one most typically found in data mining applications.  Among the p vectors v, the first r come from the rowspace of A and the remaining (p-r) would be taken from the nullspace of A (N(A)), with corresponding SV = 0.  However this rarely matters in the usual data mining setting: we are in front of a very long and narrow A, so r = p.

Consequently it is nearly impossible to have perfectly correlated attributes (exact linear dependence), where a single attribute could be replaced by a linear combination of the others.  In other words, the very long column vectors of A are all independent and each attribute contributes new information (in practice we would obviously drop an attribute that is purely a combination of the others).

Relationship with PCA

Working out the equation a bit, by multiplying both sides by At (equivalently by (U Σ Vt)t), gives either the correlation matrix of attributes (the dot products of the long attribute vectors) or the correlation matrix of objects (the dot products of the short object vectors), eliminating either U or V depending on whether we multiply on the left:

At A = V Σ² Vt

or on the right:

A At = U Σ² Ut

So U and V are actually the eigenvector matrices of these two correlation matrices.  Any symmetric matrix B has real eigenvalues and an eigenvector matrix Q that is orthogonal: B = Q Λ Qt, with Λ being the diagonal matrix of eigenvalues.  In our case, V corresponds to the eigenvectors of the correlation matrix of attributes (At A), while U corresponds to the eigenvectors of the correlation matrix of objects (A At).
This important point relates SVD with the PCA technique (Principal Component Analysis) often used to reduce dimensionality in data set.  
Actually, this dimension reduction happens when we take the first vector v1, the one yielding the largest singular value, or equivalently the longest vector u1 before normalization.  This is easily seen by noticing that each entry of u1 corresponds to the projected length of a vector ai (a row of A) onto the vector v1 (a simple dot product).  Hence v1 gives the new axis along which as much variance as possible is preserved: v1 is the first principal component!  All other v's, ranked by SV magnitude, correspond to the remaining ordered principal components.
PCA keeps the first few eigenvectors of the covariance matrix (in order to work on normalized attributes), ordered by the magnitude of their eigenvalues.  Done this way, PCA analyses along the attributes only (not along the instances), yielding a decomposition into a smaller number of transformed attributes (orthogonal to each other) which preserve as much of the variance among the original attributes as possible.  Although PCA can also operate on the correlation matrix of objects (A At), this requires a separate decomposition; in that sense SVD is more general, as it analyses the dataset simultaneously in terms of inter-variable and inter-instance correlation!
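To see this relationship numerically, here is a small sketch (my own toy data, centered column-wise) comparing the right singular vectors of A with the eigenvectors of At A:

import numpy as np

np.random.seed(0)
A = np.random.rand(100, 4)
A = A - A.mean(axis=0)                    # center each attribute

U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)          # eigen-decomposition of At A
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # eigh sorts ascending; reverse to match the SVD order

print(np.allclose(s**2, eigvals))                   # squared SVs = eigenvalues of At A
print(np.allclose(np.abs(Vt.T), np.abs(eigvecs)))   # V matches the eigenvectors up to sign flips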


Dimension Compression/Truncation (lowering the Rank!)
One of the goals when applying SVD is to get a more compact view of the data which is as close as possible to the original set.  We know that keeping all r relevant SV's (the rank of A) results in a perfect decomposition, where U Σ Vt matches A exactly.  However, we can also rank the SV's by their magnitudes and decide to keep only the first k, considering the remaining r-k much too small to represent any important data structure.  When we select and retain only some parts of the decomposition (dimensions), we actually produce a truncated version of A with a simplified data structure that concentrates most of the variation.
Proceeding with this truncation can be done to discard what can be considered noise, or because the discarded parts are considered to come from processes we do not intend to keep.  Refer to the component interpretation below, which is well suited to viewing the SVD as a way to decompose the data set into its underlying independent generating processes.
Hence, the SVD is usually performed to reduce the dimensions of the original dataset while providing the best reconstruction of the original matrix by a matrix of lower rank.
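A minimal sketch of such a truncation with numpy (k and the toy matrix are arbitrary choices of mine):

import numpy as np

np.random.seed(1)
A = np.random.rand(50, 10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3                                           # keep only the k largest SVs
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of A

# relative reconstruction error (Frobenius norm); it shrinks as k grows
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))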

The Interpretation

There are a number of ways one can view or interpret the SVD decomposition.  These arise from the various ways we can carry out the matrix multiplications.  Depending on how these multiplications are viewed, we can extract different insights and findings from the SVD.

Consider the general matrix multiplication form, where only k SV's are kept out of the r possible dimensions.  The associative rule of matrix multiplication allows us to multiply the diagonal matrix Σ first with the matrix U, or alternatively with the transpose of V (V' in the diagram corresponds to Vt).

Considering this general view, several interpretations are possible.
The Geometric interpretation:
Considering each instance (row of A) as a vector in p-dimensional space, the decomposition transforms the original coordinates (the standard basis in Rp) into a new set of coordinates given by the rows of Vt.  Each instance is now expressed in this new coordinate system, with its coordinates given by the entries of U.

The result is that the matrix Vt operates as a rotation on the original instance vectors, while the matrix Σ operates as a stretch/shrinkage along the new axes.


Illustrating this with a picture, we can simply imagine a unit sphere in the new transformed coordinates (the v's) and its shape prior to the SVD transformation (ref. http://www.aiaccess.net).

The Factor interpretation:
In this interpretation, the factors are held in the rows of Vt, so there are k factors, each weighted by the SV's contained in Σ.  This weighting gives the level of importance of each factor, based on the differences in SV values.  The U matrix lists all instances of objects (i.e. the rows), with each instance showing a specific level for each factor (i.e. the entries of u representing those levels).

The SVD can give us some indication of the latent factors and of how each instance shows a different level of expression of these latent factors.  This is often utilized in the social sciences where data collected in surveys record a number of “superficial properties or actions”, while the goal of the survey is to discover “deeper drivers or motivators”.  It is expected that the factors will reveal the drivers that cause the recorded superficial properties to be observed (this interpretation works best on small samples with correlated attributes; see Skillicorn's book for more details and the wine-ranking example).
The component interpretation:
Matrix multiplication can also operate piecewise, column by row, to produce intermediate matrices Ai.  This is called the outer product expansion, and each Ai is a matrix of rank 1.  For a matrix A with rank r, we can have at most r of these intermediate matrices, but we'll choose to keep only the first k most important ones.  Each corresponds to a component.

Within this interpretation, we assume that the data generation producing A is actually the result of the superposition of a number of processes, of which only a few are considered of interest.  The SVD will discover these underlying processes as independent components, although there may not necessarily be a true correspondence between these components and the exact underlying processes.  In practice, such a correspondence can be found especially for the components with the larger SV's.
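A short numpy sketch of this outer product expansion (toy matrix of my own), showing that the rank-1 components Ai superpose back to A:

import numpy as np

np.random.seed(2)
A = np.random.rand(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# each component A_i = sv_i * u_i * v_i^t is a rank-1 matrix
components = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]

print(np.allclose(A, sum(components)))        # True: the components sum back to A
print(np.linalg.matrix_rank(components[0]))   # 1: each component has rank 1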

Constraints / Assumptions on the data

Under linear algebra theory, SVD requires real eigenvalues and orthogonal eigenvectors.  This is always satisfied for real symmetric matrices, such as AtA or AAt (ref. the spectral decomposition prerequisites for eigenvalues & eigenvectors).
However, for data analysis purposes, there is an additional assumption that each variable should follow an approximately Gaussian distribution.  If we have a dataset which presents a set of distinct clusters or a non-normal shape then, from the geometric interpretation, we can see that the new axes will not be very meaningful: each new axis will again go along the direction of greatest variance, which may completely miss the set of clusters present in the dataset.
It is also customary to normalize the data before proceeding with the decomposition.  Why?  Because the SV magnitudes are meant to indicate the importance of each dimension in explaining the variance structure of the dataset.  This assumes we choose normalized vectors in V and U (vectors of unit length), but also that we first normalize the attributes in A.
When the attributes in A are not centered around their means, the first singular vector v will simply point in the direction of the centroid of the data set.  This also affects all remaining singular vectors, which must be orthogonal to this first one.  To avoid this misleading representation, we center all attributes so that the centroid of the data sits at the origin.
Also, if some attributes have significantly different ranges, then during numerical computation these attributes would have greater influence on the factorization process.  This normalization can be done in numerous ways:
    • Taking the z-score of each attribute value (subtracting the mean and dividing by the standard deviation) so that the majority of transformed values fall in the range -1 to 1 (a short sketch of this option appears after this list).  However, this is less appropriate when an attribute's distribution is far from Gaussian
    • Taking ranks, where the new values are the rank order of the original values.
    • other more complex forms.
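A minimal sketch of the z-score option mentioned above (standardizing each column before the SVD; numpy is assumed and the helper name is my own):

import numpy as np

def zscore(A):
    # center each attribute (column) and scale it by its standard deviation
    return (A - A.mean(axis=0)) / A.std(axis=0)

np.random.seed(3)
A = np.random.rand(100, 4) * [1, 10, 100, 1000]   # attributes with very different ranges
s = np.linalg.svd(zscore(A), compute_uv=False)
print(s)   # the SVs now reflect correlation structure rather than raw attribute scale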
Note on categorical attributes:

Similarly, we must be careful when converting categorical values to numbers, in order not to convey an artificial ordering or spurious correlations when no natural ordering exists for the categories.  Ways to avoid that:
    • Map each category to two variables located at equal distances along a circle, or to three variables along a sphere (when the number of categories is larger); see the sketch after this list
    • A similar but more complex mapping onto the corners of a generalized tetrahedron (ex. http://arxiv.org/pdf/0711.4452.pdf)
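One possible sketch of the circle mapping (my own illustration: each category goes to an equally spaced point on the unit circle, creating two new variables and no artificial ordering; with three categories the points are exactly equidistant, beyond that the sphere/tetrahedron mappings above are preferable):

import numpy as np

def circle_encode(values):
    # map each category to a point (cos, sin) on the unit circle
    categories = sorted(set(values))
    angles = {c: 2 * np.pi * i / len(categories) for i, c in enumerate(categories)}
    return np.array([(np.cos(angles[v]), np.sin(angles[v])) for v in values])

print(circle_encode(["red", "green", "blue", "green"]))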
The Application in DM : Data reduction and exploration
One popular application of SVD in data analysis is to reduce the complexity and size of a dataset.  The method allows us to discover a smaller number of factors that can be related to the larger set of original attributes.  This is a useful first step toward a better understanding of the inter-correlation structure between the attributes and the observations.

Similarly, we can voluntarily choose to ignore some of the less important components, as these may be considered redundant and offer no important structure for our analysis.  Or we may want to visualize our dataset in a 2- or 3-D view in the best possible way.
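A short sketch of such a 2-D view (my own toy data; the plotting part assumes matplotlib is available): project the instances onto the first two singular directions and scatter-plot them.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(4)
A = np.random.rand(200, 6)
A = A - A.mean(axis=0)                    # center first

U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = U[:, :2] * s[:2]                 # coordinates of each instance on the top 2 axes

plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.xlabel("first singular direction")
plt.ylabel("second singular direction")
plt.show()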
A word of caution: in data mining settings involving very large datasets, this method is not so useful, as there will be many attributes partially correlated to each other in complex and subtle ways.  Hence, the discovered factors become hard to interpret.

The Application in DM : Noise reduction

Noise in a dataset usually originates from a combination of many sources, which adds Gaussian-like variability to the data and has the net effect of increasing its apparent dimensionality.

Let's say we have a few variables which turn out to be highly correlated; adding noise to these observed variables will in fact obscure the correlation pattern and make the dataset appear to have a higher dimensionality than it really has.

The ordered nature of the SV values allows us to eliminate the later dimensions, which offer very little or no structure compared to the earlier ones.  All SV's after a certain threshold can then be considered noise; the challenge is to estimate this threshold (a few methods exist, refer to Skillicorn's book).
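A small sketch of the idea (synthetic data of my own): a rank-2 structure buried in Gaussian noise still shows two dominant SVs, followed by a tail of small, noise-driven SVs that a threshold would cut off.

import numpy as np

np.random.seed(5)
n, p, true_rank = 500, 10, 2
latent = np.random.rand(n, true_rank) @ np.random.rand(true_rank, p)   # rank-2 structure
noisy = latent + 0.05 * np.random.randn(n, p)                          # add Gaussian noise

s_clean = np.linalg.svd(latent, compute_uv=False)
s_noisy = np.linalg.svd(noisy, compute_uv=False)
print(np.round(s_clean, 2))   # only 2 SVs clearly above zero
print(np.round(s_noisy, 2))   # 2 dominant SVs, then a tail of small noise-driven SVs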

Martin

Sunday, March 11, 2012

Computing vs mathematical derivation


The impact of computing on various parts of our lives is obvious.  The impacts originating from data-analytical applications are probably less directly visible, but surely no less important.  This is especially true in recent years, where hardware makes it possible to acquire huge data volumes, process them with complex algorithms (machine learning), and find patterns or solve complex non-deterministic problems.  It can literally change how research is conducted, a sort of shift in the way we do science and solve analytical problems.  This has been coined with terms like the Data Science revolution and is described in various interesting publications like the 4th Paradigm book and introductions to Data Science.

We have also been able to run simulations of random processes for a long time, but things got even simpler with the advent of interpreted programming languages like Python.  In this post, I try to illustrate how to leverage simple programming to solve probability-based problems.  I was inspired to do this after watching the "Statistics without the agonizing pain" video.  In this video John Rauser compares the "painful" traditional way of doing statistical theory with the computational way, using simulation and a permutation test.  Both ways reach the same conclusion to a statistical hypothesis test, but arguably one is much simpler... this video is quite entertaining and offers a very compelling case for learning Python!

With the power of computing coupled with some technical knowledge, anyone can simulate a complex problem involving randomness (stochastic processes) without necessarily mastering the full probability and statistics theory.

Let's re-use some of the quizzes that you can find by simply browsing Google Books.  From these quizzes, I found examples where solutions are more easily, less easily, or just as easily obtained with computing versus exact mathematical derivation.  Of course, this is quite subjective and depends on one's ability in each.

1. Derivation equally easy

Let's look at problem 1.2:

1.2 A Tournament Problem
Ten players participate in the first round of a tennis tournament: 2 females and 8 males. Five single matches are fixed at random by successively drawing, without replacement, the names of all 10 players from an urn: the player drawn first plays against the one whose name comes up second, the third against the fourth, etc.

a. What is the probability that there will not be a single match involving two female players? Is this probability smaller, equal to, or larger than the corresponding probability with 20 females and 80 males? 


b. Try to answer the general case in which there are 2n players, of whom k (2 ≤ k ≤ n) are female. What is the probability p(k,n) that among the n matches there will not be a single one involving two female players?

Using Python, we see how easily we can "generate" an answer:

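The code embedded in the original post is not reproduced here; a minimal simulation along these lines (the function name is mine) produces comparable estimates:

import random

def prob_no_all_female_match(n_females, n_males, trials=100000):
    # estimate the probability that no match pairs two females when the
    # players are randomly shuffled and paired (1st vs 2nd, 3rd vs 4th, ...)
    players = ["F"] * n_females + ["M"] * n_males
    no_ff = 0
    for _ in range(trials):
        random.shuffle(players)
        if not any(players[i] == players[i + 1] == "F"
                   for i in range(0, len(players), 2)):
            no_ff += 1
    return no_ff / trials

print("Probability estimated from 100000 trials (group of 10) = %f"
      % prob_no_all_female_match(2, 8))
print("Probability estimated from 100000 trials (group of 100)= %f"
      % prob_no_all_female_match(20, 80))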

This returns:
Probability estimated from 100000 trials (group of 10) = 0.888890
Probability estimated from 100000 trials (group of 100)= 0.091830

Using 100K trials, the answers will vary a bit (especially for the group of 100, which has a much larger input space), so we should probably capture this variation by estimating the standard deviation over many repeated experiments.  But the answer is close enough (as confirmed by the exact solution below) ... and available within seconds of processing!

The mathematical solution for the specific case involves calculating the number of ways one can draw 2 females for a match (F_2) versus the total number of draws possible (T).  The first number is obtained by observing that there are 5 possible places where (f1,f2) can be drawn (positions 1-2, 3-4, ..., 9-10), multiplied by 2 (to account for the reverse order (f2,f1)); for each of these matches there are 8! possible permutations of the 8 remaining players.

F_2 = 2 * 5 * 8!

The total possibilities space T is simply the number of ways we can order 10 players (without replacement), i.e. permutations of 10 players:

T = 10!

Dividing F_2 by T gives the probability of having an all-female match, so the probability of not having a single one is:

Prob(not having all-female) = 1 - F_2/T =  0.888...

Now, to derive the generic formula, we need to adopt a different strategy, since we no longer have a single possibility for an all-female match.  With k females, the possible permutations of the remaining players will also include other all-female matches (double-counting).

Instead, let's first calculate the overall set of possibilities, i.e. the denominator of our probability.  This is obtained by counting the number of ways (with order) we can draw 2n players without replacement, which is the factorial of 2n, i.e. (2n)!.  This is quite a big input space, typical of permutation sets, and the challenge now is to find all the successful possibilities out of these (2n)!, i.e. the numerator.  We need to be careful not to forget any permutations, since order matters!

Successful permutations are the ones having at most one female player per match.  Let's consider the n matches:
  • how many ways can we organise the k matches containing one female out of the n matches:
    • this is the combination:  nCk
  • for each such way, the k female players can be assigned in different orders:
    • this is the factorial:  k!
  • for each way, the remaining male players can also take different orders:
    • this is the factorial:  (2n-k)!
  • finally, in each match containing a female, she may have been picked first or second, so 2 possible orderings for each of the k such matches:
    • this is 2 × 2 × ... × 2, k times:  2^k
Hence our probability of having no all-female match is the number of successful permutations (the product of the factors above) divided by the total number of permutations:

p(n,k) = nCk * k! * (2n-k)! * 2^k / (2n)!

We could do some algebra to clean this up, but it is much easier to plug the formula into Python and quickly find our exact solutions:

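A sketch of that plug-in (math.comb requires Python 3.8+; the function name is mine):

from math import comb, factorial

def p_no_all_female(n, k):
    # exact probability that none of the n matches pairs two females,
    # given 2n players of whom k are female
    return comb(n, k) * factorial(k) * factorial(2 * n - k) * 2 ** k / factorial(2 * n)

print(p_no_all_female(5, 2))     # ~0.8889
print(p_no_all_female(50, 20))   # ~0.0922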
p(5,2) = 0.8888..
p(50,20) = 0.0922

We see that for 100 players with 20 females, as guessed earlier, the computational estimate is a bit off, and more trials would be needed to improve it.


3. Derivation more easily done with math


Consider the next problem, where a simple statistical concept allows a derivation in less time.  Here my bet is that the computing time would be too costly, not the programming time.

1.6 Sample Size vs Signal Strength
An urn contains six balls — three red and three blue. One of these balls — let us call it ball A — is selected at random and permanently removed from the urn without the color of this ball being shown to an observer. This observer may now draw successively — at random and with replacement — a number of individual balls (one at a time) from among the five remaining balls, so as to form a noisy impression about the ratio of red vs. blue balls that remained in the urn after A was removed.   

Peter may draw a ball six times, and each time the ball he draws turns out to be red. Paula may draw a ball 600 times; 303 times she draws a red ball, and 297 times a blue ball. Clearly, both will tend to predict that ball A was probably blue. Which of them — if either — has the stronger empirical evidence for his/her prediction? 


The problem with this simulation is that Paula's experiment has an incredibly large set of possible outcomes.  The combinatorial count of ways one can draw red/blue balls in a 600-draw trial is mind-boggling: 2^600... quite a large input space, even for today's computers!

We wish to know how much more likely the pattern of 303 red / 297 blue is when we remove a red ball versus a blue ball.  To do that, we would have to simulate a pretty big sample so that we observe enough of these two patterns to calculate their frequency ratio, and repeat over many trials to increase our confidence in the ratio!

Although I've learnt from experience never to try to predict computing performance beforehand, it seems this problem could take forever to compute...  There is probably some clever way to implement it efficiently, but I'll leave that to others... these are my notes!  Besides, the math here is quite digestible:

First let's simplify the notation by defining:

Ab = a blue ball was removed initially, i.e.:   P(A="blue") = P(Ab) = 0.5
Ar  = similar to Ab but for the red ball, P(A="red") = P(Ar) = 0.5

E-peter = the event of Peter observing 6 consecutive red balls during a trial.
E-paula= the event of Paula observing 303 red balls out of 600 drawn (297 blue) during a trial.

Now to answer the question, we simply need to find which one has a larger conditional probability that blue ball was removed given each observed event, i.e. :

P(Ab / E-peter) vs P(Ab / E-paula)

From Bayes' theorem, this is calculated from the probability that both events occur (the joint probability) divided by the marginal probability of observing the pattern E-peter:

P(Ab / E-peter)  =  P(E-peter / Ab) * P(Ab)  /  P(E-peter)

The marginal probability of observing the pattern, P(E-peter), is obtained by averaging over the two ways E-peter can be observed (either a red ball was removed or a blue ball was removed):

P(E-peter) = P(E-peter / Ab) * P(Ab) + P(E-peter / Ar) * P(Ar)
                  =  0.5 P(E-peter / Ab) + 0.5 P(E-peter / Ar)

and similarly for E-paula.

Now, the probability of Peter observing his trial pattern given that a blue (or red) ball was removed is simple enough, as each draw is independent:

P(E-peter / Ab) = (3/5)^6
P(E-peter / Ar) = (2/5)^6

The same calculation for E-paula is combinatorial (the product of the probabilities of one such sequence combined with the number of ways it can be produced):
P(E-paula / Ab) =  600C303 * (3/5)^303 * (2/5)^297
P(E-paula / Ar) =  600C303 * (2/5)^303 * (3/5)^297

This mixes extremely large and small values... Python offers a good alternative to a calculator with its support for arbitrarily large (read: unbounded) integer values.
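A sketch of that computation using exact fractions (math.comb requires Python 3.8+; the function name is mine):

from fractions import Fraction
from math import comb

def posterior_blue(n_red, n_blue):
    # P(blue ball was removed | n_red red and n_blue blue draws observed),
    # with equal priors of 1/2 and 5 balls left in the urn after the removal
    n_draws = n_red + n_blue
    like_blue = comb(n_draws, n_red) * Fraction(3, 5) ** n_red * Fraction(2, 5) ** n_blue
    like_red = comb(n_draws, n_red) * Fraction(2, 5) ** n_red * Fraction(3, 5) ** n_blue
    return like_blue / (like_blue + like_red)

print(float(posterior_blue(6, 0)))       # Peter: ~0.9193
print(float(posterior_blue(303, 297)))   # Paula: ~0.9193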

After plugging in these formulas and computing each conditional probability, it turns out that both have equal posterior probability:

P(Ab / E-peter) = 0.9193
P(Ab / E-paula) = 0.9193

Although P(E-peter) is, by far, much higher than P(E-paula) (0.025 vs 0.000000278), this does not impact the result.  Once we know these events took place, only the relative probability of occurrence between the two possibilities (whether a red ball was removed versus a blue ball) matters in determining whose prediction carries more evidence.


Martin