Sunday, March 11, 2012

Computing vs mathematical derivation


The impact of computing on various parts of our lives is obvious.  Impacts originating from data-analytical applications are probably less directly visible but surely no less important.  This is especially true in recent years, where hardware allows the possibility of acquiring huge data volumes, processing them with complex algorithms (machine learning), and finding patterns or solving complex non-deterministic problems.  It can literally change how research is conducted, a sort of shift in the way we do science and solve analytical problems.   This has been coined with terms like the Data Science revolution and described in various interesting publications like the 4th paradigm book and introductions to Data Science.

We have also been able to run simulations of random processes for a long time, but things got even simpler with the advent of interpreted programming languages like Python.   In this post, I try to illustrate how to leverage simple programming to solve probability-based problems.    I was inspired to do this after watching the Statistics without the agonizing pain video.   In this video, John Rauser compares the "painful" traditional way of doing statistical theory with the computational way, using simulation and a permutation test.   Both ways reach the same conclusion for a statistical hypothesis test, but arguably one is much simpler... this video is quite entertaining and offers a very compelling case for learning Python!

With the power of computing coupled with some technical knowledge, anyone can simulate a complex problem involving randomness (stochastic processes) without necessarily mastering the full probability and statistical theory.

Let's re-use some of the quizzes that you can find by simply browsing Google Books.   From these quizzes, I found examples where solutions are more easily, less easily, or just as easily obtained with computing versus exact mathematical derivation.  Of course, this is quite subjective and depends on one's ability in mastering each.

1. Derivation equally easy

Let's look at problem 1.2:

1.2 A Tournament Problem
Ten players participate in the first round of a tennis tournament: 2 females and 8 males. Five single matches are fixed at random by successively drawing, without replacement, the names of all 10 players from an urn: the player drawn first plays against the one whose name comes up second, the third against the fourth, etc.

a. What is the probability that there will not be a single match involving two female players? Is this probability smaller, equal to, or larger than the corresponding probability with 20 females and 80 males? 


b. Try to answer the general case in which there are 2n players, of whom k (2 ≤ k ≤ n) are female. What is the probability p(k,n) that among the n matches there will not be a single one involving two female players?

Using Python, we see how easy we can "generate" an answer:
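
A minimal sketch of such a simulation might look like this (assuming Python 3; the helper name no_female_match is just illustrative):

import random

def no_female_match(n_players, n_females, trials=100000):
    """Estimate the probability that no match pairs two female players."""
    players = ['F'] * n_females + ['M'] * (n_players - n_females)
    success = 0
    for _ in range(trials):
        random.shuffle(players)
        # matches are the consecutive pairs (0,1), (2,3), ..., (2n-2, 2n-1)
        if not any(players[i] == players[i + 1] == 'F'
                   for i in range(0, n_players, 2)):
            success += 1
    return success / trials

print("Probability estimated from 100000 trials (group of 10) = %f"
      % no_female_match(10, 2))
print("Probability estimated from 100000 trials (group of 100) = %f"
      % no_female_match(100, 20))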


This returns:
Probability estimated from 100000 trials (group of 10) = 0.888890
Probability estimated from 100000 trials (group of 100) = 0.091830

Using 100K trials, answers will vary a bit (especially for the group of 100, which has a much larger input space), so we should probably capture this variation by estimating the standard deviation over many repeated experiments of many trials.   But the answer is close enough (as confirmed by the exact solution below) ... and available within seconds of processing!
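
For instance, a rough way to quantify this run-to-run variation, re-using the no_female_match helper from the sketch above (the statistics module requires Python 3.4+):

import statistics

# repeat the whole experiment 30 times and look at the spread of the estimates
estimates = [no_female_match(100, 20, trials=10000) for _ in range(30)]
print("mean = %.5f, std dev = %.5f"
      % (statistics.mean(estimates), statistics.stdev(estimates)))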

The mathematical solution for the specific case involves calculating the number of ways one can draw 2 females for a match (F_2) versus the total number of draws possible (T).   The first number is obtained by observing that there are 5 possible places where (f1,f2) can be drawn (1-2, 3-4, ... 9-10), multiplied by 2 to account for the reverse order (f2,f1), and for each of these matches there are 8! possible permutations of the 8 remaining players.

F_2 = 2 * 5 * 8!

The total possibilities space T is simply the number of ways we can order 10 players (without replacement), i.e. permutations of 10 players:

T = 10!

Dividing F_2 by T gives the probability of having an all-female match, so the probability of not having a single such match is:

Prob(not having all-female) = 1 - F_2/T = 1 - (2 * 5 * 8!) / 10! = 1 - 1/9 = 0.888...
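
As a quick sanity check of this value in Python:

from math import factorial

# 1 - F_2/T for the 10-player, 2-female case
print(1 - 2 * 5 * factorial(8) / factorial(10))   # 0.8888...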

Now to derive the generic formula, we need to adopt a different strategy, since we no longer have a single possibility for the all-female match.   With k females, the possible permutations of the remaining players will also include other all-female matches (double-counting).

Instead, let's first calculate the overall set of possibilities, i.e. the denominator of our probability.  This is obtained by counting the number of ways (with order) we can draw 2n players without replacement, which is equal to the factorial of 2n, i.e. (2n)!.    This is quite a big input space, typical of permutation sets, and the challenge now is to find all success possibilities out of these (2n)!, i.e. the numerator.  We need to be careful not to forget any permutations, since order is important!

Successful permutations are the ones having at most one female player per match.    Let's consider the n matches:
  • how many ways can we choose which k of the n matches contain one female:
    • this is the binomial coefficient:  nCk
  • for each such choice, the k female players can be assigned to those matches in different orders:
    • this is the factorial : k!
  • for each way, the remaining 2n-k male players can also take different orders:
    • this is the factorial : (2n-k)!
  • finally, in each of the k female matches, the female may have been drawn in the first or the second position, so 2 possible orders per match:
    • this is 2 x 2 x 2 ... k times:  2^k
Hence our probability of having no all-female match is the number of successful permutations divided by the total permutation set:

p(n,k) = nCk * k! * (2n-k)! * 2^k / (2n)!

We could do some algebra to clean this up, but it is much easier to plug the formula into Python and quickly find our exact solutions:
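
A possible plug-in of the formula, building the binomial coefficient from factorials (assuming Python 3):

from math import factorial

def p(n, k):
    """Exact probability of no all-female match among n matches with k females."""
    n_choose_k = factorial(n) // (factorial(k) * factorial(n - k))   # nCk
    return n_choose_k * factorial(k) * factorial(2*n - k) * 2**k / factorial(2*n)

print(p(5, 2))    # 0.8888...
print(p(50, 20))  # 0.0922...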

p(5,2) = 0.8888..
p(50,20) = 0.0922

We see that for 100 players with 20 females, as guessed earlier, the simulation estimate is a bit off, and running more trials would be needed to improve it.


3. Derivation more easily done with math


Consider the next problem, where a simple statistical concept allows a derivation in less time.  Here my bet is that computing time, not programming time, would be too costly.

1.6 Sample Size vs Signal Strength
An urn contains six balls — three red and three blue. One of these balls — let us call it ball A — is selected at random and permanently removed from the urn without the color of this ball being shown to an observer. This observer may now draw successively — at random and with replacement — a number of individual balls (one at a time) from among the five remaining balls, so as to form a noisy impression about the ratio of red vs. blue balls that remained in the urn after A was removed.   

Peter may draw a ball six times, and each time the ball he draws turns out to be red. Paula may draw a ball 600 times; 303 times she draws a red ball, and 297 times a blue ball. Clearly, both will tend to predict that ball A was probably blue. Which of them — if either — has the stronger empirical evidence for his/her prediction? 


The problem with this simulation is that Paula's experiment has an incredibly large set of possible outcomes.  The number of ways one can draw red/blue balls in a 600-draw trial is mind-boggling: 2^600... quite a large input space, even for today's computers!

We wish to know how much more likely the pattern 303 red / 297 blue is when we remove a red ball versus a blue ball.   To do that, we would have to simulate a pretty big sample so that we observe enough occurrences of this pattern under both scenarios to calculate their frequency ratio, and repeat with many trials to increase our confidence in the ratio!

Although I've learnt from experience never to try to predict computing performance beforehand, it seems this problem could take forever to compute...   There is probably some clever way to implement it efficiently, but I'll leave that to others... these are my notes!   Besides, the math here is quite digestible:

First let's simplify the notation by defining:

Ab = a blue ball was removed initially, i.e.:   P(A="blue") = P(Ab) = 0.5
Ar  = similar to Ab but for the red ball, P(A="red") = P(Ar) = 0.5

E-peter = the event of Peter observing 6 consecutive red balls during a trial.
E-paula= the event of Paula observing 303 red balls out of 600 drawn (297 blue) during a trial.

Now to answer the question, we simply need to find which observer has the larger conditional probability that a blue ball was removed, given his or her observed event, i.e.:

P(Ab / E-peter) vs P(Ab / E-paula)

From Bayes' theorem, this is calculated as the probability that both events occur (the joint probability) divided by the marginal probability of observing the pattern E-peter:

P(Ab / E-peter)  =  P(E-peter / Ab) * P(Ab)  /  P(E-peter)

The marginal probability of observing the pattern, P(E-peter), is obtained by averaging over the two ways E-peter can arise (either a red ball or a blue ball was removed initially):

P(E-peter) = P(E-peter / Ab) * P(Ab) + P(E-peter / Ar) * P(Ar)
                  =  0.5 P(E-peter / Ab) + 0.5 P(E-peter / Ar)

and similarly for E-paula.

Now the likelihood of Peter observing his trial pattern, given that a blue/red ball was removed, is simple enough, as each draw is independent:

P(E-peter / Ab) = (3/5)^6
P(E-peter / Ar) = (2/5)^6

The same likelihood for E-paula is a combinatorial calculation (the product of the probabilities of one such sequence multiplied by the number of ways it can be produced):
P(E-paula / Ab) =  600C303 * (3/5)^303 * (2/5)^297
P(E-paula / Ar) =  600C303 * (2/5)^303 * (3/5)^297

This mixes extremely large and small values... Python offers a good alternative to a calculator here, with its support for arbitrarily large (read: unbounded) integer values.
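
One possible way to plug these formulas in, using exact rationals from the standard fractions module so the huge exponents cause no overflow (the variable names are mine):

from fractions import Fraction
from math import factorial

def comb(n, k):
    # binomial coefficient nCk
    return factorial(n) // (factorial(k) * factorial(n - k))

# likelihoods of each observed pattern, given which ball was removed
p_peter_b = Fraction(3, 5) ** 6                                          # P(E-peter / Ab)
p_peter_r = Fraction(2, 5) ** 6                                          # P(E-peter / Ar)
p_paula_b = comb(600, 303) * Fraction(3, 5)**303 * Fraction(2, 5)**297   # P(E-paula / Ab)
p_paula_r = comb(600, 303) * Fraction(2, 5)**303 * Fraction(3, 5)**297   # P(E-paula / Ar)

# Bayes' theorem with equal priors P(Ab) = P(Ar) = 0.5 (the 0.5 factors cancel)
print(float(p_peter_b / (p_peter_b + p_peter_r)))   # P(Ab / E-peter) = 0.9193...
print(float(p_paula_b / (p_paula_b + p_paula_r)))   # P(Ab / E-paula) = 0.9193...

# marginal probabilities of each observed pattern
print(float((p_peter_b + p_peter_r) / 2))           # P(E-peter) ~ 0.025
print(float((p_paula_b + p_paula_r) / 2))           # P(E-paula) ~ 2.78e-07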

After plugging in these formulas and computing each conditional posterior probability, it turns out that both posteriors are equal:

P(Ab / E-peter) = 0.9193
P(Ab / E-paula) = 0.9193

Although P(E-peter) is by far much higher than P(E-paula) (0.025 vs 0.000000278), this does not impact the result.  Once we know these events took place, only the relative probability of occurrence between the two possibilities (whether a red ball was removed versus a blue ball) matters in determining whose prediction has more evidence.


Martin