Wednesday, November 10, 2010

EMBOSS


I made a stab at building EMBOSS. At first I failed, but in the end I think I succeeded.

According to the adminguide, it requires zlib and libpng as well as gd. Since EMBOSS is set up for Linux, I will probably have to edit the config file to show where these libraries are on my machine.

I installed zlib and libpng to get matplotlib up and running (according to Gavin Huttley's wiki instructions).

Let's get gd. Downloaded gd-2.0.35.tar.gz from this page. According to the readme, it's the usual:

./configure
make install

(using sudo). And it has a problem:

gdft.c:1403:35: error: fontconfig/fontconfig.h: No such file or directory
gdft.c:1466: error: expected '=', ',', ';', 'asm' or '__attribute__' before '*' token
gdft.c:1479: error: expected ')' before '*' token
gdft.c: In function 'font_pattern':
gdft.c:1516: error: 'FcPattern' undeclared (first use in this function)
gdft.c:1516: error: (Each undeclared identifier is reported only once
gdft.c:1516: error: for each function it appears in.)
gdft.c:1516: error: 'font' undeclared (first use in this function)
gdft.c:1517: error: 'FcChar8' undeclared (first use in this function)
gdft.c:1517: error: 'file' undeclared (first use in this function)
gdft.c:1518: error: 'pattern' undeclared (first use in this function)
gdft.c:1525: warning: assignment makes pointer from integer without a cast
gdft.c:1531: error: expected ')' before 'FcChar8'
gdft.c:1538: error: 'FC_FILE' undeclared (first use in this function)
gdft.c:1538: error: 'FcResultMatch' undeclared (first use in this function)
make[1]: *** [gdft.lo] Error 1
make: *** [install-recursive] Error 1

It seems to be related to a library called Fontconfig.

Get fontconfig-2.8.0.tar.bz2 from here. According to INSTALL do:

./configure --sysconfdir=/etc --prefix=/usr --mandir=/usr/share/man

but it fails because there is no configure script. Why not? From a web search it seems that
autoconf processes configure.in to produce a configure script

so I try this:

$ autoconf
configure.in:36: error: possibly undefined macro: AM_INIT_AUTOMAKE
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
configure.in:38: error: possibly undefined macro: AM_MAINTAINER_MODE
configure.in:59: error: possibly undefined macro: AM_CONFIG_HEADER
configure.in:64: error: possibly undefined macro: AC_LIBTOOL_WIN32_DLL
configure.in:65: error: possibly undefined macro: AM_PROG_LIBTOOL
configure.in:78: error: possibly undefined macro: AM_CONDITIONAL

and then do as before:

$ sudo ./configure --sysconfdir=/etc --prefix=/usr --mandir=/usr/share/man
./configure: line 1728: syntax error near unexpected token `fontconfig,'
./configure: line 1728: `AM_INIT_AUTOMAKE(fontconfig, 2.8.0)'

It looks like I'm supposed to use autoconf, but first I have to fix the above error.

After another web search, try adding this to configure.in

m4_pattern_allow([AM_INIT_AUTOMAKE])
m4_pattern_allow([AM_MAINTAINER_MODE])
m4_pattern_allow([AM_CONFIG_HEADER])
m4_pattern_allow([AC_LIBTOOL_WIN32_DLL])
m4_pattern_allow([AM_PROG_LIBTOOL])
m4_pattern_allow([AM_CONDITIONAL])

autoconf goes fine. Now:

$ sudo ./configure --sysconfdir=/etc --prefix=/usr --mandir=/usr/share/man

./configure: line 1736: syntax error near unexpected token `fontconfig,'
./configure: line 1736: `AM_INIT_AUTOMAKE(fontconfig, 2.8.0)'

And I'm stuck. The AM_* macros come from automake (pulled in via aclocal), so plain autoconf leaves them unexpanded and the generated configure chokes on them. No EMBOSS for me, it seems.

[UPDATE: I found instructions for installing GD here. It turns out I already had /usr/X11R6/include/fontconfig. The test of GD looks fine.

Followed simple instructions for EMBOSS and did:

./configure
make
make install

And after searching around for 10 minutes I finally ran:

/usr/local/share/EMBOSS/jemboss/runJemboss.sh

Now I have a window to play with (graphic at top of post)! ]

Tuesday, November 9, 2010

Normal distribution construction in Python


Here is an example with the normal distribution that will seem trivial after the t-distribution (here).

The basic form of the normal is exp(-x²/2). We define that as a Python function f(x), vectorize it, and construct an array X of discrete points from -10 to +10 with interval dx = 0.001. We apply the vectorized function to the array to get the relative densities. So that we obtain the correct area under the curve, we multiply the height (value of f(x)) of each piece by its width, dx. When we sum up all the pieces, the total is equal to √(2π), as seen in the printout. We divide by this value to normalize the distribution so that its total area is equal to 1 and it becomes a pdf.

The form of the normal that includes a term for the standard deviation is normal(x). Everything is as before, except we substitute z for x, and at the end we find that the normalizing constant is σ√(2π) (for σ = 2, about 5.013, as in the printout). We plot it to have something pretty to look at.

[UPDATE:
In my discussions of probability distributions of late I've played a little fast and loose with terminology. The pdf (probability density function) has a value for any x. At x = μ the exponential term is equal to 1, and the value is 1/(σ√(2π)). For the standard normal, this is about 0.4.

>>> 1.0/sqrt(2*pi)
0.3989422804014327

Of course, this is somewhat misleading, since the probability that x = any particular point approaches zero, because x is a real number with infinitely many values. The way we actually use the pdf is to ask what is the probability for a particular window, a range of values for a < x < b, and at least conceptually, this is done by integrating the function between these limits. That's where the discrete version that I developed comes in handy. We evaluate p(x) for a small enough interval and then multiply by the interval size to get a probability for that slice. To integrate between limits, just add the included slices. But for this to work, we have to "normalize" the pdf so that the total of all the little rectangles of width dx is equal to 1. It also helps to have the discrete version in developing the cdf (cumulative distribution function), since we can just accumulate the pdf as we move along the values of X generating the cdf as we go.

Technically, the pdf(x) is the slope of the cdf(x), which gets around the issue mentioned above for a continuous function. ]
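
As an illustration of that last point, here is a minimal sketch (my addition, not part of the original code) of integrating the discretized standard normal between two limits by adding up slices; the window a = -1, b = 1 is just an example and should come out near 0.68.

import numpy as np

dx = 0.001
X = np.arange(-10,10+dx,dx)
pdf = np.exp(-0.5*X**2)
pdf /= sum(pdf)            # normalize so the slices sum to 1

# P(a < x < b): add up the slices that fall inside the window
a,b = -1.0,1.0
mask = (X > a) & (X < b)
print round(sum(pdf[mask]),4)    # about 0.683 for +/- one standard deviation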

output:

2.507 2.507
5.013 5.013

code listing:

from __future__ import division
import numpy as np
import math
import matplotlib.pyplot as plt

@np.vectorize
def f(x):
    return math.e**(-0.5*(x**2))

@np.vectorize
def normal(x,mu=0,sigma=1):
    z = (x-mu)/sigma
    return math.e**(-0.5*(z**2))
#==================================
dx = 0.001
X = np.arange(-10,10+dx,dx)
pdf = f(X)
pdf *= dx
print round(sum(pdf),3),
S = math.sqrt(2*math.pi)
print round(S,3),
pdf /= S

sigma = 2
pdfn = normal(X,0,sigma)
pdfn *= dx
print round(sum(pdfn),3),
Sn = math.sqrt(2*math.pi) * sigma
print round(Sn,3),
pdfn /= Sn
#==================================
plt.plot(X,pdf,color='r',lw=4)
plt.plot(X,pdfn,color='k',lw=4)
ax = plt.axes()
ax.set_xlim(-6,6)
m = max(pdf)
ymin,ymax = -m/100,m*1.1
ax.set_ylim(ymin,ymax)
plt.text(3,0.75*ymax,s='$\sigma = 1$',
         color='r',fontsize=24)
plt.text(3,0.65*ymax,s='$\sigma = 2$',
         color='k',fontsize=24)
plt.savefig('example.png')

One more thing

One thing I forgot to do is test the two_sample_t function. Here we draw samples of different sizes (m=2, n=4) from distributions A and B (normal with the same mean and standard deviation). Since A and B have the same mean, the expected difference is 0. We record how frequently p < 0.05, and it is, as expected, about 5%. The largest t-statistic among the significant results is -2.133, which is also as expected: it matches the one-tailed 5% critical value of the t-distribution with 4 degrees of freedom.

output:

0.0491 -2.133
0.0516 -2.133
0.0524 -2.133
0.0506 -2.133
0.0502 -2.133


code listing:

from __future__ import division
import numpy as np
from t_test import two_sample_t

mu,sigma = 10,3
m,n = 2,4
alpha = 0.05

N = 10000
tmax = -N
for i in range(5):
    counter = 0
    for j in range(N):
        A = np.random.normal(mu,sigma,m)
        B = np.random.normal(mu,sigma,n)
        t,p = two_sample_t(A,B)
        if p < alpha:
            counter += 1
            if t > tmax:
                tmax = t
    print '%3.4f' % (counter/N),
    print round(tmax,3)

Student's t-test again 8


wikimedia

Top level: tails and errors

This is the eighth and last in a series of posts about Student's t test. Links to the first five are here, plus here and here.

If you're just looking for some code, it is in the sixth post.

What's left? Error checking is important. You should look at the PyCogent module test.py for examples of that. I want to focus today on "tails."

A central idea of the frequentist approach to statistics is that the population mean and standard deviation are fixed but of unknown value. Some consequences:

• we must pick the p-value we will consider significant in advance of seeing the data (in practice, nearly all biologists use 0.05, so that p < 0.05 is significant. The reason is that this is the largest p-value they think they can get away with!)

• we can specify, in advance, the direction in which we expect the difference to lie

Let's expand the second point. The one-sample t-test requires input of a sample and an expected mean. If we specify in advance of seeing the data that the expected mean is either lower or higher than the sample mean, we can use a "one-tailed test," where the result is significant if the tail probability obtained from the t-distribution is 0.05 or less. If we are uncertain or agnostic about the direction of the change, then the t-statistic is tested against 0.025 in each tail, so that the difference between the sample mean and the expected mean must be larger to be significant. In my examples, I used the first choice (because it was simpler to program), as can be seen from the R versions, which specified "alternative='less'."

So, to actually use my versions of the t-test, you will need top-level code to handle the tails.
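
For what it's worth, here is a minimal sketch of the kind of top-level code I mean (my own naming; it assumes one_sample_t returns the t-statistic and the lower-tail probability from stdtr, as in the code from the fifth and sixth posts further down this page):

import numpy as np
from t_test import one_sample_t

def one_sample_t_tailed(A,mu,alternative='less'):
    # one_sample_t returns (t, lower-tail probability)
    t,p_lower = one_sample_t(A,mu)
    if alternative == 'less':
        p = p_lower
    elif alternative == 'greater':
        p = 1 - p_lower
    else:                              # 'two-sided'
        p = 2 * min(p_lower, 1 - p_lower)
    return t,p

A = np.array([3.1,2.3,2.1,1.7])
print one_sample_t_tailed(A,mu=3,alternative='two-sided')

With alternative='two-sided' the p-value doubles, so the same data that just reach p < 0.05 one-tailed no longer do, which is the point made above.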

Another issue that looks simple but could be tricky to code is whether the t-statistic is positive or negative.

In the one-sample t-test, if the expected mean is larger than the sample mean, then t < 0, since we do np.mean(A) - mu. If we had specified that it would be smaller than the sample mean, then we're OK ("alternative='less'"). If we specified it to be larger, then we need to switch the sign of t. Similarly with the paired t-test.

The last point is that we've made the (unspoken) assumption for the two-sample t-test that the variances are the same for the two populations from which samples were drawn. We should explicitly state whether this is the case ("var.equal=TRUE" in R), and if it is not, then the test becomes more complicated and the values from the function as given here will not be correct.
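
For completeness, here is a rough sketch (mine, not PyCogent's, and not checked against R) of the unequal-variance version, usually called Welch's t-test, using the Welch-Satterthwaite approximation for the degrees of freedom. It reuses unbiased_var and stdtr from the listings below; note that the stdtr used in these posts takes an integer df, so the fractional Welch df is rounded here, which R does not do.

from __future__ import division
import numpy as np
from transcendental import stdtr
from utils import unbiased_var as var

def welch_t(A,B,expected_diff=0):
    na,nb = len(A),len(B)
    va,vb = var(A)/na, var(B)/nb
    t = (np.mean(A) - np.mean(B) - expected_diff) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb)**2 / (va**2/(na-1) + vb**2/(nb-1))
    return t, stdtr(int(round(df)),t)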

Monday, November 8, 2010

Student's t-test again 7

Interlude: calculating the sum of squares

This is the seventh in a series of posts about Student's t test. The last post has a list of all the previous posts here.

The PyCogent two-sample t-test uses an interesting method to calculate the sum of squares, sum((X - mean(X))**2). Here is my transcription and a comparison to a more usual approach:


from __future__ import division
import numpy as np
np.random.seed(153)

def pycogent_sumsq(X):
    return sum(X**2) - sum(X)**2 / len(X)

def my_sumsq(X):
    X_bar = np.mean(X)
    squares = (X - X_bar)**2
    return sum(squares)

for i in range(5):
    X = np.random.normal(0,2,5)
    print pycogent_sumsq(X), my_sumsq(X)

We do five trials and compare the results.

15.0864873521 15.0864873521
14.2528488226 14.2528488226
8.86248689793 8.86248689793
22.3695874949 22.3695874949
37.0783435439 37.0783435439

Clearly, it works. The question is, how? Using the code listed at the bottom of the post, we can examine the calculation step by step.

pycogent_sumsq
X 3.003 0.99 -2.319 5.699 -0.03
X**2 9.019 0.98 5.379 32.483 0.001
sum(X**2) 47.86
sum(X) 7.34
sum(X)**2 53.92
sum(X)**2/len(X) 10.78
sum(X**2) - sum(X)**2/len(X) 37.08
pycogent_sumsq(X) 37.08

my_sumsq
X - X_bar 1.535 -0.478 -3.788 4.231 -1.499
(X - X_bar)**2 2.355 0.229 14.348 17.9 2.247
sum((X - X_bar)**2) 37.08
my_sumsq(X) 37.08

But I am stuck trying to figure out why they are equivalent.

(1) sum(X**2)   =  x1² + x2² + ...
(2) sum(X)**2/n = (x1 + x2 + ... )² / n

m = mean(X) = sum(X) / n
            = (x1 + x2 + ... ) / n

(1) - (2) = (x1² + x2² + ... ) - m (x1 + x2 + ... )
          = x1² - m·x1 + x2² - m·x2 + ...
          = x1(x1 - m) + x2(x2 - m) + ...

I don't seem to be able to reach:

          = (x1 - m)² + (x2 - m)² + ...

Any ideas?

[UPDATE: I got an answer from Srikant Vadali here, and will discuss it sometime soon.]

Remainder of the code listing:


def show(s,f):
    print s.ljust(30),
    try:
        print str(round(f,2)).rjust(6)
    except TypeError:
        for n in f:
            print str(round(n,3)).rjust(6),
        print

print
print 'pycogent_sumsq'
show('X', X)
show('X**2', X**2)
show('sum(X**2)', sum(X**2))
show('sum(X)', sum(X))
show('sum(X)**2', sum(X)**2)
show('sum(X)**2/len(X)',sum(X)**2/len(X))
show('sum(X**2) - sum(X)**2/len(X)', sum(X**2) - sum(X)**2 / len(X))
show('pycogent_sumsq(X)',pycogent_sumsq(X))
print
print 'my_sumsq'
X_bar = np.mean(X)
show('X - X_bar', X - X_bar)
show('(X - X_bar)**2', (X - X_bar)**2)
show('sum((X - X_bar)**2)', sum((X - X_bar)**2))
show('my_sumsq(X)', my_sumsq(X))

Student's t-test again 6

Code

This is the sixth in a series of posts about Student's t test. The previous posts are here, here, here, here, and here.

In this post I'm going to talk about how I would test my home-grown functions to do the t-test in Python. I'm only going to do this for the one-sample test, because I'm running out of energy. But you could easily extend this to the other functions.

We could check using SciPy if it is installed, and that would be easy. But I'm going to use R, partly because it's the gold standard for statistics, and partly because it's a useful technique that I've used in my real work. While we could use RPy as I mentioned the other day, here I am going to use R to generate the sample data, run the test of interest and then write the results to disk.

What we're going to do is run R in batch mode. If it gives an error, which frequently happens when building such a test, we want to show the error. We also clean up extra files when we're done. The code to do this has been added to utils.py and is given below. It should be self-explanatory.

When we run R this way, we feed it a text file with the commands to run. This file is constructed by the function write_R_code. It's not pretty, but I hope you get the idea. Perhaps a better method, which I've also used, is to write the code to a text file, then load it into Python, substitute the file names and specific functions etc., then write that to disk as the file to give to R.
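
Here is a minimal sketch of that second approach, using string.Template from the standard library (the template file name and its contents are made up for the example):

from string import Template

# hypothetical template file, R_template.txt, containing lines like:
#   A = rnorm($n, $mu, $sigma)
#   result = t.test(A, alternative='less', mu=$mu_est)
# (a literal $ in R code, e.g. result$statistic, would be written as $$)
FH = open('R_template.txt','r')
template = Template(FH.read())
FH.close()

R_code = template.substitute(n=3, mu=10, sigma=2, mu_est=12)

FH = open('.RCode','w')
FH.write(R_code)
FH.close()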

In the last step, we load the results that R has given us and evaluate how well our functions do on the same data. As you can see, we match very well as a rule, but have rounding errors at the extremes of the t-distribution. I believe this could be fixed by extending the range over which we built the distribution (see here). I think it looks pretty good.

[UPDATE: Just to make it clear what we did: we do 1000 runs and look at the details for the first five, then test the rest for deviation from the R results by more than ε = 0.0001. ]

Output:

run
(1626, 0)
clean
.RCode removed
.RCode.Rout removed
.RData removed
.RHistory not found

test results:
N = 1000
R gave: t = -1.858651
I get: t = -1.85865011386

R gave: t = -4.250644
I get: t = -4.25063601226

R gave: t = -1.743078
I get: t = -1.74307933578

R gave: t = 0.1286087
I get: t = 0.128612751636

R gave: t = -9.36424
I get: t = -9.36429217504

error with:
[ 8.946167 8.941678 8.940817] -1843.026 -1843.23012185

error with:
[ 8.895525 8.686258 8.761813] -52.60937 -52.6095381366

error with:
[ 10.54477 10.42989 10.66126] -21.77912 -21.7796812829

error with:
[ 11.05575 10.89208 10.68675] -10.50883 -10.5090054298

error with:
[ 10.02618 10.07471 10.16116] -48.45138 -48.4527009573


R_test.py

import numpy as np
from t_test import one_sample_t
import utils

def write_R_code(D):
    L = [ 'fn <- file("' + D['results_fn'] + '","w")',
          'n = ' + str(D['n']),
          'mu = ' + str(D['mu']),
          'sigma = ' + str(D['sigma']),
          'for (i in 1:' + str(D['N']) + ') {',
          'A = rnorm(n,mu,sigma)',
          'result = ' + D['func'] + str(D['mu_est']) + ')',
          'cat(A,file = fn,sep = "\n")',
          'cat(result$statistic,file = fn,sep="\n")',
          'cat("\n",file = fn)',
          '}',
          'close(fn)'
          ]
    FH = open(D['code_fn'], 'w')
    FH.write('\n'.join(L))
    FH.close()

def compare_to_my_result(L):
    print 'test results:'
    print 'N = ', len(L)
    for i,e in enumerate(L):
        A = np.array(e[:-1])
        result = one_sample_t(A,mu=12)
        if i < 5:
            print 'R gave: t =', e[-1]
            print 'I get: t =', result[0]
            print
        else:
            # use abs() so deviations in either direction get flagged
            if abs(e[-1] - result[0]) > 0.0001:
                print 'error with:'
                print A, e[-1], result[0]
                print

if __name__ == '__main__':
    directory = '/Users/telliott_admin/Desktop/'
    D = { 'code_fn':directory + '.RCode',
          'results_fn':directory + 'results.txt',
          'func':'t.test(A,alternative="less",mu=',
          'n':3,'mu':10,'sigma':2,'mu_est':12,'N':1000 }

    write_R_code(D)
    utils.runR(D['code_fn'],v=True)
    utils.cleanup_R_files(D['code_fn'],directory,v=True)
    L = utils.load_data(D['results_fn'])
    compare_to_my_result(L)


utils.py

from __future__ import division
import os,subprocess
import numpy as np

def unbiased_var(X):
    n = len(X)
    sample_SS = sum(X**2) - sum(X)**2 / n
    return sample_SS / (n-1)

def unbiased_std(X):
    return np.sqrt(unbiased_var(X))

def runR(code_fn,v=False):
    if v: print 'run'
    cmd = 'R CMD BATCH ' + code_fn
    p = subprocess.Popen(cmd, shell=True)
    sts = os.waitpid(p.pid, 0)
    if v: print sts
    if sts[1] != 0: show_R_error(code_fn + '.Rout')

def show_R_error(fn):
    # print the last few lines of the R output file
    # (assumes R CMD BATCH left its .Rout next to the code file)
    FH = open(fn, 'r')
    msg = FH.read()
    FH.close()
    print '\n'.join(msg.split('\n')[-3:]).strip()

def cleanup_R_files(code_fn,directory,v=False):
    if v: print 'clean'
    L = ['.RCode.Rout','.RData', '.RHistory']
    L = [code_fn] + [directory + e for e in L]
    for fn in L:
        try:
            os.remove(fn)
            if v: print fn.split('/')[-1], 'removed'
        except OSError:
            if v: print fn.split('/')[-1], 'not found'
    print

def load_data(results_fn):
    def convert(s):
        L = s.split('\n')
        return [float(n) for n in L]
    FH = open(results_fn, 'r')
    data = FH.read().strip()
    FH.close()
    L = data.split('\n\n')
    rL = [convert(s) for s in L]
    return rL

Student's t-test again 5

Code

This is the fifth in a series of posts about Student's t test. The previous posts are here , here, here, and here.

Last time I showed a version of the one-sample t-test in Python. As we saw, the t-statistic is constructed similarly to a standard error (like the sem). It is something like a z-score, using the expected value for the population mean and the observed (unbiased) sample standard deviation, then adjusted by multiplying this "z-score" by √n. (The same function, slightly edited, is also given below).

The paired t-test is simply a one-sample t-test done on the differences between paired values, comparing them to the expected difference rather than the expected mean. The paired values arise, for example, in repeated tests of the same individuals. The idea is that variance due to differences between individuals is the same for each value in a single pair.

The two-sample t-test takes account of the expected difference in means:

diff = (np.mean(A) - np.mean(B) - expected_diff)

The code I copied (and massaged) takes the observed sample variance and turns it back into the sum of squares:

sum_sq = (var(A)*(na-1) + var(B)*(nb-1))

There's a super-duper factor which is constructed from the number of values in each sample (and the resulting degrees of freedom):

na = len(A)
nb = len(B)
df = na + nb - 2
f = (1/na + 1/nb)/df
t = diff/np.sqrt(sum_sq*f)
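
Written out in more standard notation, my reading of the code above is that this is the usual pooled (equal-variance) two-sample statistic:

t = (mean(A) - mean(B) - expected_diff) / (sp * sqrt(1/na + 1/nb))

where sp**2 = (SS_A + SS_B) / (na + nb - 2)

Here SS_A and SS_B are the sums of squared deviations for the two samples (which is what var(A)*(na-1) recovers), sp**2 is the pooled variance estimate, and sum_sq*f above is exactly sp**2 * (1/na + 1/nb).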

The t-statistic is as shown. I offer no justification for these procedures. (If you want more theory and don't like your book, start here: pdf). What we will do is test the output from these functions. Today we simply check that R gets the same t-statistic for one (or two) examples. Next time we will construct a test harness that does more. And after that we may do some more testing that shows these functions do give the correct distribution with simulated data.

I feel like I do understand the tests better after this project.

The functions for unbiased variance and standard deviation in utils.py were given last time. There's something a bit unusual about the variance calculation, which comes from PyCogent. I'll try to remember to deal with that when I talk about error checking, and "tails", in a future post. The functions here are just the basic t-statistic calculations themselves.

The source module for the t distribution:

from transcendental import stdtr

was explained here.

Here is the output if the module is run as "__main__":

t-statistic  p-value
-2.378 0.049
-6.455 0.004
-2.611 0.040
-2.574 0.021

R output using the code as given at the end of the listing (and as shown here):

> A = c(3.1,2.3,2.1,1.7)
> result = t.test(A, alternative='less',mu=3)
> result$statistic
t
-2.377782
> B = c(2.1,1.8,2.7,2.4)
> result=t.test(B, alternative='less',mu=3.5)
> result$statistic
t
-6.454972
> C = c(3.1,4.3,4.1,2.7)
> result=t.test(A,C,alternative='less',
+ paired=TRUE,var.equal=FALSE)
> result$statistic
t
-2.611165
> result=t.test(A,C,alternative='less',var.equal=TRUE)
> result$statistic
t
-2.573993
>

code listing:

from __future__ import division
import numpy as np
from transcendental import stdtr
from utils import unbiased_var as var
from utils import unbiased_std as std

def one_sample_t(A,mu):
    n = len(A)
    df = n-1
    z = (np.mean(A) - mu) / std(A)
    t = z * np.sqrt(n)
    return t, stdtr(df,t)

def paired_t(A,B,expected_diff=0):
    return one_sample_t(A - B,expected_diff)

def two_sample_t(A,B,expected_diff=0):
    diff = (np.mean(A) - np.mean(B) - expected_diff)
    na = len(A)
    nb = len(B)
    df = na + nb - 2
    sum_sq = (var(A)*(na-1) + var(B)*(nb-1))
    f = (1/na + 1/nb)/df
    t = diff/np.sqrt(sum_sq*f)
    return (t, stdtr(df,t))
#==========================================
def test(func,D):
    if func == one_sample_t:
        result = one_sample_t(D['X'],D['mu'])
    elif func == paired_t:
        result = paired_t(D['X'],D['Y'])
    elif func == two_sample_t:
        result = two_sample_t(D['X'],D['Y'])
    if D['verbose']:
        print '%5.3f %5.3f' % result
    return result

if __name__ == '__main__':
    print 't-statistic p-value'

    A = np.array([3.1,2.3,2.1,1.7])
    B = np.array([2.1,1.8,2.7,2.4])
    test(one_sample_t,{'X':A,'mu':3,'verbose':True})
    test(one_sample_t,{'X':B,'mu':3.5,'verbose':True})

    C = np.array([3.1,4.3,4.1,2.7])
    test(paired_t,{'X':A,'Y':C,'verbose':True})
    test(two_sample_t,{'X':A,'Y':C,'verbose':True})

'''
R code:
A = c(3.1,2.3,2.1,1.7)
result = t.test(A, alternative='less',mu=3)
result$statistic
B = c(2.1,1.8,2.7,2.4)
result=t.test(B, alternative='less',mu=3.5)
result$statistic
C = c(3.1,4.3,4.1,2.7)
result=t.test(A,C,alternative='less',
paired=TRUE,var.equal=FALSE)
result$statistic
result=t.test(A,C,alternative='less',var.equal=TRUE)
result$statistic
'''

Friday, November 5, 2010

Student's t-test again 4


One Sample t-test



This is the fourth in a series of posts about Student's t test. The previous posts are here , here and here.

There are several different versions of the t-test, but the simplest is the one sample test. We will have a single sample with a (relatively) small number of values. We calculate the mean and the variance and then the standard deviation of the sample values. Importantly, the variance and standard deviation are the unbiased versions, in which the sum of squares is divided by n-1 rather than n.

It's easy to see that we should not use numpy's var for this test:


>>> import numpy as np
>>> A = [1,2,3]
>>> np.var(A)
0.66666666666666663


As help(np.var) indicates:


    Notes
-----
The variance is the average of the squared deviations from the mean,
i.e., var = mean(abs(x - x.mean())**2). The computed variance is biased,
i.e., the mean is computed by dividing by the number of elements, N,
rather than by N-1.
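
For the record, numpy can also give the unbiased estimate directly through the ddof argument (I believe this is available in reasonably recent versions; check yours):

>>> np.var(A, ddof=1)
1.0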


In the code below we do:


z = np.mean(A) - mu
z /= unbiased_std(A)


So, at least conceptually, at this point we have a z-score for the sample mean. With a standard z-score, we would compare it to the normal distribution. Here we do two things differently: we multiply by √n (the square root of the sample size), and we find the value of the resulting t statistic in the t distribution.


t-statistic  p-value
-2.378 0.049


The result indicates that, given a prior choice of a limit of 0.05 (for this one-sided test), the null hypothesis that the population mean is equal to 3 is rejected (just barely).

You can read much more about the background of the test here.

The second set of values in the code below gives:


t-statistic  p-value
-6.455 0.004


It's reassuring that a One Sample t-test in R gives a similar result. I'll have more to say about "tails" and sidedness in another post.


> A = c(3.1,2.3,2.1,1.7)
> t.test(A, alternative='less',mu=3)

One Sample t-test

data: A
t = -2.3778, df = 3, p-value = 0.04891
alternative hypothesis: true mean is less than 3
95 percent confidence interval:
-Inf 2.992811
sample estimates:
mean of x
2.3

> B = c(2.1,1.8,2.7,2.4)
> t.test(B, alternative='less',mu=3.5)

One Sample t-test

data: B
t = -6.455, df = 3, p-value = 0.003771
alternative hypothesis: true mean is less than 3.5
95 percent confidence interval:
-Inf 2.705727
sample estimates:
mean of x
2.25

>


Obviously, we could use a lot more testing.

code listing:


from __future__ import division
import numpy as np
from transcendental import stdtr
from utils import unbiased_std

def one_sample_t(A,mu):
    n = len(A)
    df = n - 1
    z = np.mean(A) - mu
    z /= unbiased_std(A)
    t = z * np.sqrt(n)
    return t, stdtr(df,t)

def test(A,mu):
    result = one_sample_t(A,mu)
    print 't-statistic p-value'
    print '%5.3f %5.3f' % result

if __name__ == '__main__':
    A = np.array([3.1,2.3,2.1,1.7])
    B = np.array([2.1,1.8,2.7,2.4])
    test(A,mu=3)
    test(B,mu=3.5)

'''
R code:
A = c(3.1,2.3,2.1,1.7)
t.test(A, alternative='less',mu=3)
B = c(2.1,1.8,2.7,2.4)
t.test(B, alternative='less',mu=3.5)
'''


utils.py

from __future__ import division
import numpy as np

def unbiased_var(X):
    n = len(X)
    sample_SS = sum(X**2) - sum(X)**2 / n
    return sample_SS / (n-1)

def unbiased_std(X):
    return np.sqrt(unbiased_var(X))

Student's t-test again 3

This is the third in a series of posts about Student's t test. The previous posts are here and here.

While looking at the code in PyCogent I noticed a reference to "Cephes," the Cephes Mathematical Library (here).

It contains lots and lots of functions (like the t distribution and the incomplete beta and so on). The analogous functions in PyCogent appear to be just straight translations of the C code.

I found a Python module that makes the Cephes library available directly. It was put together by Michiel de Hoon (here, old home page here), when he was at the University of Tokyo. I've also looked at his clustering software before (here).

According to the last page, he's currently at Columbia University. Just two seconds ago, I saw this, some kind of manual he's written for doing statistics in Python. Whoaah! Looks like it should be very interesting!

Anyhow, I downloaded the source and did


$ python setup.py config

ERROR: unknown if big-endian or little-endian
edit the config file before compiling


The config file looks like this

/* config.h file created by setup.py script Fri Nov 5 15:55:52 2010
* run on darwin */
#define UNK 1
#define BIGENDIAN UNKNOWN /* replace with 0 or 1 */


I modified it appropriately and then did as the instructions say:


python setup.py build
python setup.py install


test by:


>>> from transcendental import ndtr 
>>> ndtr(2.) - ndtr(-2.)
0.95449973610364158


I figured the C version should be a lot faster, and it is:


$ python -m timeit -s 'from transcendental import stdtr;  import random' 't = random.random();  stdtr(3,t)'
1000000 loops, best of 3: 1.26 usec per loop

$ python -m timeit -s 'import distribution; from distribution import stdtr; import random' 't = random.random(); stdtr(3,t)'
10000 loops, best of 3: 24.2 usec per loop


But ... it doesn't seem to be the slow step in the t test code.

Calculating the cdf and testing the result


I looked up the definitions for the t distribution's pdf and cdf in wikipedia (here).

I translated the x-dependent part into Python, "vectorized" it, and used it to construct a pdf. It's important to set really wide limits for the vector in this step, otherwise numerous small probability values are discarded, which then leads to small errors in the eventual result.

Since the normalization constants (gamma functions) depend on df but not on x, I was able to normalize the pdf simply by doing:

pdf /= sum(pdf)

I also used a very small step size in the Numpy array. The second section of the code listing ends by printing values from the distributions for whole numbers between -5 and 5:


pdf
-5.000 0.00422
-4.000 0.00916
-3.000 0.02297
-2.000 0.06751
-1.000 0.20675
0.000 0.36755
1.000 0.20675
2.000 0.06751
3.000 0.02297
4.000 0.00916
5.000 0.00422

calculated cdf
-5.000 0.00770
-4.000 0.01401
-3.000 0.02884
-2.000 0.06970
-1.000 0.19560
0.000 0.50018
1.000 0.80460
2.000 0.93037
3.000 0.97118
4.000 0.98600
5.000 0.99231


As a test, I show that I get very similar values using stdtr from either PyCogent or Cephes:


stdtr from transcendental
-5.000 0.00770
-4.000 0.01400
-3.000 0.02883
-2.000 0.06966
-1.000 0.19550
0.000 0.50000
1.000 0.80450
2.000 0.93034
3.000 0.97117
4.000 0.98600
5.000 0.99230

stdtr from PyCogent
-5.000 0.00770
-4.000 0.01400
-3.000 0.02883
-2.000 0.06966
-1.000 0.19550
0.000 0.50000
1.000 0.80450
2.000 0.93034
3.000 0.97117
4.000 0.98600
5.000 0.99230


And that all matches what I get from R. So I'm pretty confident that I have the distributions correct and the functions are working as they should.


> A = seq(-5,5,by=1)
> pt(A,3)
[1] 0.007696219 0.014004228 0.028834443 0.069662984
[5] 0.195501109 0.500000000 0.804498891 0.930337016
[9] 0.971165557 0.985995772 0.992303781


code listing:


import numpy as np
import distribution # also imports special
import transcendental

dx = 0.001
X = np.arange(-100,100+dx,dx)

@np.vectorize
def p(df,x):
    base = 1 + x**2/df
    exponent = (df+1)/2
    return base**(-exponent)

df = 3
pdf = p(df,X)

# normalization depends on df but not x
# so this works
pdf /= sum(pdf)
#=============================================

def show(F,multiplier=1):
    N = 95000
    for n in range(N,len(X)-N,1000):
        t = X[n], multiplier*F[n]
        print '%5.3f %6.5f' % t
    print

# we include a multiplier for the pdf only
print 'pdf'
show(pdf,multiplier=1000)

cdf = [pdf[0]]
for n in pdf[1:]:
    cdf.append(n + cdf[-1])
print 'calculated cdf'
show(cdf)
#=============================================

@np.vectorize
def f(x):
    return transcendental.stdtr(df,x)

cdf2 = f(X)
print 'stdtr from transcendental'
show(cdf2)

@np.vectorize
def f(x):
    return distribution.stdtr(df,x)

cdf3 = f(X)
print 'stdtr from PyCogent'
show(cdf3)

Student's t-test again 2


This is the second in a series of posts about Student's t test. The previous post is here. If you want to do a t-test for real data, I would urge you to use either R or SciPy.

R


R is great software. It was written by statisticians for statistical work and is thoroughly tested. My only problem is that I find it very difficult to write R code. R was not designed for text or character manipulation, where Python excels. I've been very impressed with how well the GUI works on Mac OS X.

Note: I posted about the t-test in R before (here, here and here).


SciPy


SciPy is a library that provides all kinds of goodies useful for scientific applications. It has statistical functions too, and lots of 'em, e.g. the two sample t test. If your goal is to use Python for statistical programming, that is probably where you should go. I posted about my difficulties installing SciPy: here, here and here. But I finally got it to work, even on my office computer, described briefly here. Resist the urge to "roll your own."

You could also try Sage. It should be much easier to install than SciPy (a binary with everything but the kitchen sink). However, I could only find elementary statistics in a quick search.


PyCogent


Our goal here is to understand how the tests work. It's not complicated. In order to do this I started with the statistical functions included as part of PyCogent. I have a bunch of posts about PyCogent here, including the one on the Two sample t-test in Python here.

If you don't want to install the whole PyCogent package (or you just want to take the modules apart like I did) you can download (download source) and modify three modules from the source: these are test.py, distribution.py and special.py. You can find them in /PyCogent-1.4.1/cogent/maths/stats. I copied them out onto my desktop, and then I stripped out everything I could, while still allowing my test of t_two_sample to run.

Let's look at the names defined in the stripped-down modules:

>>> def show(module):
...     for n in dir(module):
...         if not '_' == n[0]:
...             print n
...
>>> import test
>>> show(test)
array
asarray
division
isinf
isnan
mean
sqrt
sum
t_high
t_low
t_one_observation
t_one_sample
t_paired
t_tailed_prob
t_two_sample
tprob
var
>>> import distribution
>>> show(distribution)
MACHEP
PI
atan
betai
division
exp
sqrt
stdtr
stdtri
t_high
t_low
tprob
>>> import special
>>> show(special)
GP
GQ
Gamma
Gamma_small
MACHEP
MAXGAM
MAXLOG
MINLOG
betai
betai_result
division
exp
floor
log
polevl
pseries
sin
sqrt


The modules are hierarchical. At the top level (test.py) are the functions we would call from our scripts, including t_two_sample. These call down into distribution.py to stdtr, which is the standard t distribution. This distribution is in turn computed using the integral of the incomplete beta function (betai) and the Gamma function in special.py.

stdtr is a cumulative distribution function. I used a trick to compute the pdf from it, and then plot the pdf for various values of the degrees of freedom (df). This figure is at the top of the post. As you can see, the smaller the df, the "fatter" the tails on the distribution. I made a second plot for the cdf.



script.py

import math
import numpy as np
import matplotlib.pyplot as plt
from distribution import stdtr

PI = math.pi # needed by stdtr
dx = 0.01
X = np.arange(-4,4+dx,dx)
colors = 'rkkkm'
plot_pdf = True

for i,df in enumerate([1,2,5,10,30]):
    cdf = [stdtr(df,x) for x in X]
    pdf = [0]
    for j in range(1,len(cdf)):
        pdf.append(cdf[j] - cdf[j-1])
    Y2 = np.array(pdf)
    Y2 /= sum(Y2)
    if plot_pdf:
        plt.plot(X,Y2,color=colors[i],lw=2)
    else:
        plt.plot(X,cdf,color=colors[i],lw=2)

plt.savefig('example.png')


ttest.py

# two sample t-tests
import numpy as np
import test as stats

def oneTrial(n,f=np.random.normal):
    N = 50 # num of individual tests
    SZ = n*2*N # need this many nums
    draw = f(loc=50,scale=3,size=SZ)
    counter = 0
    for i in range(0,SZ,n*2):
        nums1 = draw[i:i+n]
        nums2 = draw[i+n:i+2*n]
        t, prob = stats.t_two_sample(nums1,nums2)
        if prob < 0.05: counter += 1
    return 1.0*counter / N

n = 3 # sample size
L = list()
for i in range(10):
    #if i and not i % 10: print 'i =', i
    L.append(oneTrial(n))

print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),
print 'min %3.4f, max %3.4f' % (min(L), max(L))


output from the second script:


avg 0.0515, std 0.0094 min 0.0300, max 0.0720


The output shows that on the average, the t statistic is < 0.05 about 5% of the time, as expected. Compare to the results from here.

Still to come: computing the t distribution, and setting up the tests.

Thursday, November 4, 2010

Student's t-test again 1

Periodically, I've wondered about whether anyone reads these posts. In a meta sort of way, I've even posted about it.

As ABOUT ME says, a major objective here is to provide a record so that "google can organize my head." (bbum now says "index"---well, OK)

Still, one wonders. And I have to admit that I'm NOT repeat NOT an expert on most of these topics. I'm just a student, and a true believer in Python. I'm sure Ruby would satisfy me as well (if I hadn't met Python first), but I think the possible advantage would be marginal.

And, please, if you find an error here (or even something interesting), speak up.

So I was excited to discover this AM that blogger has stats. You can't see my stats, but I can. And the stats say that 166 people loaded pages from my blog today. 166 != 0. That's great. A perennial winner is a post about t-tests.

That smells like homework. Nevertheless, it put a thought in my tiny brain, and it won't go away. So I started thinking about the t-test: the what, the why, the how, and the Python.

Note: I posted about the t-test in R before (here, here and here).

Anyway, I'm going to look into the subject a bit more seriously. But for starters, let's model the basic problem. If we sample repeatedly from a population of known mean and variance, then the accuracy of the observed means (the standard error of the mean) depends markedly on sample size. The normal distribution of the set of means is guaranteed by the miraculous Central Limit Theorem (here, and wikipedia).

The t-test deals with samples of small size, which are ubiquitous in classical biology. And even in microarray studies, if you consider the data gene-by-gene. Of course there is more to it, for example, whether the size of the samples is the same, and their individual variances.

Here is a simple Python simulation that shows the problem. The first set of samples is from a normal distribution, the second from a uniform distribution. You can clearly see the dependence of the standard deviation of the sample means on the sample size. The column labeled stdev is the computed standard deviation for the observed means, and the column labeled sem is the predicted standard deviation based on the population standard deviation and n:
sem = σ / √n.


num     mean     stdev       sem
2 10.09 3.56 3.54
3 9.99 2.86 2.89
4 10.03 2.53 2.50
5 9.97 2.23 2.24
10 10.01 1.58 1.58
15 10.02 1.29 1.29
20 10.02 1.11 1.12
25 10.01 1.00 1.00
30 10.00 0.91 0.91

2 10.01 4.08
3 10.02 3.33
4 10.00 2.90
5 10.01 2.57
10 9.99 1.80
15 10.00 1.48
20 9.99 1.28
25 10.00 1.16
30 10.01 1.06


Code listing:


import numpy as np
import random, math

N = int(1e6)
mu = 10
sigma = 5
A_normal = np.random.normal(loc=mu,scale=sigma,size=N)
A_uniform = np.random.uniform(low=0,high=mu*2,size=N)

def sample(A,SZ):
    rL = list()
    for j in range(10000):
        L = [random.choice(A) for k in range(SZ)]
        rL.append(np.mean(L))
    return (np.mean(rL),np.std(rL))

def show(A,extra=True):
    R = range(2,6) + range(10,31,5)
    for SZ in R:
        t = sample(A,SZ)
        pL = ['%2d' % SZ,
              '%5.2f' % t[0],
              '%5.2f' % t[1]]
        if extra:
            pL.append('%5.2f' % (sigma/math.sqrt(SZ)))
        print (' '*5).join(pL)
    print

print 'num mean stdev sem'
show(A_normal)
show(A_uniform,extra=False)

R from Python, baby steps


I've been playing around with RPy a bit this morning. As the main page says:
rpy2 is a redesign and rewrite of rpy. It is providing a low-level interface to R, a proposed high-level interface, including wrappers to graphical libraries, as well as R-like structures and functions.

I just used easy_install


$ easy_install rpy2
Searching for rpy2
Reading http://pypi.python.org/simple/rpy2/
Reading http://rpy.sourceforge.net
Best match: rpy2 2.1.7
Downloading http://pypi.python.org/packages/source/r/rpy2/rpy2-2.1.7.tar.gz#md5=e8e8db05f13644ce04784888156af471
Processing rpy2-2.1.7.tar.gz
...

error: /Library/Python/2.6/site-packages/easy-install.pth: Permission denied


For some reason, root was the owner of the .pth file. So I changed it, and then got:


Using /Library/Python/2.6/site-packages/rpy2-2.1.7_20101104-py2.6-macosx-10.6-universal.egg
Processing dependencies for rpy2
Finished processing dependencies for rpy2


The example I chose to run was described in more detail in this post. If we run it from R, it looks like this:


> library(Bolstad)
Warning message:
package 'Bolstad' was built under R version 2.10.1
> result = binobp(68,200,1,1)
Posterior Mean : 0.3415842
Posterior Variance : 0.0011079
Posterior Std. Deviation : 0.0332852

Prob. Quantile
------ ---------
0.005 0.2591665
0.01 0.2666906
0.025 0.2779134
0.05 0.287724
0.5 0.3410604
0.95 0.3972323
0.975 0.4082264
0.99 0.4210788
0.995 0.4298666
> result$mean
[1] 0.3415842
> class(result$mean)
[1] "numeric"
>


In R, the variable result is a list of numeric vectors with names:
$posterior, $likelihood, $prior, $pi (990 elements each), $mean, $var, $sd, $quantiles.

In the Python interpreter:


>>> import rpy2.robjects as robjects
>>> robjects.r['pi'][0]
3.1415926535897931
>>>
>>> from rpy2.robjects.packages import importr
>>> importr('Bolstad')
Warning message:
package 'Bolstad' was built under R version 2.10.1

>>>
>>> binobp = robjects.r['binobp']
>>> result = binobp(68,200,1,1)
Posterior Mean : 0.3415842
Posterior Variance : 0.0011079
Posterior Std. Deviation : 0.0332852

Prob. Quantile
------ ---------
0.005 0.2591665
0.01 0.2666906
0.025 0.2779134
0.05 0.287724
0.5 0.3410604
0.95 0.3972323
0.975 0.4082264
0.99 0.4210788
0.995 0.4298666


and it opens X11 (rather than Quartz, not sure why) to do the plot. Getting the individual values from the result is a slight pain, but not too bad:


>>> result
<Vector - Python:0x100544290 / R:0x100c6c610>
>>> L = str(result).split('\n')
>>> L[0]
'$posterior'
>>> L[1][:35]
' [1] 1.516820e-80 8.663662e-78 '
>>> result[0][0]
1.5168201745820013e-80


To get the names of the vectors, we need to parse str(result) after splitting on double newlines. If you already know the index of the value you want you can just grab it directly as shown. And of course, that's better, since the value is a float rather than a string.

[UPDATE: As the first comment says, using names is the way to do this. Docs here. And see examples in later posts.]


>>> L = str(result).split('\n\n')
>>> str(L[0]).split('\n')[0]
'$posterior'
>>> str(L[0]).split('\n')[1][:35]
' [1] 1.516820e-80 8.663662e-78 '
>>> str(L[4]).split('\n')[:2]
['$mean', '[1] 0.3415842']
>>> result[4][0]
0.34158415841584161
>>> type(result[4][0])
<type 'float'>
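
A sketch of the names-based access mentioned in the update, sticking to calls already used above (robjects.r[...]); the names come back in the order listed earlier, so we can look up $mean without counting by hand:

>>> names = robjects.r['names'](result)
>>> list(names)
['posterior', 'likelihood', 'prior', 'pi', 'mean', 'var', 'sd', 'quantiles']
>>> result[list(names).index('mean')][0]
0.34158415841584161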

Wednesday, November 3, 2010

On the value of silicon


Computers are no more a threat to mathematicians than food
processors are a threat to cooks.

Andrei Okounkov as quoted here

Poisson simulation

This is a very simple simulation of the Poisson distribution using Python.

First a function event() is defined which returns either 1 or 0, with 1 occurring at frequency f (here f = 0.1). A list is constructed by calling event() N times.

Then the list is chopped into groups using a method described and discussed previously (here and at Stack Overflow here).

In the example, there are 10 events in each group, so the expectation for the average number of successes is 1. This should generate a Poisson distribution with mean λ = 1.

The Counter class is used to evaluate the results. It is included in Python 2.7 but not in my 2.6 installation, so I import a local copy of the module. Then we output the results for various k:

 0 3533
1 3801
2 1959
3 578
4 108
5 19
6 2


The problem I'm having is that the results are not quite correct. We obtain the expected values when n = 9 rather than n = 10. If f is made smaller (say f = 0.01) it looks better, but still the expected distribution is more closely approximated by n = 99 than n = 100.

I'd be grateful for any ideas about what's wrong!

[UPDATE: I think the problem here is the same as I had the other day. The Poisson is an approximation. If we do int(1e7) events in groups of n=1000, it looks as expected. ]
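
As a quick check of that explanation, here is a small addition (mine) that computes the exact expected counts by hand: with N = 1e5 events in groups of n = 10 there are 10000 groups, and the observed counts above track the Binomial(10, 0.1) values more closely than the Poisson(1) values.

from __future__ import division
from math import exp, factorial

def poisson_pmf(k,lam=1.0):
    return lam**k * exp(-lam) / factorial(k)

def binom_pmf(k,n=10,p=0.1):
    C = factorial(n) / (factorial(k)*factorial(n-k))
    return C * p**k * (1-p)**(n-k)

groups = 10000
for k in range(7):
    print k, round(groups*binom_pmf(k),1), round(groups*poisson_pmf(k),1)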


from itertools import izip_longest
import numpy as np
import Counter

def event():
    f = 0.1
    r = np.random.random()
    if r <= f: return 1
    return 0

N = int(1e5)
L = [event() for i in range(N)]

def groups(iterable, n=3, padvalue=0):
    "groups('abcde', 3, 'x') --> ('a','b','c'), ('d','e','x')"
    return izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

rL = [sum(g) for g in groups(L,n=10)]
C = Counter.Counter(rL)
for i in range(max(C.keys())+1):
    print str(i).rjust(2), C[i]

Tuesday, November 2, 2010

Don't forget D.C.

I spend much more time on politics than I really should. More precisely, on trying to "suss out" the remarkably large differences between what politicians profess to believe and what they actively support. Also, to try to understand which policies are good ones and what the prospects are for those to be enacted.

One of my favorite sources is Nate Silver's blog FiveThirtyEight, even though he's been absorbed into the currently undistinguished national newspaper of record.

The question arises: why 538?

Because that is the number of electors in the Electoral College. OK. And why is that not equal to the number of Congress critters (435) + Senators (100)? It isn't because of the non-voting delegates (American Samoa, the District of Columbia, the Virgin Islands, Guam, and the Northern Mariana Islands) and one Commissioner (Puerto Rico).

It's because the 23rd Amendment to the U.S. Constitution gives the District of Columbia its own electors (three, the same as the smallest state), and 435 + 100 + 3 = 538. Here is what the amendment says:



wikimedia

Saturday, October 30, 2010

Additive variance for addition and subtraction

I've learned that

"for the sum or difference of independent random variables, variances add"

(source)

Here is a Python simulation with numbers drawn from a normal distribution with mean = 0 and std = 2. As expected, the actual variance observed for a sample is not exactly equal to 4, though it approaches this as N increases. What was a surprise to me is that the statement "variances add" is only approximately true for actual data. (That is, unless I've screwed up somehow). Looking back, I saw this before but ignored it.
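
Here is one way to see where the small discrepancy comes from (my addition, not part of the original analysis). For any actual data set, var(A ± B) = var(A) + var(B) ± 2*cov(A,B) holds exactly, so the gap between the sum of the variances and the variance of the sum (or difference) is just twice the sample covariance, which is only approximately zero for a finite sample.

import numpy as np

N = 10**4
A = np.random.normal(loc=0,scale=2,size=N)
B = np.random.normal(loc=0,scale=2,size=N)

# sample covariance with the same 1/N normalization that np.var uses
cov = np.mean(A*B) - np.mean(A)*np.mean(B)

print np.var(A+B), np.var(A) + np.var(B) + 2*cov
print np.var(A-B), np.var(A) + np.var(B) - 2*cov
print 2*cov    # the discrepancy; it shrinks as N grows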

Here are trials for various N, showing the mean, variance and standard deviations of the samples, the sums of the actual variances, and the variances of the sum and difference.

[UPDATE: I handled the printout of N's value poorly. For example, in the interpreter:
>>> 10e2
1000.0

for N = 100, when what I actually did in code was 10**2. ]


    mean       var       std

N = 10e2
-0.009 3.957 1.989 data1
0.339 3.192 1.787 data2
0.330 7.149 3.776 total
add
0.330 7.101 2.665 combo
sub
-0.348 7.197 2.683 combo
------------------------------
N = 10e3
0.066 3.727 1.931 data1
0.009 4.027 2.007 data2
0.076 7.754 3.937 total
add
0.076 7.753 2.784 combo
sub
0.057 7.755 2.785 combo
------------------------------
N = 10e4
-0.025 4.007 2.002 data1
0.041 4.089 2.022 data2
0.015 8.097 4.024 total
add
0.015 8.087 2.844 combo
sub
-0.066 8.106 2.847 combo
------------------------------
N = 10e7
0.000 4.001 2.000 data1
0.001 3.999 2.000 data2
0.001 8.000 4.000 total
add
0.001 8.001 2.829 combo
sub
-0.000 7.998 2.828 combo
------------------------------



import numpy as np

def show(stats,name=None):
    L = [' %3.3f ' % e for e in stats]
    print ''.join([e.rjust(10) for e in L]),
    if name: print name
    else: print

def get_stats(data):
    funcs = [np.mean,np.var,np.std]
    return np.array([f(data) for f in funcs])

def init(N):
    total = np.zeros(3)
    f = np.random.normal
    rL = list()
    for i in range(2):
        rL.append(f(loc=0,scale=2,size=N))
        stats = get_stats(rL[-1])
        show(stats,'data'+str(i+1))
        total += stats
    show(total,'total')
    return rL

def add(A1,A2): return A1 + A2
def sub(A1,A2): return A1 - A2
#---------------------------------------
for e in [2,3,4,7]:
    N = 10**e
    print 'N = 10e' + str(e)
    data1, data2 = init(N)
    for f in [add,sub]:
        print f.func_name
        combo = f(data1,data2)
        show(get_stats(combo),'combo')
    print '-'*30

Friday, October 29, 2010

Another head-scratcher

I know about keywords in Python. You can't do this:

>>> and = 3
File "<stdin>", line 1
and = 3
^
SyntaxError: invalid syntax


But I do like simple names for objects and functions. I acquired this habit from Stroustrup's book on the C++ language, 10 years ago, before I even knew about Python.

However, I must remember not to make the names too familiar. For example, yesterday I rashly named a script new.py. And this single line caused problems that had me scratching my head:

import matplotlib.pyplot as plt


Apparently matplotlib also uses the name new for a module (although I haven't actually located it):


Traceback (most recent call last):
File "new.py", line 1, in
import matplotlib.pyplot as plt
File "/Library/Python/2.6/site-packages/matplotlib/pyplot.py", line 6, in
from matplotlib.figure import Figure, figaspect
File "/Library/Python/2.6/site-packages/matplotlib/figure.py", line 18, in
from axes import Axes, SubplotBase, subplot_class_factory
File "/Library/Python/2.6/site-packages/matplotlib/axes.py", line 2, in
import math, sys, warnings, datetime, new
File "/Users/telliott_admin/Desktop/new.py", line 1, in
import matplotlib.pyplot as plt
AttributeError: 'module' object has no attribute 'pyplot'


[UPDATE:

new is not a matplotlib module; it's in the standard library, though it's deprecated. It provides a way to instantiate a new object without calling __init__. I'm not sure what situation you would want to use it for other than what it says in the docs.

There is one use of new in axes.py but I don't understand it:


_subplot_classes = {}
def subplot_class_factory(axes_class=None):
    # This makes a new class that inherits from SubclassBase and the
    # given axes_class (which is assumed to be a subclass of Axes).
    # This is perhaps a little bit roundabout to make a new class on
    # the fly like this, but it means that a new Subplot class does
    # not have to be created for every type of Axes.
    if axes_class is None:
        axes_class = Axes

    new_class = _subplot_classes.get(axes_class)
    if new_class is None:
        new_class = new.classobj("%sSubplot" % (axes_class.__name__),
                                 (SubplotBase, axes_class),
                                 {'_axes_class': axes_class})
        _subplot_classes[axes_class] = new_class

    return new_class

Wednesday, October 27, 2010

More on sampling a distribution

Some time ago I had a post about sampling from a distribution. More specifically, the problem is to generate random samples given a cdf (cumulative distribution function). There have been a couple of useful comments, and I'd like to extend the explanation as well.

Previously, a function was defined that can be used to generate a probability for any value of a random variable x, from a probability density function or distribution (pdf) with a given mean and standard deviation. We remember that for a continuous distribution, the actual probability at any discrete x is zero, since the total number of possible x's is infinite. Technically the definition is that the pdf is the derivative of the cdf, which I misremembered as the cumulative density function. So the cdf(x) is the area under the pdf from negative infinity to x.

One nice thing about this approach is it is then easy to define a sum of weighted distributions.

We get what looks like a smooth curve by plotting a (relatively) large number of points (in this example, 3502). The plotting uses matplotlib (see this post referencing set-up on the Mac). The cdf is computed by simply accumulating values from the pdf. Normalization is usually done by dividing by the total, but the method I showed was just slightly more subtle:

cdf *= dx
.

Unfortunately, it is also wrong! (Sorry). I won't try to explain what I was thinking, but the fact that it gave an accurate value (as shown by the maximum of the cdf being equal to 1) is an accident, and you should instead simply divide by the sum of the values, and moreover, do the operation on the pdf before constructing the cdf:

pdf /= sum(pdf)


This change to the pdf means we need to magnify it before plotting, and really, we should provide a different y-axis on the right-hand side, with the true values. The left-hand y-axis is accurate only for the cdf.

The idea for sampling is to generate a random float using np.random.random(), and then ask which of the values in the cdf (which by definition are ordered), first exceeds this value. The indexes resulting from repeating this procedure are concentrated in the steep part of the cdf (the peaks of the pdf), because the probability that a given position in our "discretized" form of the cdf satisfies this relationship is proportional to the slope of the cdf curve (the added vertical distance between a given index and the one previous).

As a reader suggested, an improvement to the code is to recognize that the list we're searching (the cdf) is ordered and so can more efficiently be searched using a binary search. The code is a little tricky to write, so I skipped it last time. And luckily, Python comes with "batteries included" and for this application what we want is the bisect module from the standard library. The example find_le function's docstring says: 'Find rightmost value less than or equal to x'. I just modified this code (which calls bisect.bisect_right) to return the index rather than the value.

We're interested in intervals where the slope is steepest. The original find_first function returned the index of the right-hand value, while this function will return the index of the left-hand value. I suppose either one is fine, but perhaps it would be better to use the midpoint.

There are a few more steps in the code that are a little obscure, including bins with fractional width, and the use of the Counter class to organize the data for the histogram. But this post is getting a bit long so I'll skip them for now.



Modified code:


import math, sys, bisect
import numpy as np
import matplotlib.pyplot as plt
import Counter

def normal(mu,sigma):
    def f(x):
        z = 1.0*(x-mu)/sigma
        e = math.e**(-0.5*z**2)
        C = math.sqrt(2*math.pi)*sigma
        return 1.0*e/C
    return f

p1 = normal(0,2)
p2 = normal(10,1)
p3 = normal(18,0.5)

# sum of weighted normal distributions
@np.vectorize
def p(x):
    return 0.5*p1(x) + 0.25*p2(x) + 0.25*p3(x)

dx = 0.01
xmax = 25
R = np.arange(-10,xmax+dx,dx)
# dashed lines
plt.plot((R[0],R[-1]),(0,0),color='k',ls=':',lw=2)
plt.plot((R[0],R[-1]),(1,1),color='k',ls=':',lw=2)
plt.plot((R[0],R[-1]),(-0.5,-0.5),color='k',lw=2)

pdf = p(R)
S = sum(pdf)
pdf /= S
plt.plot(R,pdf*(len(R)/10.0),color='r',lw=3)
#print len(R)
S = sum(pdf)
print S
#===========================================
cdf = [pdf[0]]
for e in pdf[1:]:
    cdf.append(cdf[-1] + e)
cdf = np.array(cdf)
#cdf /= S
plt.plot(R,cdf,color='b',lw=3)

ax = plt.axes()
ax.set_xlim(-6,xmax)
ax.set_ylim(-0.55,1.05)
#===========================================
def find_le(n,L):
    'modified from bisect.find_le'
    i = bisect.bisect_right(L,n)
    if i:
        return i
    raise ValueError

samples = list()
width = 0.2
f = 1/width
for i in range(10000):
    n = np.random.random()
    # must adjust to actual range
    value = find_le(n,cdf)*dx - 10.0
    # trick to truncate at fractional values
    samples.append(np.floor(f*value)/f)
#===========================================
c = Counter.Counter(samples)
maxn = c.most_common(1)[0][1]
for k in c:
    n = c[k]
    n = n * 0.45 / maxn
    r = plt.Rectangle((k,-0.5),
                      width=width,height=n,
                      facecolor='magenta')
    ax.add_patch(r)
plt.savefig('example.png')

Thursday, October 21, 2010

Progressive politics

Mark Kleiman had a post this morning in which he says

Evolution is just like global warming: each has such overwhelming scientific evidence behind it that denial of either one is strong evidence that the speaker is either ot-nay oo-tay ight-bray or blinded by some prejudice.


This would be a typical statement from "the reality-based community" but it glosses over a point which should be remembered.

Evolution as a theory is on a par with the theory of atoms. The weight of evidence is literally overwhelming. Imagine that a TV personality like Glenn Beck declared his non-acceptance of the theory of atoms. Obviously, no one would care. So why do the media care when he says the equivalent about evolution? It makes no sense from a reality-based perspective; it's just political theater. The really odd thing about evolution is that the average person seems to believe that the evidence can be evaluated by amateurs like himself.

Climate change is more complicated. The evidence that the earth's temperature is rising is unequivocal, according to the experts. The real question at issue is: what part of that change is due to human activity? And perhaps more important for me: what are the prospects for changes in policy in the U.S. or around the world that would have an impact on our carbon footprint?

I remember seeing a video featuring Steven Chu in which he said that his reading of the experts was that there's about an 80% chance that the change is due to human activity. Unfortunately, I can't find the link now, so here's another video in which he gives an estimate of a 50% chance of a 3°C temperature rise in this century and talks about what that would mean. In short, chaos.

With respect to policy, raw political advantage trumps good policy for the republicans (at least right now), so we will likely not achieve anything on this front, even by mechanisms like "cap and trade" that were originally republican "market-driven" ideas. Also, of the estimated roughly 900 billion tons of coal in the world, a substantial fraction is in China, and I expect them to burn all of that.

Saturday, October 16, 2010

Is Heinlein a Bayesian?

Brad DeLong had a post yesterday about Robert Heinlein, who I read extensively as a teenager and enjoyed tremendously. The subject was The Moon Is A Harsh Mistress, which I thought was a great story. I like Brad's blog a lot, and learn a lot about economic issues from him, but I thought the post so wrong as to make me question whether he is as knowledgable on other topics as he seems to be.

DeLong says "Heinlein [has a] failure to understand the Law of the Reverend Thomas Bayes," and then approvingly quotes another writer who complains that you can't compute the odds for an event and come up with different answers at different times. But of course this is the whole point of Bayes. If you understand nothing else about the issue, you should realize that what Bayes makes possible is to update probabilities as new evidence becomes available.
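
(For reference, the rule at issue is just P(H | E) = P(E | H) P(H) / P(E): each new piece of evidence E produces a new posterior for the hypothesis H, so recomputing the odds as information arrives is exactly what a Bayesian is supposed to do.)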

So I decided to say something in a comment, nicely I thought. I'm quite mystified to find my comment has been deleted a few hours later. I'll shrug my shoulders and go on, but I am shocked, shocked, first, that Brad DeLong is clueless about Bayes, second, that a whole string of commenters has missed the central point, and then finally that a polite comment pointing this out should be thought subversive. Very weird indeed, and disheartening.

[UPDATE: Perhaps it's just a misunderstanding. The original post says:

I have two problems with Mike. One is the figuring the odds for the revolution. I’d have bought it if he did it once. It’s the complex refiguring and odds changing and—no. People complain about the Dust hypothesis in Permutation City that you can’t calculate things out of order, and this is worse. You can’t work out odds of 7 to 1 against and then say they will keep getting worse until they get better.


The statement about "complex refiguring and odds changing and—no" is just wrong. And you can calculate in any order you like.

But it turns out Brad's objection and maybe this poster's is that they read Heinlein as saying " [the odds] will keep getting worse until they get better." You can't keep getting installments of bad news and say things are looking up.

But that's not how I remember the book. Mike knows things, everything public anyway. Some events break the revolutionists' way early, but a thousand small unmentioned things change the balance. And time is running out. Read the book, and judge for yourself. One of Heinlein's best, along with my favorite, Stranger in a Strange Land.]

[UPDATE 2:

I dug into it a little bit and I am convinced that I'm right. The original post is a very nice review of the book. Great story, but lots of not-PC stuff, characteristic Heinlein. However, it is clear that (1) Jo Walton does not understand Bayes (hence the first part of the quote above) and (2) Heinlein does, at least to the extent he knows how to bet on a horse race, and can appropriately update odds when in possession of new information.

I looked through an online version of the book (pdf) and read all the pages where the word "odds" appears, and nearby. For an example showing Heinlein's understanding, you might read the discussion about the comparative advantage of various strategies depending on the selected time frame (p. 57-58).

The sequence that Brad DeLong thinks damning, that the revolutionists' odds of success are improved after the trip to earth, is just unclear. We're never told what Mike has been "thinking" during this time, or much about what's been happening overall. The evidence used to update the odds is not given to us.

The only counterargument that I could find against my interpretation is based on this dialog on p. 110:

As soon as Stu went Earthside, Mike set odds at one in thirteen. I asked him what in hell? “But, Man,” he explained patiently, “it increases risk. That it is necessary risk does not change the fact that risk is increased."

This doesn't make any sense. But I think Heinlein was probably just seduced by the fact that it sounds like a good line. And maybe he is thinking about the time-frame dependence.

The whole rest of the book makes perfect sense from a Bayesian perspective.

Now, I'm not defending Heinlein's political philosophy. There is a lot to complain about (just look at the wikipedia entry for this book). But it's still a great story. Three things irritated me (from most to least important):

• the post held up as a model of correctness was itself wrong on the fundamentals
• DeLong wrongly criticized Heinlein as a mathematical simpleton
• he deleted my comment pointing this out

Too bad no one will ever know. ]