## Friday, November 5, 2010

### Student's t-test again 2

This is the second in a series of posts about Student's t-test. The previous post is here. If you want to run a t-test on real data, I would urge you to use either R or SciPy.

### R

R is great software. It was written by statisticians for statistical work and is thoroughly tested. My only problem is that I find it very difficult to write R code. R was not designed for text or character manipulation, where Python excels. On the other hand, I've been very impressed with how well the GUI works on Mac OS X.

Note: I posted about the t-test in R before (here, here and here).

### SciPy

SciPy is a library that provides all kinds of goodies useful for scientific applications. It has statistical functions too, and lots of 'em, e.g. the two-sample t-test. If your goal is to do statistical programming in Python, that is probably where you should go. I posted about my difficulties installing SciPy: here, here and here. But I finally got it to work, even on my office computer, as described briefly here. Resist the urge to "roll your own."
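As a quick illustration of the SciPy route, here is a minimal sketch using `scipy.stats.ttest_ind` (the sample data and seed here are made up for the example):

```python
import numpy as np
from scipy import stats

np.random.seed(153)  # arbitrary seed, just for reproducibility
# two samples drawn from the same normal distribution,
# so the null hypothesis is actually true
a = np.random.normal(loc=50, scale=3, size=10)
b = np.random.normal(loc=50, scale=3, size=10)
t, p = stats.ttest_ind(a, b)  # two-sample t-test, equal variances assumed
print('t = %.3f  p = %.3f' % (t, p))
```

Since both samples come from the same distribution, the p-value should only fall below 0.05 about 5% of the time.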

You could also try Sage. It should be much easier to install than SciPy (a binary with everything but the kitchen sink). However, I could only find elementary statistics in a quick search.

### PyCogent

Our goal here is to understand how the tests work. It's not complicated. To do this I started with the statistical functions included as part of PyCogent. I have a bunch of posts about PyCogent here, including one on the two-sample t-test in Python here.

If you don't want to install the whole PyCogent package (or you just want to take the modules apart like I did) you can download the source and modify three modules: test.py, distribution.py and special.py. You can find them in /PyCogent-1.4.1/cogent/maths/stats. I copied them out onto my desktop, and then stripped out everything I could while still allowing my test of t_two_sample to run.

Let's look at the names defined in the stripped-down modules:

```
>>> def show(module):
...     for n in dir(module):
...         if not '_' == n[0]:
...             print n
...
>>> import test
>>> show(test)
array
asarray
division
isinf
isnan
mean
sqrt
sum
t_high
t_low
t_one_observation
t_one_sample
t_paired
t_tailed_prob
t_two_sample
tprob
var
>>> import distribution
>>> show(distribution)
MACHEP
PI
atan
betai
division
exp
sqrt
stdtr
stdtri
t_high
t_low
tprob
>>> import special
>>> show(special)
GP
GQ
Gamma
Gamma_small
MACHEP
MAXGAM
MAXLOG
MINLOG
betai
betai_result
division
exp
floor
log
polevl
pseries
sin
sqrt
```

The modules are hierarchical. At the top level (test.py) are the functions we would call from our scripts, including t_two_sample. These call down into distribution.py to stdtr, the cumulative distribution function of Student's t distribution. That function is in turn computed using the regularized incomplete beta integral (betai) and the Gamma function in special.py.
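The identity connecting the two levels is easy to check directly with SciPy's `betainc` (the regularized incomplete beta function): for t ≥ 0, the t cdf with df degrees of freedom equals 1 - ½·I_x(df/2, ½) where x = df/(df + t²). A minimal sketch (the helper name `t_cdf` is mine, not from the PyCogent source):

```python
from math import atan, pi
from scipy.special import betainc  # regularized incomplete beta I_x(a, b)

def t_cdf(df, t):
    # P(T <= t) for Student's t with df degrees of freedom,
    # computed from the regularized incomplete beta function
    x = df / (df + float(t)*t)
    p = 0.5 * betainc(df/2.0, 0.5, x)
    return 1.0 - p if t >= 0 else p

# sanity check: for df = 1 the t distribution is the Cauchy
# distribution, whose cdf has the closed form 0.5 + atan(t)/pi
print('%.6f' % t_cdf(1, 1.0))         # via the incomplete beta
print('%.6f' % (0.5 + atan(1.0)/pi))  # closed form: 0.750000
```

Both lines print 0.750000, which is reassuring: the beta-function route agrees with the closed form.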

stdtr is a cumulative distribution function. To get the pdf from it, I used a simple trick: take successive differences of the cdf over a fine grid and normalize. I then plotted the pdf for various values of the degrees of freedom (df). This figure is at the top of the post. As you can see, the smaller the df, the "fatter" the tails of the distribution. I made a second plot for the cdf.

script.py

```python
import math
import numpy as np
import matplotlib.pyplot as plt
from distribution import stdtr

PI = math.pi  # needed by stdtr
dx = 0.01
X = np.arange(-4, 4+dx, dx)
colors = 'rkkkm'
plot_pdf = True

for i, df in enumerate([1, 2, 5, 10, 30]):
    cdf = [stdtr(df, x) for x in X]
    # approximate the pdf by successive differences of the cdf
    pdf = [0]
    for j in range(1, len(cdf)):
        pdf.append(cdf[j] - cdf[j-1])
    Y2 = np.array(pdf)
    Y2 /= sum(Y2)
    if plot_pdf:
        plt.plot(X, Y2, color=colors[i], lw=2)
    else:
        plt.plot(X, cdf, color=colors[i], lw=2)

plt.savefig('example.png')
```

ttest.py

```python
# two sample t-tests
import numpy as np
import test as stats

def oneTrial(n, f=np.random.normal):
    N = 50        # num of individual tests
    SZ = n*2*N    # need this many nums
    draw = f(loc=50, scale=3, size=SZ)
    counter = 0
    for i in range(0, SZ, n*2):
        nums1 = draw[i:i+n]
        nums2 = draw[i+n:i+2*n]
        t, prob = stats.t_two_sample(nums1, nums2)
        if prob < 0.05:
            counter += 1
    return 1.0*counter / N

n = 3  # sample size
L = list()
for i in range(10):
    #if i and not i % 10:  print 'i =', i
    L.append(oneTrial(n))
print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),
print 'min %3.4f, max %3.4f' % (min(L), max(L))
```

output from the second script:

```
avg 0.0515, std 0.0094 min 0.0300, max 0.0720
```

The output shows that, on average, the p-value is < 0.05 about 5% of the time, as expected when the null hypothesis is true. Compare to the results from here.

Still to come: computing the t distribution, and setting up the tests.