This is the second in a series of posts about Student's t test. The previous post is here. If you want to do a t-test for real data, I would urge you to use either R or SciPy.
R
R is great software: it was written by statisticians for statistical work and is thoroughly tested. My only problem is that I find R code very difficult to write; R was not designed for the kind of text and character manipulation where Python excels. That said, I've been very impressed with how well the R GUI works on Mac OS X.
Note: I posted about the t-test in R before (here, here and here).
SciPy
SciPy is a library that provides all kinds of goodies useful for scientific applications. It has statistical functions too, and lots of 'em, e.g. the two-sample t-test. If your goal is to use Python for statistical programming, that is probably where you should go. I posted about my difficulties installing SciPy here, here and here, but I finally got it to work, even on my office computer, as described briefly here. Resist the urge to "roll your own."
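As a quick sketch of how that looks (the data here are made up; scipy.stats.ttest_ind does the independent two-sample test, assuming equal variances by default):

    import numpy as np
    from scipy import stats

    # two made-up samples
    np.random.seed(153)
    x = np.random.normal(loc=0.0, scale=1.0, size=20)
    y = np.random.normal(loc=0.5, scale=1.0, size=20)

    # independent two-sample t-test; returns the t statistic and two-tailed p-value
    t, p = stats.ttest_ind(x, y)
    print("t = %.3f  p = %.4f" % (t, p))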
You could also try Sage, which should be much easier to install than SciPy (it's a binary with everything but the kitchen sink). However, in a quick search I could only find elementary statistics.
PyCogent
Our goal here is to understand how the tests work. It's not complicated. In order to do this I started with the statistical functions included as part of PyCogent. I have a bunch of posts about PyCogent here, including the one on the Two sample t-test in Python here.
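For orientation, calling it looks roughly like this (a sketch with made-up numbers; as far as I recall, t_two_sample returns the t statistic and the probability, but check the PyCogent docs for the exact signature):

    from cogent.maths.stats.test import t_two_sample

    x = [4.2, 4.6, 3.9, 4.4, 4.1]
    y = [3.4, 3.9, 3.6, 3.5, 3.8]

    # returns the t statistic and (I believe) the two-tailed probability
    t, prob = t_two_sample(x, y)
    print("t = %.3f  p = %.4f" % (t, prob))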
If you don't want to install the whole PyCogent package (or you just want to take the modules apart like I did), you can download the source (download source) and modify three of its modules: test.py, distribution.py and special.py. You can find them in /PyCogent-1.4.1/cogent/maths/stats. I copied them out onto my desktop, and then I stripped out everything I could while still allowing my test of t_two_sample to run. Let's look at the names defined in the stripped-down modules:
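One quick way to see what's left after stripping is just to print the names each module defines (a minimal sketch; it assumes the three stripped-down files are sitting in the working directory):

    # assumes test.py, distribution.py and special.py are in the current directory
    import test, distribution, special

    for module in (test, distribution, special):
        names = [n for n in dir(module) if not n.startswith('_')]
        print(module.__name__ + ':  ' + ', '.join(sorted(names)))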
The modules are hierarchical. At the top level (test.py) are the functions we would call from our scripts, including t_two_sample. These call down into distribution.py to stdtr, the standard t distribution. That distribution is in turn computed using the incomplete beta integral (betai) and the Gamma function in special.py.
stdtr is a cumulative distribution function. I used a trick to compute the pdf from it, and then plotted the pdf for various values of the degrees of freedom (df). That figure is at the top of the post. As you can see, the smaller the df, the "fatter" the tails of the distribution. I made a second plot for the cdf.
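The trick, as I understand it, is simply that the pdf is the derivative of the cdf, so you can evaluate the cdf on a fine grid and difference it numerically. A sketch of that idea (this is not the post's script.py; it uses SciPy's stdtr in place of the stripped PyCogent version):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.special import stdtr   # Student's t cdf, same role as PyCogent's stdtr

    x = np.linspace(-5.0, 5.0, 1001)
    dx = x[1] - x[0]

    for df in (1, 2, 5, 30):
        cdf = stdtr(df, x)
        pdf = np.diff(cdf) / dx        # numerical derivative of the cdf
        plt.plot(x[:-1], pdf, label='df = %d' % df)

    plt.legend()
    plt.show()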
script.py
ttest.py
output from the second script:
The output shows that, on average, the p-value is < 0.05 about 5% of the time, as expected. Compare to the results from here.
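That kind of check is easy to sketch (my own code, not the post's ttest.py): draw two samples from the same normal distribution many times and count how often the test comes back with p < 0.05.

    import numpy as np
    from scipy import stats

    np.random.seed(153)
    N = 10000
    count = 0
    for _ in range(N):
        # both samples come from the same distribution, so the null is true
        x = np.random.normal(0, 1, size=10)
        y = np.random.normal(0, 1, size=10)
        t, p = stats.ttest_ind(x, y)
        if p < 0.05:
            count += 1

    # should come out near 0.05
    print("fraction with p < 0.05: %.3f" % (count / float(N)))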
Still to come: computing the t distribution, and setting up the tests.