Python for Bioinformatics: Student's t-test again 8

Tuesday, November 9, 2010

Student's t-test again 8

Top level: tails and errors

This is the eighth and last post in a series on Student's t-test. The earlier posts are here, here, here and here.

If you're just looking for some code, it is in the sixth post.

What's left? Error checking is important. You should look at the PyCogent module test.py for examples of that. I want to focus today on "tails."

A central idea of the frequentist approach to statistics is that the population mean and standard deviation are fixed but of unknown value. Some consequences:

• we must pick the p-value we will consider significant in advance of seeing the data. In practice, nearly all biologists use 0.05, so that p < 0.05 is significant (the reason is that this is the largest p-value they think they can get away with!)

• we can specify, also in advance, the direction in which we expect a difference to lie

Let's expand the second point. The one-sample t-test requires input of a sample and an expected mean. If we specify in advance of seeing the data whether the expected mean is lower or higher than the sample mean, we can use a "one-tailed test," where the t-statistic is significant if the probability obtained from the t-distribution for that one tail is 0.05 or less. If we are uncertain or agnostic about the direction of the change, we use a "two-tailed test": the 0.05 is split between the two tails, so the t-statistic is tested against 0.025 in each tail, and the difference between the sample mean and the expected mean must be larger to be significant. In my examples, I used the first choice (because it was simpler to program), as can be seen from the R versions, which specified "alternative='less'".

So, to actually use my versions of the t-test, you will need top-level code to handle the tails.
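As a rough illustration of what such top-level code might look like, here is a minimal sketch. It assumes you already have the lower-tail probability P(T <= t) from the t-distribution (for example, from the code in the sixth post). The function name `p_from_lower_tail` and the "two-sided" and "greater" labels are my own choices, borrowed from R's naming; they are not part of the earlier posts.

```python
def p_from_lower_tail(lower, alternative="two-sided"):
    """Convert a lower-tail probability P(T <= t) into a p-value.

    Hypothetical helper: `lower` is assumed to come from a t-distribution
    CDF evaluated at the observed t-statistic.
    """
    if not 0.0 <= lower <= 1.0:
        raise ValueError("lower must be a probability between 0 and 1")
    if alternative == "less":
        # One-tailed: we predicted the sample mean would be below mu.
        return lower
    if alternative == "greater":
        # One-tailed: we predicted the sample mean would be above mu.
        return 1.0 - lower
    if alternative == "two-sided":
        # Two-tailed: double the smaller tail (capped at 1).
        return min(1.0, 2.0 * min(lower, 1.0 - lower))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


# A lower-tail area of 0.03 passes a one-tailed test at 0.05,
# but not a two-tailed one, where each tail only gets 0.025.
lower = 0.03
print(p_from_lower_tail(lower, "less") < 0.05)
print(p_from_lower_tail(lower, "two-sided") < 0.05)
```

Doubling the smaller tail is the same thing as testing each tail against 0.025; it just keeps the significance threshold at 0.05 for every choice of alternative.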

Another issue that looks simple but could be tricky to code is whether the t-statistic is positive or negative.

In the one-sample t-test, if the expected mean is larger than the sample mean, then t < 0, since we compute np.mean(A) - mu. If we had specified in advance that the sample mean would be smaller than the expected mean, then we're OK ("alternative='less'"): the p-value is the area of the lower tail, below t. If we specified that it would be larger, then we need to switch the sign of t before looking up the lower tail. The same applies to the paired t-test.
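The sign handling can be sketched as follows. This is only an illustration under my own assumptions: it uses the standard library's `statistics` module instead of numpy so that it runs anywhere, and the names `one_sample_t` and `orient_t` are hypothetical, not taken from the earlier posts.

```python
import math
import statistics


def one_sample_t(A, mu):
    """t-statistic for a one-sample test: (mean(A) - mu) / standard error."""
    n = len(A)
    if n < 2:
        raise ValueError("need at least two observations")
    se = statistics.stdev(A) / math.sqrt(n)  # stdev uses n - 1
    return (statistics.mean(A) - mu) / se


def orient_t(t, alternative):
    """Return t oriented so that a lower-tail lookup gives the one-tailed p-value.

    The t-distribution is symmetric about zero, so P(T >= t) == P(T <= -t).
    """
    if alternative == "less":
        return t
    if alternative == "greater":
        return -t
    raise ValueError("alternative must be 'less' or 'greater'")


A = [4.0, 5.0, 6.0]
t = one_sample_t(A, mu=7.0)    # sample mean 5 is below mu, so t is negative
print(t < 0)
print(orient_t(t, "greater"))  # sign switched for the opposite prediction
```

Note that if we predicted "greater" but the sample mean came out below mu, the switched t is large and positive, its lower-tail area is close to 1, and the result is (correctly) nowhere near significant.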

The last point is that we've made the (unspoken) assumption for the two-sample t-test that the variances are the same for the two populations from which the samples were drawn. We should state explicitly whether this is the case ("var.equal=TRUE" in R). If it is not, the test becomes more complicated, and the values from the function as given here will not be correct.
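To give a feel for what "more complicated" means, here is a hedged sketch comparing the equal-variance (pooled) statistic with the usual alternative for unequal variances, Welch's t-test. This is not code from the earlier posts; the function names `pooled_t` and `welch_t` are made up for illustration, and only the standard library is used.

```python
import math
import statistics


def pooled_t(A, B):
    """Two-sample t and degrees of freedom, assuming equal variances."""
    n1, n2 = len(A), len(B)
    v1, v2 = statistics.variance(A), statistics.variance(B)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (statistics.mean(A) - statistics.mean(B)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(A, B):
    """Two-sample t and (generally non-integer) df, without assuming equal variances."""
    n1, n2 = len(A), len(B)
    a = statistics.variance(A) / n1
    b = statistics.variance(B) / n2
    t = (statistics.mean(A) - statistics.mean(B)) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


A = [1.0, 2.0, 3.0]
B = [0.0, 5.0, 10.0, 15.0]   # much larger spread than A
print(pooled_t(A, B))
print(welch_t(A, B))
```

When the two samples have the same size and the same variance, the two versions agree. When they don't, both t and the degrees of freedom change, so the p-value looked up from the t-distribution changes as well.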