Wednesday, May 4, 2011

Intro to ANOVA


This is an introductory post on ANOVA (analysis of variance). We ask the question: given three (or more) groups of observations, is the mean of at least one group significantly different from the others? We will compute an F-statistic and compare it with an F-distribution (that is, carry out an F-test). If the statistic exceeds the 95% quantile, we will reject the null hypothesis that the means are all the same.

According to Wikipedia:


the two-group case can be covered by a t-test (Gosset, 1908). When there are only two means to compare, the t-test and the ANOVA F-test are equivalent; the relation between ANOVA and t is given by F = t².
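
As a quick check of the F = t² relation (my own sketch, not from the book), we can compute a pooled-variance two-sample t statistic and a two-group ANOVA F statistic by hand and confirm they agree:

```python
def mean(L):
    return sum(L) / float(len(L))

def sumsq(L):
    # sum of squared deviations from the mean
    m = mean(L)
    return sum((x - m) ** 2 for x in L)

def t_stat(a, b):
    # pooled-variance two-sample t statistic
    na, nb = len(a), len(b)
    sp2 = (sumsq(a) + sumsq(b)) / (na + nb - 2)  # pooled variance
    return (mean(a) - mean(b)) / (sp2 * (1.0 / na + 1.0 / nb)) ** 0.5

def f_stat(a, b):
    # one-way ANOVA F statistic for two groups (df: k-1 = 1 and N-k)
    na, nb = len(a), len(b)
    m = mean(a + b)
    SSD_X = na * (mean(a) - m) ** 2 + nb * (mean(b) - m) ** 2
    SSD_W = sumsq(a) + sumsq(b)
    return SSD_X / (SSD_W / (na + nb - 2))

# the first two folate groups from the example below
a = [243, 251, 275, 291, 347, 354, 380, 392]
b = [206, 210, 226, 249, 255, 273, 285, 295, 309]

print(abs(t_stat(a, b) ** 2 - f_stat(a, b)) < 1e-9)  # True
```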


ANOVA is a versatile (and complex) set of methods. This is just an elementary application, where we'll use the R implementation on three simple groups of data, and then compute the result ourselves in Python to see how it works internally.

To begin with, we follow the simple example from Dalgaard. You will need R and the ISwR package (or just construct the "data frame" yourself). R code:


> library(package=ISwR)
> data(red.cell.folate)
> summary(red.cell.folate)
     folate      ventilation
 Min.   :206.0   N2O+O2,24h:8
 1st Qu.:249.5   N2O+O2,op :9
 Median :274.0   O2,24h    :5
 Mean   :283.2
 3rd Qu.:305.5
 Max.   :392.0
> red.cell.folate
   folate ventilation
1     243  N2O+O2,24h
2     251  N2O+O2,24h
3     275  N2O+O2,24h
4     291  N2O+O2,24h
5     347  N2O+O2,24h
6     354  N2O+O2,24h
7     380  N2O+O2,24h
8     392  N2O+O2,24h
9     206   N2O+O2,op
10    210   N2O+O2,op
11    226   N2O+O2,op
12    249   N2O+O2,op
13    255   N2O+O2,op
14    273   N2O+O2,op
15    285   N2O+O2,op
16    295   N2O+O2,op
17    309   N2O+O2,op
18    241      O2,24h
19    258      O2,24h
20    270      O2,24h
21    293      O2,24h
22    328      O2,24h


We have 22 values in 3 groups.


> attach(red.cell.folate)
> anova(lm(folate~ventilation))
Analysis of Variance Table

Response: folate
            Df Sum Sq Mean Sq F value  Pr(>F)
ventilation  2  15516  7757.9  3.7113 0.04359 *
Residuals   19  39716  2090.3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>


The attach call lets us refer to the columns by name: folate (the values) and ventilation (the grouping factor). We make a plot (shown at the top of the post):


> plot( folate ~ ventilation, data = red.cell.folate )
> stripchart(x=folate~ventilation,
pch=16,vertical=T,add=T,col='blue')


The value Pr(>F) = 0.04359 means that P < 0.05, so at least one group mean is significantly different at the 5% level.

We write the data to disk, and remember that the groups have (respectively) 8, 9 and 5 values.


> setwd('Desktop')
> write.table(folate,'data.txt',row.names=F,col.names=F)


We use the Python script below to compute the F-statistic:


python script.py
..
MS_W 2090.32
MS_X 7757.88
F_stat 3.71


If you look in the R output above, you'll see the same values as given here for MS_W and MS_X and the F-statistic.

To get the right F-distribution, we need to know that the degrees of freedom are:


k-1 = 2  # k = number of groups
N-k = 19 # N = total observations


Since 3.71 is just higher than the 95% quantile of this F-distribution (about 3.52), we can reject the null hypothesis H0.

I found a calculator online. You can see the results in the screenshot.



The underlying calculation is pretty simple. For each relevant set of values, we compute sumsq, the sum of the squared differences between the values and the corresponding mean.

For the within-groups variation, in mathematical notation this is (i indexes the groups, j the observations within group i):


SSD_W = Σ_i Σ_j (x_ij - m_i)²


In Python, for each group we call sumsq on the group's values, which compares them with the group mean, and add the results for all three groups to give SSD_W.
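
As a sketch (using the three ventilation groups from the data above), SSD_W comes out matching the Residuals Sum Sq in the R table:

```python
def mean(L):
    return sum(L) / float(len(L))

# the three ventilation groups from red.cell.folate
A = [243, 251, 275, 291, 347, 354, 380, 392]
B = [206, 210, 226, 249, 255, 273, 285, 295, 309]
C = [241, 258, 270, 293, 328]

SSD_W = 0.0
for g in (A, B, C):
    m_i = mean(g)
    SSD_W += sum((x - m_i) ** 2 for x in g)

print(round(SSD_W, 1))  # 39716.1 (R rounds this to 39716)
```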

To carry out the between-groups comparison, we first compute the grand mean m (of all of the samples). Then we compute:


SSD_X = Σ_i Σ_j (m_i - m)²


Since the squared term (m_i - m)² is constant within each group, the inner sum just multiplies it by the group size n_i, so this is equivalent to:


SSD_X = Σ_i n_i (m_i - m)²


In the Python code this becomes:


len(g)*(mean(g)-m)**2


and sum these over all the groups to give SSD_X, the weighted sum of squared deviations of the group means from the grand mean.
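
To verify the shortcut (a sketch with the same three groups), we can compute SSD_X both ways. The double sum and the n_i-weighted form agree, and match the ventilation Sum Sq of 15516 in the R table:

```python
def mean(L):
    return sum(L) / float(len(L))

A = [243, 251, 275, 291, 347, 354, 380, 392]
B = [206, 210, 226, 249, 255, 273, 285, 295, 309]
C = [241, 258, 270, 293, 328]
groups = [A, B, C]

m = mean(A + B + C)  # grand mean of all 22 observations

# double sum: one (m_i - m)^2 term per observation
double_sum = sum(sum((mean(g) - m) ** 2 for x in g) for g in groups)

# shortcut: one term per group, weighted by group size n_i
shortcut = sum(len(g) * (mean(g) - m) ** 2 for g in groups)

print(round(double_sum, 1), round(shortcut, 1))  # both 15515.8
```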

Finally, we compute:


MS_W = SSD_W/(N-k)
MS_X = SSD_X/(k-1)
F_stat = MS_X/MS_W
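
Plugging in the rounded mean squares from the R table above reproduces the F value:

```python
MS_W = 2090.3   # Residuals Mean Sq from the R table
MS_X = 7757.9   # ventilation Mean Sq
F_stat = MS_X / MS_W
print(round(F_stat, 2))  # 3.71
```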


As to why we do this, for now you will have to go read the Wikipedia article. The explanation in Dalgaard is particularly clear; indeed, the whole book is excellent.

There is a SciPy function to carry out ANOVA (stats.f_oneway), but I don't have SciPy installed right now at home, and this post is long enough already. That's for another day.
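
For readers who do have SciPy installed, a quick sketch: stats.f_oneway returns the F statistic and P-value in one call, and stats.f.ppf gives the 95% quantile we compared against (values below assume the same three groups).

```python
from scipy import stats

A = [243, 251, 275, 291, 347, 354, 380, 392]
B = [206, 210, 226, 249, 255, 273, 285, 295, 309]
C = [241, 258, 270, 293, 328]

# one-way ANOVA: F statistic and P-value
F, p = stats.f_oneway(A, B, C)
print(round(F, 4), round(p, 4))  # 3.7113 0.0436

# the 95% quantile of F(2, 19), the critical value for the test
crit = stats.f.ppf(0.95, 2, 19)
print(round(crit, 2))            # 3.52
```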

Python code:


fn = 'data.txt'
FH = open(fn,'r')
data = FH.read().strip().split()
FH.close()

data = [int(n) for n in data]
A = data[:8]
B = data[8:17]
C = data[17:]

#A = [243,251,275,291,347,354,380,392]
#B = [206,210,226,249,255,273,285,295,309]
#C = [241,258,270,293,328]

def mean(L):
    return sum(L)*1.0/len(L)

def sumsq(L):
    m = mean(L)
    print 'sumsq'
    rL = [(x-m)**2 for x in L]
    for n,p in zip(L,rL):
        print n, round(p,1)
    S = sum(rL)
    print 'total', round(S,1), '\n'
    return S

def ANOVA(G):
    # variation within groups
    SSD_W = 0
    for g in G:
        SSD_W += sumsq(g)

    # a bit awkward, just flattening the list of lists
    # to get the mean and N
    T = list()
    for g in G:
        T.extend(g)
    m = mean(T)

    # variation between groups (X for 'cross')
    SSD_X = 0
    for g in G:
        SSD_X += len(g)*(mean(g)-m)**2

    N = len(T) # 22
    k = len(G) # 3
    MS_W = SSD_W*1.0/(N-k)
    MS_X = SSD_X*1.0/(k-1)
    F_stat = MS_X/MS_W
    return MS_W, MS_X, F_stat

MS_W, MS_X, F_stat = ANOVA([A,B,C])
print 'MS_W', round(MS_W,2)
print 'MS_X', round(MS_X,2)
print 'F_stat', round(F_stat,2)