Python for Bioinformatics: KS test

Monday, April 30, 2012

KS test

I ran into a question about using the Kolmogorov-Smirnov test for goodness-of-fit in R and Python. I'd heard of the KS test but didn't know what it was about so I decided to explore.

Wikipedia's explanation was opaque at first (not surprising), but I found nice ones here and here.

As the references say, the KS test can be used to decide if a sample comes from a population with a specific distribution. We calculate a statistic D, and if the statistic is large enough we reject the null hypothesis that the sample comes from the specified distribution.

What is the D statistic? I think it is just the maximum deviation between the observed cdf for the sample, compared to the distribution. That works for the script below.

The KS test is a non-parametric test, it doesn't depend on assumptions about the distribution (e.g. standard deviation) of the data.

For these tests, I needed to reinstall scipy. The previous difficulties were not observed this time. It was as simple as:

git clone git://github.com/scipy/scipy
cd scipy
python setup.py build
sudo python setup.py install

The script below grabs 20 numbers from the normal distribution (sd = 2.0) and compares them with the standard normal distribution (sd = 1). The script does a plot (screenshot above) and prints the results of the KS test:

> python script.py 
max 0.198
D = 0.20, p-value = 0.36

We are not able to reject the null hypothesis, even though the source of the numbers had sd=2 and the null hypothesis is that sd=1.

I did a quick check by copying the numbers

x=c(-5.05,-2.17,-1.66,-1.35,-0.98,-0.97,-0.84,-0.74,-0.67,-0.00,0.21,0.24,0.33,0.67,0.94,1.63,1.65,2.05,2.22,3.19)

and pasting them into R. Then:

> ks.test(x,pnorm,0,1)

 One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.1986, p-value = 0.3611
alternative hypothesis: two-sided

Which matches. I also wrote a slightly hackish test of the test in R:

f <- function () {
 c = 0 
 for (i in 1:100) {
 y1 = rnorm(n,mean=0,sd=1)
 result = ks.test(y1,"pnorm",0,1)
 if (result$p.value < 0.05) {
  print (result$statistic)
  print (result$p.value)
  }
 }
}

This function spews out the values of the parameters if the p-value is < 0.05. We're testing against the same distribution from which the samples were drawn (i.e. the null hypothesis is correct), so we expect that about 5 of the 100 tries will have a test statistic so extreme that we're fooled into thinking the null hypothesis is wrong. Output:

> n = 10
> f()
       D 
0.454372 
[1] 0.020945
        D 
0.4113619 
[1] 0.04811348
        D 
0.4722679 
[1] 0.01440302
      D 
0.43666 
[1] 0.02984022
        D 
0.5340613 
[1] 0.003428069
        D 
0.4336947 
[1] 0.03161219

Looks good to me.

script.py

import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, kstest
import numpy as np

np.random.seed(1773)

def steps(L):
    delta = 1.0/len(L)
    y = 0
    rL = list()
    for x in L:
        y += delta
        rL.append(y)
    return rL

xL = np.random.normal(0,2.0,20)
xL.sort()
yL = steps(xL)

#----------------------------------

# points
plt.scatter(xL, yL, s=50, zorder=1)

# distribution
rv = norm()
pL = np.linspace(-3, 3)
plt.plot(pL, rv.cdf(pL), '--')

# vertical lines
dL = list()
for x,y0 in zip(xL,yL):
    y1 = rv.cdf(x)
    dL.append(abs(y1-y0))
    plt.plot((x,x),(y0,y1),
        color='r', zorder=0)

ax = plt.axes()
ax.set_xlim(-4, 4)
ax.set_ylim(-0.1, 1.1)
plt.savefig('example.png')

print 'max %3.3f' % max(dL)
t = kstest(xL,'norm')
print 'D = %3.2f, p-value = %3.2f' % t

xL = ['%3.2f' % n for n in xL]
print ','.join(xL)