I ran into a question about using the Kolmogorov-Smirnov test for goodness-of-fit in R and Python. I'd heard of the KS test but didn't know what it was about, so I decided to explore.
Wikipedia's explanation was opaque at first (not surprising), but I found nice ones here and here.
As the references say, the KS test can be used to decide if a sample comes from a population with a specific distribution. We calculate a statistic D, and if the statistic is large enough we reject the null hypothesis that the sample comes from the specified distribution.
What is the D statistic? I think it is just the maximum deviation between the observed (empirical) cdf of the sample and the cdf of the reference distribution. That interpretation works for the script below.
The KS test is non-parametric: it doesn't depend on assumptions about the distribution (e.g. the standard deviation) of the data.
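To make that concrete, here is a minimal sketch of the calculation (the data and seed here are just made up for illustration, and it assumes numpy and scipy are installed). The empirical cdf steps up by 1/n at each sorted data point, and D is the largest gap between it and the reference cdf, checked on both sides of each step:

import numpy as np
from scipy.stats import norm, kstest

# illustrative data: 20 draws from N(0, 2); the seed is arbitrary
np.random.seed(42)
x = np.sort(np.random.normal(0, 2.0, 20))
n = len(x)

# the empirical cdf jumps from (i-1)/n to i/n at the i-th sorted point;
# D is the largest gap between the reference cdf and either level
cdf = norm.cdf(x)
d_plus = np.max(np.arange(1, n + 1) / float(n) - cdf)
d_minus = np.max(cdf - np.arange(0, n) / float(n))
D = max(d_plus, d_minus)

print('by hand: D = %.4f' % D)
print('kstest:  D = %.4f, p = %.4f' % kstest(x, 'norm'))

The hand calculation and kstest should agree on D, since this is the same quantity scipy computes.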
For these tests I needed to reinstall scipy. This time I didn't run into the difficulties I'd had before. It was as simple as:
git clone git://github.com/scipy/scipy
cd scipy
python setup.py build
sudo python setup.py install
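To check that the new build is the one actually being picked up, a quick version check works (the version string will of course vary):

import scipy
print(scipy.__version__)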
The script below grabs 20 numbers from a normal distribution (mean = 0, sd = 2.0) and compares them with the standard normal distribution (sd = 1). The script draws a plot (screenshot above) and prints the results of the KS test:
> python script.py
max 0.198
D = 0.20, p-value = 0.36
We are not able to reject the null hypothesis, even though the source of the numbers had sd=2 and the null hypothesis is that sd=1.
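My guess is that this is mostly a sample-size issue: with only 20 points the test doesn't have much power to tell sd = 2 from sd = 1. A quick sketch along the same lines (the seed and run counts are arbitrary) should show the rejection rate climbing as the sample gets larger:

import numpy as np
from scipy.stats import kstest

np.random.seed(1773)

# how often is the standard-normal null rejected at 0.05
# when the data really come from N(0, 2)?
for n in (20, 200):
    rejects = 0
    for trial in range(100):
        sample = np.random.normal(0, 2.0, n)
        D, p = kstest(sample, 'norm')
        if p < 0.05:
            rejects += 1
    print('n = %3d: rejected %d of 100 runs' % (n, rejects))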
I did a quick check by copying the numbers
x=c(-5.05,-2.17,-1.66,-1.35,-0.98,-0.97,-0.84,-0.74,-0.67,-0.00,0.21,0.24,0.33,0.67,0.94,1.63,1.65,2.05,2.22,3.19)
and pasting them into R. Then:
> ks.test(x,pnorm,0,1)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.1986, p-value = 0.3611
alternative hypothesis: two-sided
Which matches. I also wrote a slightly hackish test of the test in R:
f <- function() {
    # n is taken from the global environment
    for (i in 1:100) {
        y1 = rnorm(n, mean=0, sd=1)
        result = ks.test(y1, "pnorm", 0, 1)
        if (result$p.value < 0.05) {
            print(result$statistic)
            print(result$p.value)
        }
    }
}
> n = 10
> f()
        D 
 0.454372 
[1] 0.020945
        D 
0.4113619 
[1] 0.04811348
        D 
0.4722679 
[1] 0.01440302
        D 
  0.43666 
[1] 0.02984022
        D 
0.5340613 
[1] 0.003428069
        D 
0.4336947 
[1] 0.03161219
Six rejections in 100 runs at p < 0.05 is right around the 5% false-positive rate we'd expect when the null hypothesis really is true. Looks good to me.
script.py
import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, kstest
import numpy as np
np.random.seed(1773)

def steps(L):
    delta = 1.0/len(L)
    y = 0
    rL = list()
    for x in L:
        y += delta
        rL.append(y)
    return rL

xL = np.random.normal(0,2.0,20)
xL.sort()
yL = steps(xL)

#----------------------------------
# points
plt.scatter(xL, yL, s=50, zorder=1)

# distribution
rv = norm()
pL = np.linspace(-3, 3)
plt.plot(pL, rv.cdf(pL), '--')

# vertical lines
dL = list()
for x,y0 in zip(xL,yL):
    y1 = rv.cdf(x)
    dL.append(abs(y1-y0))
    plt.plot((x,x),(y0,y1), color='r', zorder=0)

ax = plt.axes()
ax.set_xlim(-4, 4)
ax.set_ylim(-0.1, 1.1)
plt.savefig('example.png')

print 'max %3.3f' % max(dL)
t = kstest(xL,'norm')
print 'D = %3.2f, p-value = %3.2f' % t
xL = ['%3.2f' % n for n in xL]
print ','.join(xL)
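One thing I noticed in the scipy docs: kstest with 'norm' compares against the standard normal, but distribution parameters can be supplied with the args keyword. So, as a sketch (reusing the same seed, so the draw should reproduce the script's xL), we can also test the sample against the sd = 2 normal it actually came from:

from scipy.stats import kstest
import numpy as np

np.random.seed(1773)
xL = np.random.normal(0, 2.0, 20)

# default: compare against the standard normal, N(0, 1)
print('vs N(0,1): D = %3.2f, p = %3.2f' % kstest(xL, 'norm'))

# args passes (loc, scale) to the reference cdf,
# so this compares against the true source, N(0, 2)
print('vs N(0,2): D = %3.2f, p = %3.2f' % kstest(xL, 'norm', args=(0, 2.0)))

I'd expect the fit to N(0, 2) to look better, though with only 20 points neither comparison is very decisive.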