## Saturday, August 15, 2009

### Normally squared

According to Wikipedia, the chi-square distribution (with df = k) arises as the sum of the squares of k independent, normally distributed random variables with mean 0 and variance 1. For example, in a bivariate distribution where each coordinate is an independent standard normal, the squared distance from the origin should be distributed in this way with df = 2.

```
u = rnorm(10000)
v = rnorm(10000)
w = u**2 + v**2
B = seq(0,max(w)+1,by=1)
hist(w,freq=F,ylim=c(0,0.5),breaks=B,col='cyan')
curve(dchisq(x,df=2),lwd=5,col='red',from=0,to=10,add=T)
```
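Beyond eyeballing the histogram, we can check the fit numerically with a Kolmogorov-Smirnov test against the chi-squared CDF (a quick sketch; the exact p-value will vary from run to run, but it should not be small if the fit is good):

```r
set.seed(42)                 # for reproducibility
u = rnorm(10000)
v = rnorm(10000)
w = u**2 + v**2
# compare the empirical distribution of w to the chisq CDF with df = 2
ks.test(w, "pchisq", df=2)
```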

I found that I could also get something close to the chi-squared distribution by multiplying two independent normal random variables (with means not equal to 0) together, and it seems that in this case the degrees of freedom is roughly equal to the product of the means.

```
u = rnorm(10000,3)
v = rnorm(10000,3)
w = u*v
B = seq(-5,max(w)+1,by=1)
hist(w,freq=F,ylim=c(0,0.18),breaks=B,col='cyan')
curve(dchisq(x,df=9),lwd=5,col='red',from=0,to=30,add=T)
```
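The match here is only approximate. For independent normals with unit variance and means a and b, the product has mean ab, which matches a chi-squared with df = ab, but its variance is a² + b² + 1 (here 19), while the chi-squared variance is 2·df (here 18). A quick check of the simulated moments (a sketch; values vary slightly by run):

```r
set.seed(1)
u = rnorm(100000, 3)
v = rnorm(100000, 3)
w = u*v
mean(w)   # close to 3*3 = 9, the mean of chisq with df = 9
var(w)    # close to 3^2 + 3^2 + 1 = 19, versus 2*9 = 18 for chisq df = 9
```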

Let's explore how to use R to solve the problem of the grade distribution from the last post. We have counts for males and females with grades A, B, D, F, in order, stored in a matrix M:

```
m = c(56,60,43,8)
f = c(37,63,47,5)
M = t(rbind(m,f))
```

```
> M
      m  f
[1,] 56 37
[2,] 60 63
[3,] 43 47
[4,]  8  5
```

```
r = chisq.test(M)
```

```
> r

        Pearson's Chi-squared test

data:  M
X-squared = 4.1288, df = 3, p-value = 0.2479
```

This matches the value given in Grinstead & Snell. We can explore the contributions of the individual categories to the statistic as follows.

```
E = r$expected
O = r$observed
(O-E)**2/E
```

We see that the disparity in A's contributed more to the statistic than any other category, but the overall p-value (above) is not significant.

```
> (O-E)**2/E
             m         f
[1,] 1.0985994 1.2070139
[2,] 0.2995463 0.3291068
[3,] 0.3595670 0.3950506
[4,] 0.2096039 0.2302885
```
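Summing these per-cell contributions recovers the X-squared statistic reported by chisq.test, which is a quick sanity check that this is really how the statistic is assembled:

```r
m = c(56,60,43,8)
f = c(37,63,47,5)
M = t(rbind(m,f))
r = chisq.test(M)
E = r$expected
O = r$observed
sum((O-E)**2/E)   # equals r$statistic, about 4.13
```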

We see that the 95th percentile of the chi-squared distribution for df=3 is just a bit larger than 7.8:

```
> S = seq(7,8,by=0.1)
> pchisq(S,df=3)
 [1] 0.9281022 0.9312222 0.9342109 0.9370738
 [5] 0.9398157 0.9424415 0.9449561 0.9473637
 [9] 0.9496689 0.9518757 0.9539883
```
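Rather than scanning a grid of values, the quantile function qchisq inverts pchisq directly, so we can ask for the 95th percentile in one step:

```r
qchisq(0.95, df=3)                 # about 7.815
pchisq(qchisq(0.95, df=3), df=3)   # 0.95, confirming the inversion
```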

We can do a Monte Carlo simulation (I actually do not know yet how the simulation inside this function works, but it should be just like what we did yesterday):

```
r = chisq.test(M,simulate.p.value=T)
```

```
> r

        Pearson's Chi-squared test with simulated p-value
        (based on 2000 replicates)

data:  M
X-squared = 4.1288, df = NA, p-value = 0.2414
```
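A hand-rolled version of the same idea (a sketch; chisq.test's internal method may differ in detail) generates random tables with the same row and column totals using r2dtable, computes the statistic for each, and counts how often it is at least as large as the observed one:

```r
set.seed(2009)
m = c(56,60,43,8)
f = c(37,63,47,5)
M = t(rbind(m,f))
obs = chisq.test(M)$statistic
# 2000 random tables with the same margins as M
sims = r2dtable(2000, rowSums(M), colSums(M))
stats = sapply(sims, function(tab) chisq.test(tab)$statistic)
mean(stats >= obs)   # simulated p-value, near 0.25
```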

And Fisher's exact test (also on my study list for the future):

```
r = fisher.test(M)
```

```
> r

        Fisher's Exact Test for Count Data

data:  M
p-value = 0.2479
alternative hypothesis: two.sided
```
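For a 2x2 table there is a closed form worth knowing: conditional on the margins, one cell count follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of all tables whose probability is no larger than the observed table's. A small sketch with hypothetical counts (the 2x2 table here is made up for illustration):

```r
# hypothetical 2x2 table
tab = matrix(c(8, 1,
               2, 5), nrow=2, byrow=TRUE)
p1 = fisher.test(tab)$p.value
# conditional on the margins, the top-left count is hypergeometric:
# 10 in column 1, 6 in column 2, 9 in row 1
d = dhyper(0:9, 10, 6, 9)
p2 = sum(d[d <= dhyper(8, 10, 6, 9) * (1 + 1e-7)])
c(p1, p2)   # the two p-values agree, about 0.035
```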