## Thursday, August 13, 2009

### Examining distributions in R

In this post, I want to look at some distributions using R. Let's get 100 samples from the normal distribution with mean = 5:

 `set.seed(157)v = rnorm(100,mean=5)summary(v)`

 `> summary(v) Min. 1st Qu. Median 2.375 4.292 5.032 Mean 3rd Qu. Max. 5.072 5.865 8.349 `

The sort function does what it says. We make a plot called a "step" plot:

 `n = length(v)s = sort(v)plot(s,(1:n)/n, type='s',lwd=2,col='darkred', ylim=c(0,1))points(mean(v),0.5, pch=16,cex=2,col='blue')`

You can also use draw a Q-Q plot (quantile-quantile). This plots the quantiles for our distribution (y-axis) against those for a normal distribution (x-axis). A straight line indicates that our distribution is close to normal.

 `qqnorm(v)`

Let's get a larger sample:

 `v = rnorm(10000,mean=5)m = mean(v)`

 `> m[1] 4.988964`

 `s = sd(v)`

 `> s[1] 1.000791`

The quantile function accepts an argument telling it how to make the quantiles:

 `S = seq(0,1,by=0.001)x = quantile(v,S)round(x[25:27],2)round(x[975:977],2)`

 `> round(x[25:27],2)2.4% 2.5% 2.6% 3.04 3.05 3.06 > round(x[975:977],2)97.4% 97.5% 97.6% 6.93 6.95 6.98 `

As expected, since the mean is 5 and the standard deviation is 1, 97.5 % of the values are < 5 + 1.96.

I showed this trick for plotting the normal distribution before. Now we plot it in red, and then overlay it with the density from the t-distribution (obtained with the function dt) with degrees of freedom (df) = 9,6,3,2,1.

 `plot(function(x) dnorm(x), -5:5,max(v),lwd=2,col='red')color.list=c( 'purple','blue','steelblue', 'darkred','magenta')L = c(9,6,3,2,1)for (i in 1:5) { x = L[i] plot(function(x) dt(x,df=i), -5:5,max(v),lwd=2, col=color.list[i], add=T) }`

The t distribution has fatter tails. If we had plotted it for larger values of df, it would asymptotically approach the normal distribution.

By using a very large sample, we can get a good idea for the cut-offs for 95% and 97.5% from the normal distribution:

 `S = seq(0,1,by=0.001)v = rnorm(100000) x = quantile(v,S)round(x[950:952],2)round(x[975:977],2)`

 `> round(x[950:952],2)94.9% 95.0% 95.1% 1.63 1.64 1.65 > round(x[975:977],2)97.4% 97.5% 97.6% 1.94 1.96 1.98 `

And now compare them to the cut-offs for the t-distribution with df=9. We'll use these in the next example.

 `w = rt(100000,df=8)y = quantile(w,S)round(y[950:952],2)round(y[975:977],2)`

 `> w = rt(100000,df=8)> y = quantile(w,S)> round(y[950:952],2)94.9% 95.0% 95.1% 1.84 1.85 1.86 > round(y[975:977],2)97.4% 97.5% 97.6% 2.28 2.30 2.33`