Python for Bioinformatics: Examining distributions in R

Thursday, August 13, 2009

Examining distributions in R

In this post, I want to look at some distributions using R. Let's get 100 samples from the normal distribution with mean = 5:

set.seed(157)
v = rnorm(100,mean=5)
summary(v)

> summary(v)
   Min. 1st Qu.  Median 
  2.375   4.292   5.032 
   Mean 3rd Qu.    Max. 
  5.072   5.865   8.349

The sort function does what it says. We make a plot called a "step" plot:

n = length(v)
s = sort(v)
plot(s,(1:n)/n,
  type='s',lwd=2,col='darkred',
  ylim=c(0,1))
points(mean(v),0.5,
  pch=16,cex=2,col='blue')

You can also use draw a Q-Q plot (quantile-quantile). This plots the quantiles for our distribution (y-axis) against those for a normal distribution (x-axis). A straight line indicates that our distribution is close to normal.

qqnorm(v)

Let's get a larger sample:

v = rnorm(10000,mean=5)
m = mean(v)

> m
[1] 4.988964

s = sd(v)

> s
[1] 1.000791

The quantile function accepts an argument telling it how to make the quantiles:

S = seq(0,1,by=0.001)
x = quantile(v,S)
round(x[25:27],2)
round(x[975:977],2)

> round(x[25:27],2)
2.4% 2.5% 2.6% 
3.04 3.05 3.06 
> round(x[975:977],2)
97.4% 97.5% 97.6%
 6.93  6.95  6.98

As expected, since the mean is 5 and the standard deviation is 1, 97.5 % of the values are < 5 + 1.96.

I showed this trick for plotting the normal distribution before. Now we plot it in red, and then overlay it with the density from the t-distribution (obtained with the function dt) with degrees of freedom (df) = 9,6,3,2,1.

plot(function(x) dnorm(x),
  -5:5,max(v),lwd=2,col='red')

color.list=c(
  'purple','blue','steelblue',
  'darkred','magenta')
L = c(9,6,3,2,1)
for (i in 1:5) {
  x = L[i]
  plot(function(x) dt(x,df=i),
    -5:5,max(v),lwd=2,
    col=color.list[i],
    add=T) }

The t distribution has fatter tails. If we had plotted it for larger values of df, it would asymptotically approach the normal distribution.

By using a very large sample, we can get a good idea for the cut-offs for 95% and 97.5% from the normal distribution:

S = seq(0,1,by=0.001)
v = rnorm(100000) 
x = quantile(v,S)
round(x[950:952],2)
round(x[975:977],2)

> round(x[950:952],2)
94.9% 95.0% 95.1% 
 1.63  1.64  1.65 
> round(x[975:977],2)
97.4% 97.5% 97.6% 
 1.94  1.96  1.98

And now compare them to the cut-offs for the t-distribution with df=9. We'll use these in the next example.

w = rt(100000,df=8)
y = quantile(w,S)
round(y[950:952],2)
round(y[975:977],2)

> w = rt(100000,df=8)
> y = quantile(w,S)
> round(y[950:952],2)
94.9% 95.0% 95.1% 
 1.84  1.85  1.86 
> round(y[975:977],2)
97.4% 97.5% 97.6% 
 2.28  2.30  2.33