## Thursday, August 13, 2009

### Basic regression

Continuing with my education in statistics, I'm reading Dalgaard's book, *Introductory Statistics with R*. The topic of this post is linear regression by least squares.

We're going to model errors in the data as follows:

```
set.seed(157)
x = runif(10)
e = x
for (i in 1:length(x)) { e[i] = rnorm(1, mean=x[i], sd=0.3) }
y = 2.4*x + e
plot(y~x, pch=16, col='blue', cex=1.8)
```

[UPDATE: this is not the correct way to model the errors. See here.]

```
dx = x - mean(x)
dy = y - mean(y)
n = length(x)
sum(dx**2)/(n-1)
var(x)
```

The variance of x is computed in the usual way:

```
> sum(dx**2)/(n-1)
[1] 0.1182592
> var(x)
[1] 0.1182592
```

The covariance of x and y is defined in such a way that the variance of x is equal to the covariance of x with itself. That makes it easy to remember!

```
sum(dx*dy)/(n-1)
cov(x,y)
```

```
> sum(dx*dy)/(n-1)
[1] 0.4309724
> cov(x,y)
[1] 0.4309724
```
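Since the two formulas are identical when y is replaced by x, we can check directly that the covariance of x with itself is the variance. A quick self-contained check (the data here are made up, not from the post):

```r
x = c(1.2, 3.4, 2.2, 5.0, 0.7)   # arbitrary data for the check
cov(x, x)                         # same arithmetic as var(x)
var(x)
```

Both calls reduce to `sum(dx*dx)/(n-1)`, so they agree exactly.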

Correlation is just like covariance, only it is computed using z-scores (normalized data).

```
zx = (x-mean(x))/sd(x)
zy = (y-mean(y))/sd(y)
cov(zx,zy)
cor(x,y)
```

```
> cov(zx,zy)
[1] 0.9515839
> cor(x,y)
[1] 0.9515839
```

As discussed in Dalgaard (Ch. 5), the estimated slope is:

 `β = cov(x,y) / var(x)`

```
> cov(x,y) / var(x)
[1] 3.644302
```

while the intercept is:

`α = mean(y) - β*mean(x)`

The linear regression line goes through the point of means, (mean(x), mean(y)).
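These two formulas can be checked in R. A self-contained sketch (re-running the simulation from the top of the post) showing that the fitted line passes through the point of means:

```r
set.seed(157)
x = runif(10)
e = x
for (i in 1:length(x)) { e[i] = rnorm(1, mean=x[i], sd=0.3) }
y = 2.4*x + e

beta = cov(x, y) / var(x)          # estimated slope
alpha = mean(y) - beta * mean(x)   # intercept

# evaluating the fitted line at mean(x) returns mean(y)
alpha + beta * mean(x)
mean(y)
```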

Let R do the work:

```
> lm(y~x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
   -0.04804      3.64430  
```

The estimated slope is 3.64, while the intended true value is 2.4. I thought the problem was the last point, but it goes deeper than that: because each error is drawn with mean x[i] (see the UPDATE above), the simulated data actually follow y = 3.4*x plus zero-mean noise, so a slope near 3.4 is what least squares should find.

```
> # doing this by hand
> i = 7   # index of max y value
> lm(y[-i]~x[-i])

Call:
lm(formula = y[-i] ~ x[-i])

Coefficients:
(Intercept)        x[-i]  
    -0.1053       3.5803  
```

```
> xy.model = lm(y~x)
> summary(xy.model)

Call:
lm(formula = y ~ x)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.456059 -0.369435 -0.006057  0.239960  0.814548 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04804    0.28819  -0.167    0.872    
x            3.64430    0.41621   8.756 2.27e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4294 on 8 degrees of freedom
Multiple R-squared: 0.9055,  Adjusted R-squared: 0.8937 
F-statistic: 76.67 on 1 and 8 DF,  p-value: 2.267e-05
```

The correlation coefficient is

 `r = sum(dx*dy) / sqrt( sum(dx**2) * sum(dy**2) )`

That is

 `r = cov(x,y) / sqrt(var(x)*var(y))`

```
> cov(x,y) / sqrt(var(x)*var(y))
[1] 0.9515839
```

This does match the output of cor(x,y) above. What it doesn't match is the Multiple R-squared in the summary, which is simply r squared: 0.9515839^2 ≈ 0.9055.
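For simple linear regression, squaring r gives exactly the Multiple R-squared reported by summary. A self-contained check (re-running the simulation from the top of the post):

```r
set.seed(157)
x = runif(10)
e = x
for (i in 1:length(x)) { e[i] = rnorm(1, mean=x[i], sd=0.3) }
y = 2.4*x + e

r = cor(x, y)
r^2                             # should equal the Multiple R-squared (0.9055 in the post)
summary(lm(y ~ x))$r.squared    # same value
```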

To get the plot shown at the top of the post we use another extractor function on the linear model:

```
xy.model = lm(y~x)
w = fitted(xy.model)
```

`w` contains the fitted values of y for each x according to the model.

```
plot(y~x, pch=16, col='blue', cex=1.8)
xy.model = lm(y~x)
abline(xy.model)
segments(x, w, x, y, col='blue', lty=2)
```

We can do an even fancier plot, with confidence bands, as follows.

```
df = data.frame(x)
pp = predict(xy.model, int='p', newdata=df)
pc = predict(xy.model, int='c', newdata=df)
plot(y~x, pch=16, col='blue', cex=1.8)
matlines(x, pc, lty=2, col='black')
matlines(x, pp, lty=3, col='black')
```

This is promising but it is obviously going to take more work.
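One gotcha is that matlines connects points in the order they appear, so with unsorted x from runif the bands come out jagged. Sorting x before predicting and plotting fixes this; a self-contained sketch (re-running the simulation from the top of the post):

```r
set.seed(157)
x = runif(10)
e = x
for (i in 1:length(x)) { e[i] = rnorm(1, mean=x[i], sd=0.3) }
y = 2.4*x + e
xy.model = lm(y ~ x)

o = order(x)                       # indices that put x in increasing order
df = data.frame(x = x[o])
pc = predict(xy.model, interval='confidence', newdata=df)
pp = predict(xy.model, interval='prediction', newdata=df)

plot(y ~ x, pch=16, col='blue', cex=1.8)
abline(xy.model)
matlines(x[o], pc[, 2:3], lty=2, col='black')   # confidence band (lwr, upr)
matlines(x[o], pp[, 2:3], lty=3, col='black')   # prediction band (wider)
```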