Strang has an excellent section on least squares. I'm going to work through an example here as a way of building my understanding (homework). We take the simplest possible case: three time points equally spaced at t = 1, 2, 3. At these times we observe data y = 2, 2, 4 (to make it slightly different from his example). The three points are not collinear, as you might guess and can see from the plot. The numpy function polyfit will fit a line to this data (that's what I plotted above), but we're going to find it ourselves.
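Here is a minimal sketch of that polyfit call, assuming numpy is installed. Note that polyfit returns coefficients highest power first, so the slope comes before the intercept:

```python
import numpy as np

# the data from the text: three equally spaced times and the observations
t = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 4.0])

# degree-1 fit; polyfit returns [slope, intercept]
D, C = np.polyfit(t, y, 1)
print(C, D)  # intercept 2/3, slope 1
```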

The line that "best fits" the data has the equation y = C + Dt. We need to find C and D. We use calculus first, then linear algebra. The "error" at each time t_i = 1, 2, 3 is the difference between the value on the line and the value we actually observed:

e_i = (C + D t_i) - y_i

We calculate the sum of the squares of the errors, which we wish to minimize:

E = (C + D - 2)^2 + (C + 2D - 2)^2 + (C + 3D - 4)^2

One reason for the square is that the minimization involves the derivative, and the derivative of the square function is a linear function.

This is a function of two variables (C and D), so we need to take both partial derivatives and set them both equal to zero, yielding two equations in two unknowns:

dE/dC = 2(C + D - 2) + 2(C + 2D - 2) + 2(C + 3D - 4) = 0
dE/dD = 2(C + D - 2) + 4(C + 2D - 2) + 6(C + 3D - 4) = 0
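The calculus step can be checked symbolically, assuming sympy is available. We build the sum of squared errors for the data t = 1, 2, 3 and y = 2, 2, 4, differentiate with respect to C and D, and solve:

```python
import sympy as sp

C, D = sp.symbols('C D')
t_vals = [1, 2, 3]
y_vals = [2, 2, 4]

# sum of squared errors for the line y = C + D*t
E = sum(((C + D * t) - y)**2 for t, y in zip(t_vals, y_vals))

# set both partial derivatives to zero and solve the 2x2 system
sol = sp.solve([sp.diff(E, C), sp.diff(E, D)], [C, D])
print(sol)  # {C: 2/3, D: 1}
```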

The factor of 2 in each term goes away, leaving:

3C + 6D = 8
6C + 14D = 18

Notice that the first equation says that the sum of the errors is equal to zero.

Solve for C in the first eqn:

C = (8 - 6D)/3 = 8/3 - 2D

Substitute C into the second equation:

6(8/3 - 2D) + 14D = 18
16 + 2D = 18
D = 1

and then C = 8/3 - 2 = 2/3, so the best-fit line is y = 2/3 + t.
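As a quick numerical check of the hand substitution, the two equations (after dropping the factor of 2) are 3C + 6D = 8 and 6C + 14D = 18, which numpy can solve directly:

```python
import numpy as np

# the 2x2 system from setting both partial derivatives to zero
M = np.array([[3.0, 6.0],
              [6.0, 14.0]])
rhs = np.array([8.0, 18.0])

C, D = np.linalg.solve(M, rhs)
print(C, D)  # C = 2/3, D = 1
```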

In Bolstad (p. 237) the equation for the slope is given as:

D = Σ(t - t̄)(y - ȳ) / Σ(t - t̄)^2 = 2/2 = 1

and the intercept is

C = ȳ - D t̄ = 8/3 - 2 = 2/3
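A sketch of those slope and intercept formulas applied to our data, assuming numpy, which should reproduce the calculus result:

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 4.0])

# slope: covariance-style sum over variance-style sum
slope = np.sum((t - t.mean()) * (y - y.mean())) / np.sum((t - t.mean())**2)
# intercept: the fitted line passes through (t-bar, y-bar)
intercept = y.mean() - slope * t.mean()
print(intercept, slope)  # 2/3 and 1
```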

Looks good. Let's do some linear algebra! We go back to the equation for the line, y = C + Dt. We have one equation for each observation, which is written like so:

    [ 1  1 ] [ C ]   [ 2 ]
    [ 1  2 ] [ D ] = [ 2 ]
    [ 1  3 ]         [ 4 ]

that is, Ax = b with A holding a column of ones and a column of times.

Ax = b has no solution, because b does not lie in the column space of A. We seek instead the projection p of b onto that column space.

Using the formula from last time, (and Python to do the calculation, below) we obtain:

x̂ = (AᵀA)⁻¹ Aᵀ b = (2/3, 1)
p = A x̂ = (5/3, 8/3, 11/3)
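The calculation sketched with numpy (solving the normal equations AᵀA x̂ = Aᵀb rather than forming the inverse explicitly):

```python
import numpy as np

# column of ones (for C) and column of times (for D)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([2.0, 2.0, 4.0])

# normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
p = A @ x_hat  # projection of b onto the column space of A
print(x_hat)  # [2/3, 1]
print(p)      # [5/3, 8/3, 11/3]
```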

We can check that the error e = b - p

e = (2, 2, 4) - (5/3, 8/3, 11/3) = (1/3, -2/3, 1/3)

is perpendicular to p:

e · p = (1/3)(5/3) + (-2/3)(8/3) + (1/3)(11/3) = 5/9 - 16/9 + 11/9 = 0
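The same check in numpy: e is perpendicular to p, and in fact to both columns of A, which is exactly what the normal equations enforce.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([2.0, 2.0, 4.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)
p = A @ x_hat
e = b - p  # the error vector (1/3, -2/3, 1/3)

print(e @ p)    # essentially zero
print(A.T @ e)  # both components essentially zero
```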