Euler Examples

Errors of Linear Regression

by R. Grothmann

In this notebook, we want to test linear regression with quadratic
polynomials. First, let us generate a random scatter around the
polynomial 0.1x^2+0.9x at 11 points, evenly spaced in [0,1].
>seed(0.5); x:=0:0.1:1; y:=0.1*x^2+0.9*x+0.01*normal(size(x));
>plot2d(x,y,a=0,b=2,c=0,d=2.5,points=1);
It is easy to find the best fit, i.e., the polynomial p of degree 2
which minimizes the sum of squared residuals

    sum_i (p(x_i) - y_i)^2.

>p:=polyfit(x,y,2)
[ 0.0116895315155  0.890503954929  0.100593820388 ]
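Note that the coefficients are returned in ascending order, with the
constant term first. As a quick check, we can evaluate the minimized
residual sum directly; it should be small, on the order of 10^-3 for
this noise level.
>sum((evalpoly(x,p)-y)^2)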
We add this to our plot.
>plot2d("evalpoly(x,p)",add=1):

[Plot: Errors of Polynomial Regression]

We have used the measurements to extrapolate the data into [1,2].

How can we check the error of this procedure?

Here is one method, which does not work: we assume that we could
just as well have gotten any data between our measurements and the
values of our estimated polynomial. This is reasonable. We then
extrapolate from each of these hypothetical data sets with a
polynomial fit. Where could we end up?

A simple interval analysis fails, grossly overestimating the error.
>yp:=evalpoly(x,p); ...
 yi:=~y~||~yp~;
yi contains the intervals between y (the measurements) and yp (the
estimated true values). We solve the normal equations A'.A.b = A'.y
of the polynomial fit in interval arithmetic, with yi in place of y.
>A:=(1_x_x^2)'; bi:=ilgs(A'.A,A'.yi')
                             ~-0.081,0.1~ 
                           ~0.3952,1.386~ 
                             ~-0.36,0.56~ 
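For comparison, solving the ordinary normal equations with the exact
measurements reproduces the coefficients of polyfit. This is a
minimal sketch, assuming the backslash operator solves linear
systems as in Matlab.
>(A'.A)\(A'.y')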
The interval polynomial is far too wide. Let us draw the possible
values into the original plot.
>t:=0:0.05:2; s:=evalpoly(t,bi'); ...
 plot2d(t,left(s)_right(s),add=1):

[Plot: Errors of Polynomial Regression]

This is bad.

So we try a Monte Carlo simulation. 1000 times, we select random
y-values between y and yp, and compute the minima and maxima of all
extrapolations for t between 0 and 2.
>function test (t,x,y,yp,n=1000) ...
  // draw one random data set between yp and y to start with
  yr=yp+random(size(yp))*(y-yp);
  smin=evalpoly(t,polyfit(x,yr,2));
  smax=smin;
  // repeat n times, keeping the pointwise extremes of all fits
  loop 1 to n;
    yr=yp+random(size(yp))*(y-yp);
    s=evalpoly(t,polyfit(x,yr,2));
    smin=min(smin,s);
    smax=max(smax,s);
  end
  return {smin,smax};
endfunction
Now plot all that in one figure.
>plot2d(x,y,a=0,b=2,c=0,d=2.5,points=1);
Compute the Monte Carlo simulation.
>t=0:0.05:2; {smin,smax}=test(t,x,y,yp);
Add it to the plot.
>plot2d(t,smin_smax,add=1,color=5):

[Plot: Errors of Polynomial Regression]

This does not look bad.

But one could argue that the plot shows we have almost no
information at all. Any schoolboy would give a point in that range
when asked to predict a value at 1.4.

The maximal error is more than 12 times larger than the random
error we started with.
>max(smax-smin)
0.124089450596
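Compared with the noise level 0.01 of the data, this is a factor of
about 12 (a quick check):
>max(smax-smin)/0.01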
Another way to look at this is to simulate normally distributed
scatter (mean 0, standard deviation 0.01) around an exact
polynomial, and study the distribution of the extrapolated value
at 2.
>x:=0:0.1:1; yt:=0.1*x+0.9*x^2;
We do that 1000 times.
>n:=1000; s=zeros(1,n); ...
 loop 1 to n; s[#]:=evalpoly(2,polyfit(x,yt+0.01*normal(size(x)),2)); end;
The results are randomly distributed around the expected value 3.8
(the value of the exact polynomial at 2).
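Indeed, the exact polynomial 0.1x+0.9x^2 has the ascending
coefficients [0,0.1,0.9], and its value at 2 is 0.2+3.6 = 3.8:
>evalpoly(2,[0,0.1,0.9])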
>{x1,y1}=histo(s,20); plot2d(x1,y1,bar=1):

[Plot: Errors of Polynomial Regression]

>mean(s), dev(s)
3.79538318084
0.0754949083364
The standard deviation of the extrapolated value is much larger
than the standard deviation 0.01 of the data.
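The amplification factor is about 7.5 (a quick check against the
noise level):
>dev(s)/0.01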
