CS 3414 Problem Statement 3

Fitting noisy data is a very common operation in science and engineering.

Consider the following data from an experiment, where some quantity y was measured at intervals of one second, starting at time x = 1.0.

  x     y
 1.0  5.0291
 2.0  6.5099
 3.0  5.3666
 4.0  4.1272
 5.0  4.2948
 6.0  6.1261
 7.0 12.5140
 8.0 10.0502
 9.0  9.1614
10.0  7.5677
11.0  7.2920
12.0 10.0357
13.0 11.0708
14.0 13.4045
15.0 12.8415
16.0 11.9666
17.0 11.0765
18.0 11.7774
19.0 14.5701
20.0 17.0440
21.0 17.0398
22.0 15.9069
23.0 15.4850
24.0 15.5112
25.0 17.6572

A simple plot of this data shows a general upward trend that can be seen by eye.

But how can we know that we have found a good fit to the data? What does "good" even mean in this instance? One easy way to answer these questions is to try various models for the data and simply evaluate the norm (size) of the residual vector for each one. Recall that the i-th residual is just r_i = y_i - F(x_i), where (x_i, y_i) is the given data point and F(x_i) is the value of our approximation at that point.
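
For concreteness, here is a minimal sketch of this computation. It assumes Python with NumPy (an assumption; any language with a least-squares routine would do), and it fits a straight line F(x) = c_1 + c_2 x, which is just one candidate model among the several you might try.

    import numpy as np

    # The (x, y) data from the table above.
    x = np.arange(1.0, 26.0)
    y = np.array([ 5.0291,  6.5099,  5.3666,  4.1272,  4.2948,
                   6.1261, 12.5140, 10.0502,  9.1614,  7.5677,
                   7.2920, 10.0357, 11.0708, 13.4045, 12.8415,
                  11.9666, 11.0765, 11.7774, 14.5701, 17.0440,
                  17.0398, 15.9069, 15.4850, 15.5112, 17.6572])

    # m-by-n matrix whose columns are the basis functions {1, x}
    # evaluated at the data points, so n = 2 here.
    A = np.column_stack([np.ones_like(x), x])

    # Solve the least-squares problem A c ~= y for the coefficients.
    c, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Residuals r_i = y_i - F(x_i) and the norm of the residual vector.
    r = y - A @ c
    print("coefficients:", c)
    print("||r||_2 =", np.linalg.norm(r))

A smaller residual norm for one model than for another suggests (though it does not prove) a better fit; comparing these norms across several candidate models is the easy test described above.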

Now suppose we suspect that some of these points should be thrown out due to experimental error. How can we identify these bad points? One easy approach is the following. If we are doing a good job of fitting the data, statistical theory tells us that about 95% of the "scaled residuals" should lie in the interval [-2, 2]. The i-th scaled residual is just r_i / s, where s is a scaling factor defined by

       || residual ||_2
s = ---------------------,
         sqrt(m-n)
where m is the number of data points and n is the number of basis functions used to define F(x).
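
Continuing the sketch above (same assumptions, reusing x, A, and r from the previous block), the scaled residuals and the outlier test might look like this:

    # Scaling factor s = ||r||_2 / sqrt(m - n), where m is the number
    # of data points and n is the number of basis functions.
    m, n = A.shape
    s = np.linalg.norm(r) / np.sqrt(m - n)

    # Scaled residuals; roughly 95% should lie in [-2, 2] for a good fit.
    scaled = r / s

    # Flag suspect points whose scaled residual falls outside [-2, 2].
    for xi, ri in zip(x, scaled):
        if abs(ri) > 2.0:
            print(f"suspect point at x = {xi:.1f}: scaled residual {ri:+.3f}")

Any point flagged this way is a candidate for removal; after discarding it, one would refit the remaining data and examine the scaled residuals again.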