Figure 8: Kernel regression on three different data sets, with localness values
of 4, 3, and 3, respectively.
The fourth and last thing we need to define for a memory based learner is the function it will use to fit the local data. Kernel regression was an improvement over k-nearest neighbor on the data sets we've seen so far. However, it can still look fairly bad. The three data sets in fig. 8 show this.
File -> Open -> g1.mbl
Edit -> Metacode -> Localness 4: Very local
Model -> Graph -> Graph
File -> Open -> d1.mbl
Edit -> Metacode -> Localness 3: Very very local
Model -> Graph -> Graph
File -> Open -> h1.mbl
Model -> Graph -> Graph
Again, the kernel widths were chosen manually. Notice how the first two do an extremely poor job of extrapolating away from the data. The middle graph also has a strange bump where the gap in the data is. The data in h1.mbl was generated from a three-segment piecewise-linear function plus noise. Unfortunately, the fitted gradient between the first two segments is not very trustworthy.
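To make the comparison concrete, here is a minimal sketch of the kernel regression (weighted averaging) model in Python/NumPy. The Gaussian kernel, the bandwidth argument, and the toy data are illustrative stand-ins for Vizier's localness setting, not its actual internals. Because every prediction is a weighted average of observed outputs, the fit can never continue a trend beyond the data, which is exactly the flat extrapolation seen in the first two graphs.

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    Each prediction is a weighted average of the training outputs, so
    it is always bounded by min(y_train) and max(y_train): away from
    the data the curve flattens out instead of extrapolating a trend.
    """
    # Pairwise distances between every query point and every training point.
    d = x_query[:, None] - x_train[None, :]
    # Gaussian weights; bandwidth plays the role of Vizier's localness.
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    # Weighted average of the training outputs at each query point.
    return (w @ y_train) / w.sum(axis=1)

# Toy usage: a noisy line, queried inside and beyond the data range.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 * x + 0.1 * rng.standard_normal(40)
print(kernel_regression(x, y, np.array([0.5, 1.5]), bandwidth=0.1))
```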
This problem can be addressed by changing the local model being fit. K-nearest neighbors and kernel regression use a constant, or averaging, local model. Sometimes, the fitted function can be improved by using a more sophisticated local model. Next, we consider a linear local model.
At the beginning of this tutorial we saw how global linear regression looks on some sample data sets. Now, we'll briefly summarize the math behind linear regression. It is fine to skim this section if looking at equations is not your thing. Assume that all of the data was generated from the following relationship:

$$y_k = \beta x_k + \epsilon_k \qquad (5)$$
where the k subscript is the data point number, $x_k$ is the kth input, $y_k$ is the kth output, $\beta$ is the coefficient for the input, and $\epsilon_k$ is drawn from a Gaussian distribution with mean zero and standard deviation $\sigma$. $y_k$ may be a vector of outputs, and $x_k$ may be a vector of inputs (which would make $\epsilon_k$ a vector and $\beta$ a matrix), but we leave them scalar here to simplify. The problem is to find the value of $\beta$ which makes the given data set most likely. The $\beta$ which satisfies this condition turns out to be:

$$\hat{\beta} = \operatorname*{argmin}_{\beta} \sum_k (y_k - \beta x_k)^2 \qquad (6)$$
In order to find the value of $\beta$ which minimizes the right hand side of eq. 6, take the derivative of the right hand side with respect to $\beta$ and set it to zero. Doing so gives the following:

$$\beta = \frac{\sum_k x_k y_k}{\sum_k x_k^2} \qquad (7)$$
Usually, it is preferable to write this equation in its matrix form:

$$\beta = (X^T X)^{-1} (X^T Y) \qquad (8)$$
where $X$ is a matrix with one column for each input attribute and one row for each data point, $Y$ is a matrix with one column for each output attribute and one row for each data point, and $\beta$ is a matrix with the number of rows equal to the number of inputs and the number of columns equal to the number of outputs.
Eq. 5 assumes that the regression line passes through the origin. In order to remove that constraint, it is necessary to add an extra ``input attribute'' whose value is always 1 for every data point. Therefore, if a data set contains n input attributes, the $X$ matrix has n+1 columns.
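As a concrete illustration, the following sketch (plain NumPy, not Vizier code; the function name and toy data are ours) computes eq. 8 with the constant column appended, so the fit is not forced through the origin. A linear solver is used instead of forming the inverse explicitly, which is numerically preferable but gives the same $\beta$.

```python
import numpy as np

def global_linear_fit(X, Y):
    """Global linear regression, beta = (X^T X)^{-1} (X^T Y) (eq. 8).

    X has shape (num_points, num_inputs) and Y has shape
    (num_points, num_outputs). A column of ones is appended so the
    fit is not forced through the origin (the extra "input attribute"
    described above).
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solving the normal equations is numerically nicer than forming
    # the inverse explicitly, and yields the same beta.
    beta = np.linalg.solve(X1.T @ X1, X1.T @ Y)
    return beta  # shape (num_inputs + 1, num_outputs)

# Toy usage: data drawn from y = 3x + 1 plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(50, 1))
Y = 3.0 * X + 1.0 + 0.05 * rng.standard_normal((50, 1))
print(global_linear_fit(X, Y))  # roughly [[3.0], [1.0]]
```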
Fig. 2 shows global linear regression on several data sets. We can improve on the fit to these by fitting the linear model locally, just as we computed weighted averages earlier. A weight for each data point is computed just as before. Rather than computing a weighted average of the outputs as done in kernel regression, the weight is applied to the inputs and outputs of the data point. In matrix form, the weights for the data points are composed into a diagonal matrix, $W$, where the value of the ith diagonal element is the weight on the ith point. The regression equation becomes:

$$\beta = \left((WX)^T (WX)\right)^{-1} (WX)^T (WY) \qquad (9)$$
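Here is a minimal sketch of that computation, again in NumPy rather than Vizier's own code; the Gaussian weighting scheme and the bandwidth argument are illustrative assumptions standing in for Vizier's distance metric and localness settings. Note that a fresh weighted fit is solved for every query point, and that the slope rows of the local $\beta$ provide the gradient estimates mentioned below.

```python
import numpy as np

def locally_weighted_linear(X, Y, x_query, bandwidth):
    """Locally weighted linear regression at a single query point.

    Implements beta = ((WX)^T (WX))^{-1} (WX)^T (WY) (eq. 9), where W
    is the diagonal matrix of per-point weights, then predicts at
    x_query. The Gaussian weighting on Euclidean distance stands in
    for Vizier's distance metric and localness settings.
    """
    # Points near the query get large weights and dominate the fit.
    dist = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-0.5 * (dist / bandwidth) ** 2)

    # Append the constant column, then scale each row by its weight.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    WX = w[:, None] * X1
    WY = w[:, None] * Y
    beta = np.linalg.solve(WX.T @ WX, WX.T @ WY)

    prediction = np.append(x_query, 1.0) @ beta
    gradient = beta[:-1]  # the slope rows double as a gradient estimate
    return prediction, gradient

# Toy usage: a noisy sine curve, queried at x = 1.5.
rng = np.random.default_rng(2)
X = np.linspace(0.0, 6.0, 80)[:, None]
Y = np.sin(X) + 0.1 * rng.standard_normal((80, 1))
print(locally_weighted_linear(X, Y, np.array([1.5]), bandwidth=0.5))
```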
Fig. 9 shows a graphical comparison between the way global linear regression works and the way local linear regression works.
Figure 9: Comparison between global and local linear regression. a) Global
linear regression minimizes the sum of squared distances between the data
and the line. The result is a line fit to the data. b) Local linear
regression fits a line at the query point. It minimizes the sum of weighted
squared distances between the data and the line. The line widths reflect
the relative strengths of the data points in influencing the fit at the
indicated query. The final result is a nonlinear curve fit to the data.
Figure 10: Local linear regression on three different data sets,
with localness values of 4, 2, and 5, respectively.
Fig. 2 shows what global linear regression does on some sample data files. Vizier can show us what locally weighted linear regression does on these same files.
File -> Open -> j1.mbl
Edit -> Metacode -> Localness 4: Very local
                    Regression L: Linear
Model -> Graph -> Graph
File -> Open -> k1.mbl
Edit -> Metacode -> Localness 2: Ultra local
Model -> Graph -> Graph
File -> Open -> a1.mbl
Edit -> Metacode -> Localness 5: Fairly local
Model -> Graph -> Graph
These graphs are shown in fig. 10. The fits appear better than the global linear regression graphs and the local averaging in fig. 7. Linear regression is also preferable to averaging because it can be used to make estimates of gradients (there will be more about gradients later).
Figure 11: Kernel, linear, and quadratic regression on one data set. The best localness values are determined manually to be 2, 3, and 4, respectively.
After observing that fitting a local linear model can be better than a local averaging model, we might ask whether there are other useful models. An obvious candidate is local quadratic regression. This is done by adding the quadratic terms to the $X$ matrix. For example, the full $X$ matrix for a two-input quadratic regression would include separate columns for the terms $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$. This increases the dimension of the resulting $\beta$ matrix accordingly, but the equation for computing it is the same. In order to see the effects of quadratic regression, we can compare it against the average and linear models.
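Before running that comparison, here is a rough sketch of how the expanded design matrix might be assembled for the two-input case; the helper name and column ordering are illustrative, not Vizier's internals.

```python
import numpy as np

def quadratic_design_matrix(X):
    """Expand a two-input data matrix into full quadratic terms.

    Each row [x1, x2] becomes [1, x1, x2, x1^2, x1*x2, x2^2]: the
    constant, linear, pure quadratic, and cross terms. The expanded
    matrix is then used in the same (weighted) regression equation.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones(len(X)),  # constant term
        x1, x2,           # linear terms
        x1 ** 2,          # pure quadratic terms
        x1 * x2,          # cross term
        x2 ** 2,
    ])

# One data point [2, 3] expands to [1, 2, 3, 4, 6, 9].
print(quadratic_design_matrix(np.array([[2.0, 3.0]])))
```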
File -> Open -> i1.mbl
Edit -> Metacode -> Localness 2: Ultra local
                    Regression A: Average
Model -> Graph -> Graph
Edit -> Metacode -> Localness 3: Very very local
                    Regression L: Linear
Model -> Graph -> Graph
Edit -> Metacode -> Localness 4: Very local
                    Regression Q: Quadratic
Model -> Graph -> Graph
Fig. 11 shows these graphs. The graphs get successively smoother as the regression goes from average to linear to quadratic. Also note that the best localness value increases. One way to get a smoother graph while using averaging is to increase its localness value, but this has the negative side effect of adding bias and making the fit worse (you can test this yourself with Vizier). One of the advantages of higher-order polynomial regressions is that they allow you to use a wider smoothing kernel without causing bias problems. A wider kernel is advantageous because it makes use of more data in its fit. While linear regression has the advantage of giving gradient estimates, quadratic regression has the additional advantage of giving estimates of the local Hessian, which can be useful when doing optimization. The drawback of quadratic regression is that it has more parameters, which means it requires more data to fit them well and more computation to do the fit.
Vizier has two other models which are part way between linear and quadratic. The ``Ellipses'' model has all the pure quadratic terms (the $x_i^2$), but none of the cross terms ($x_i x_j$ for $i \neq j$). The ``Circles'' model is a linear model plus a single extra term for the sum of all the pure quadratic terms ($\sum_i x_i^2$). In some cases, one of these two models is a good compromise between linear and quadratic. It doesn't make sense to use them on the one dimensional data sets we've been using so far because one dimensional inputs have only one pure quadratic term and no cross terms.
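For multi-input data, and under the same illustrative conventions as the quadratic sketch above, the difference between the two models comes down to which columns are appended to the linear terms:

```python
import numpy as np

def ellipses_design_matrix(X):
    """Linear terms plus every pure quadratic term x_i^2, no cross terms."""
    ones = np.ones((len(X), 1))
    return np.hstack([ones, X, X ** 2])

def circles_design_matrix(X):
    """Linear terms plus one extra column: the sum of all the x_i^2."""
    ones = np.ones((len(X), 1))
    return np.hstack([ones, X, (X ** 2).sum(axis=1, keepdims=True)])

X = np.array([[2.0, 3.0]])
print(ellipses_design_matrix(X))  # [[1. 2. 3. 4. 9.]]
print(circles_design_matrix(X))   # [[1. 2. 3. 13.]]
```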
At this point we have covered the four things that describe a locally weighted learner: the distance metric, the number of nearest neighbors, the weighting function, and the local model. In Vizier, the choice of these four things is specified by the metacode. Using different metacode settings, you can create a variety of different locally weighted learners.