Linear Function Approximation

A linear function approximator is a function y = f(x,w) that is linear in the weights, though not necessarily linear in the input x:
    y = w_1 * f_1(x) + w_2 * f_2(x) + ... + w_n * f_n(x)
where x, y, and w can be vectors, the f_i() functions can be linear or nonlinear, and w_i is the ith element of the w vector. Examples of linear function approximators include lookup tables, linear combinations of fixed polynomial or sinusoidal basis functions, radial basis functions with fixed centers and widths, and CMACs.

Linear function approximators have several nice properties. For supervised learning, the least-squares weights can be found immediately at the cost of a single matrix inversion, without any gradient descent or incremental learning. For reinforcement learning, the weights can be found at the cost of solving a single linear program.
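The supervised-learning case can be sketched as follows. This is a minimal illustration, not taken from the text: the particular basis functions f_i and the data are arbitrary choices, and the matrix inversion is done implicitly through a least-squares solve.

```python
import numpy as np

# Hypothetical fixed basis functions f_i(x); the model is linear in w
# even though f_3 and f_4 are nonlinear in x.
def features(x):
    return np.stack([np.ones_like(x), x, np.sin(x), np.cos(x)], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = 2.0 + 0.5 * x + np.sin(x)      # target lies in the span of the features

F = features(x)                     # design matrix, one row per example
w, *_ = np.linalg.lstsq(F, y, rcond=None)  # solves the normal equations in one shot

pred = F @ w                        # residual is near zero: exact recovery
```

No gradient descent appears anywhere: because the model is linear in w, the squared-error objective is a convex quadratic and `lstsq` finds its minimum directly.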

Incremental reinforcement learning algorithms also have a few useful properties in this setting. For a lookup table, almost all of the incremental algorithms are guaranteed to converge to optimality. For other linear function approximators, TD(lambda) is guaranteed to converge when doing on-policy training (each transition is trained on with a frequency proportional to its frequency in the underlying Markov chain).
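On-policy training of this kind can be sketched with linear TD(lambda) and eligibility traces on a small random-walk Markov chain. Everything concrete here is an illustrative assumption: the chain, the one-hot lookup-table features (the simplest linear approximator), and the step size.

```python
import numpy as np

n_states = 5                    # states 0..4; stepping off either end terminates
phi = np.eye(n_states)          # one-hot features: a lookup table as linear FA
gamma, alpha, lam = 1.0, 0.05, 0.8   # illustrative hyperparameters
rng = np.random.default_rng(1)

w = np.zeros(n_states)
for episode in range(3000):
    s = 2                       # start in the middle
    z = np.zeros(n_states)      # eligibility trace
    while True:
        s2 = s + (1 if rng.random() < 0.5 else -1)   # on-policy: follow the chain
        terminal = s2 < 0 or s2 == n_states
        r = 1.0 if s2 == n_states else 0.0           # reward only at the right end
        v_s = w @ phi[s]
        v_s2 = 0.0 if terminal else w @ phi[s2]
        delta = r + gamma * v_s2 - v_s               # TD error
        z = gamma * lam * z + phi[s]                 # accumulate trace
        w += alpha * delta * z                       # linear TD(lambda) update
        if terminal:
            break
        s = s2
```

Because the transitions are sampled by running the chain itself, the training distribution matches the chain's own visit frequencies, which is exactly the on-policy condition under which convergence is guaranteed. The true values here are V(s) = (s+1)/6, so w should settle near [1/6, ..., 5/6].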

Unfortunately, very few other convergence results hold. TD(lambda) can diverge during off-policy training. Q-learning can diverge even with on-policy training. SARSA can oscillate wildly, periodically forgetting everything useful it had learned so far. For incremental algorithms, restricting attention to linear function approximators therefore does not help convergence very much.

Fortunately, there is a way to ensure that all of these algorithms converge: use the residual form of each algorithm. In that case, they converge for both linear and nonlinear function approximators.
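A minimal sketch of the residual-gradient form of TD(0), assuming deterministic transitions (with stochastic transitions, an unbiased update needs two independently sampled successor states). The two-state chain and step size are illustrative choices, not from the text.

```python
import numpy as np

def residual_gradient_step(w, phi_s, phi_s2, r, gamma=0.9, alpha=0.1):
    """One step of gradient descent on the squared Bellman residual
    (r + gamma*v(s') - v(s))^2 for a linear value function v(s) = w . phi(s)."""
    delta = r + gamma * (w @ phi_s2) - (w @ phi_s)  # Bellman residual
    grad = gamma * phi_s2 - phi_s                   # d(delta)/dw: note it keeps
                                                    # the next-state term that
                                                    # ordinary TD(0) drops
    return w - alpha * delta * grad

# Tiny deterministic chain: s0 -> s1 with r=0, then s1 -> s1 with r=1.
# True values with gamma=0.9: V(s1) = 1/(1-0.9) = 10, V(s0) = 0.9*10 = 9.
phi = np.eye(2)
w = np.zeros(2)
for _ in range(20000):
    w = residual_gradient_step(w, phi[0], phi[1], 0.0)
    w = residual_gradient_step(w, phi[1], phi[1], 1.0)
```

Because this is true gradient descent on a fixed objective (a convex quadratic when the approximator is linear), it cannot diverge the way ordinary TD or Q-learning updates can; the price is that following the residual gradient is typically slower than the direct methods.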

There are also results indicating that nonlinear function approximators may be more powerful in general than linear function approximators for learning high-dimensional functions. For example, if the target function is fairly smooth (has little energy in the high frequencies) and the function approximator is nonlinear, then it is known that the number of weights needed for a good fit grows only polynomially with the dimensionality of the input. This is true for function approximators as diverse as sigmoidal neural networks and linear combinations of sine waves, and it may even be true for all of the popular nonlinear function approximators. But it has been shown not to be true for any linear function approximator. This result suggests that linear function approximators may be most useful when the input space is low dimensional, or when the training examples all come from a low-dimensional manifold in a high-dimensional space.
