Let's have a quick summary of the perceptron.
There are a number of variations we could have made in our procedure. I arbitrarily set the initial weights and biases to zero. In fact, this isn't a very good idea, because it gives too much symmetry to the initial state. It turns out that if you do this with the AND function, you can get into a situation where you cycle through the same sets of weights without converging. However, if you start with small random weights and biases, this breaks the symmetry and it works fine. I didn't do this in our example calculation because I wanted to keep the arithmetic simple.
Another parameter is the learning rate. If it is too small, it can take a long time to converge. If it is too big, you might continually jump over the optimum weight values and fail to converge. Another variation is to sum the changes in weights and biases over a set of four patterns, without applying them to modify the weights until all four patterns have been presented. This is called "training by epoch" as contrasted with "training by pattern". It can often give a little more stability to the process and prevent wild fluctuations in weight values. Obviously, trying out all these variations by hand can be tedious, or nearly impossible for a large network of perceptrons. This is where computer simulations come in. I think that any one of you could write a very simple computer program to explore the perceptron learning algorithm for problems involving a single perceptron with two inputs and a bias. In the next lecture, I'll give you a demo of a simulator program for more complicated networks.
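If you want to try this, here is a minimal sketch of such a program in Python. The details are my own illustrative choices rather than anything from the lecture: a step function thresholded at zero, a learning rate of 0.25, small random initial weights and bias, and the OR truth table as the training patterns.

# Minimal perceptron learning sketch: one unit, two inputs, and a bias, trained on OR.
import random

# OR truth table: ((input1, input2), target)
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

random.seed(0)
w1, w2 = random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)  # small random weights
b = random.uniform(-0.1, 0.1)                                  # small random bias
lr = 0.25                                                      # learning rate

for epoch in range(100):
    errors = 0
    for (x1, x2), target in patterns:       # "training by pattern": update after each pattern
        u = w1 * x1 + w2 * x2 + b           # net input
        output = 1 if u > 0 else 0          # step-function output
        delta = target - output
        if delta != 0:
            errors += 1
            w1 += lr * delta * x1           # perceptron learning rule
            w2 += lr * delta * x2
            b += lr * delta                 # bias treated as a weight from a unit fixed at 1
    if errors == 0:                         # an epoch with no errors means we have converged
        print("converged after", epoch + 1, "epochs:", w1, w2, b)
        break

You can change the patterns, the learning rate, or the initial weights to explore the variations described above, including starting from all-zero weights or accumulating the changes over an epoch before applying them.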
The perceptron learning rule was a great advance. Our simple example of learning how to generate the truth table for the logical OR may not sound impressive, but we can imagine a perceptron with many inputs solving a much more complex problem. Rosenblatt was able to prove that the perceptron could learn any mapping that it could represent. Unfortunately, he made some exaggerated claims for the representational capabilities of the perceptron model. (Note the distinction between being able to represent a mapping and being able to learn that representation.) This attracted the attention of Marvin Minsky and his colleague Seymour Papert, who published a devastating critique of the perceptron model in their book "Perceptrons" (1969). In this book, they said:
"Perceptrons have been widely publicized as 'pattern recognition' or 'learning machines' and as such have been discussed in a large number of books, journal articles, and voluminous 'reports'. Most of this writing ... is without scientific value .." and "Appalled at the persistent influence of perceptrons (and similar ways of thinking) on practical pattern recognition, we determined to set out our work as a book."
In the book, they pointed out that there is a major class of problems that can't be represented by the perceptron. A very simple example is the exclusive OR (XOR). They gave a very simple and compelling proof of the impossibility of finding a set of weights that would let a single-layer perceptron give the correct outputs for the XOR truth table. (If we had chosen this for our example, we would have been at it for a long time. The weights would change back and forth, but it would never converge to a final result.)
Although they were aware that other neural network architectures (like the McCulloch-Pitts network) could produce an XOR, they felt (incorrectly, as it turned out) that there was no way to extend the perceptron learning rule to deal with these sorts of networks. This put a damper on enthusiasm for neural network research, bringing it to a virtual halt for much of the 1970s. Initially, this had the effect of draining off funding for research in neural nets and diverting it toward symbolic AI. But then the lack of success of symbolic AI became apparent, and there were "Dark Ages" for neural nets, paralleled by the "Winter of AI". Both were brought on by disillusionment with over-optimistic claims.
Why do we care about the XOR? It is a hard representation for a neural net to learn, yet it is simple enough for us to understand in detail, because of the small number of variables. In order for us to understand why there is no set of weights that will allow a perceptron to generate the correct outputs for the XOR truth table, let's generalize things a bit and let the output be a continuous value between 0 and 1, and use a steep sigmoid instead of a step function for the output function, V3 = f(u3).

Let's call V3 approximately 1 if u3 > 0, and V3 approximately 0 if u3 < 0. Summing the inputs, we have

u3 = W31 V1 + W32 V2 + b3

where b3 is the bias of the output unit.
The possible inputs V1 and V2 form a 2-D space:
We want to draw a line separating the region where V3 is approximately
equal to 1 from the region where it is approximately zero. For the OR which
we just treated, what is the equation of this line? We can find it by
setting u3 = 0 in the equation above, and get:
V2 = - ( W31 / W32 ) V1 - b3 / W32

There are many values of the three variables W31, W32, and b3 that
will give such a line. What about the XOR? We want the output to be 1
(TRUE) if only one input is 1, and the output to be 0 if neither or both are
1. This means we need two lines to partition the space of possible inputs.
We don't have enough weight and bias parameters to define two lines.
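As a concrete check (with illustrative numbers of my own, not values from the lecture): for the OR, the choice W31 = W32 = 1 and b3 = -0.5 gives the line V2 = -V1 + 0.5, which puts the (0,0) input on the "approximately 0" side and the other three patterns on the "approximately 1" side. The short Python sketch below makes the same point by brute force: it searches a grid of weights and biases for a single separating line, finds one for OR, and finds none for XOR. (A grid search is only a demonstration, of course, not a proof like Minsky and Papert's.)

# Brute-force search (illustrative, not a proof): can a single line
# w1*V1 + w2*V2 + b = 0 separate the TRUE and FALSE outputs of a two-input truth table?
import itertools

def separable(table, steps=41, lo=-2.0, hi=2.0):
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    for w1, w2, b in itertools.product(grid, repeat=3):
        ok = all((w1 * v1 + w2 * v2 + b > 0) == bool(target)
                 for (v1, v2), target in table)
        if ok:
            return (w1, w2, b)          # a separating line was found
    return None                         # no line on this grid works

OR_table  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("OR :", separable(OR_table))      # prints some (w1, w2, b)
print("XOR:", separable(XOR_table))     # prints None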
Minsky and Papert identified a number of significant problems which fell into this category of not being "linearly separable". With two inputs, a linearly separable problem is one in which you can separate the inputs giving the two different outputs with a straight line. With three inputs (3-D), you need a plane. In higher dimensions it would be a hyperplane.
The way around this problem is fairly obvious: add more neurons to form a "hidden layer" which bridges the input and output units. This will give the extra parameters needed to divide up the space of possible inputs. This is what we did with the McCulloch-Pitts network that I showed for the XOR. Minsky and Papert knew this, but couldn't think of a learning rule to deal with the hidden units, and suspected that one didn't exist. Here's another condensed quote from "Perceptrons":

"The perceptron ... has many features that attract attention: its linearity, its intriguing learning theorem ... There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile."
A number of people have since independently discovered the learning
rule for a multi-layered perceptron network. (The reason for this
duplication and lack of communication between researchers is that
the study of neural networks is an interdisciplinary field. Until
recently, there were no neural networks journals; the results were
published in a wide variety of math, physics, biology, psychology and
engineering journals.) Paul Werbos (1974 Harvard Ph.D. thesis) was possibly the first to discover what is now known as the generalized delta rule, or backpropagation algorithm.
Before I describe the training algorithm, I'll show you two different
networks that are capable of representing the solution to the XOR
problem.
The network described in PDP Chapter 8 looks like this:
We have a single hidden unit, and the input units connect to both the hidden unit and the output unit. (It is hidden in the sense that it doesn't have a direct output to the outside world.) The biases are shown inside the hidden and output units, and the weights are shown beside the connections. We say that the hidden unit forms an "internal representation" of the problem. Can you tell what it is? What problem is being solved by the hidden unit with these weights and biases? [You should be able to show that it functions as an AND unit.] What does the output unit do? [It works like an OR, but with a strong inhibitory input from the AND.]
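Here is a small Python sketch of this architecture in action. The weights and biases below are my own illustrative values, not the ones in the PDP Chapter 8 figure, and a hard threshold stands in for the steep sigmoid; the hidden unit computes an AND of the inputs, and the output unit computes an OR with a strong inhibitory connection from the AND.

# XOR with one hidden unit (PDP Ch. 8 architecture), using illustrative weights of my own.

def step(u):
    return 1 if u > 0 else 0

def xor_net(x1, x2):
    # Hidden unit: acts as an AND of the two inputs (weights 1, 1 and bias -1.5).
    h = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output unit: an OR of the inputs (weights 1, 1, bias -0.5) with a strong
    # inhibitory weight (-2) from the AND unit.
    return step(1.0 * x1 + 1.0 * x2 - 2.0 * h - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table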
The net described in Chapter 5 of "Explorations in Parallel Distributed Processing" has this architecture:

          OUT
         /   \
        /     \
       /       \
      H1       H2
      | \     / |
      |  \   /  |
      |   \ /   |
      |    X    |
      |   / \   |
     IN1       IN2

It is a two-layered feed-forward net with two hidden units. (We don't count the input layer, because it doesn't really do anything.) This gives it a total of 6 weights and 3 biases to use to separate the regions that give a 0 output from those that give an output of 1.
For both of these nets, there are infinitely many solutions for the weights and biases that will solve the problem. Also, there are lots of other architectures. At least two hidden units are needed if the input units connect only to the hidden layer (with no direct connections to the output unit), but we might wonder if there are any advantages to using three or more. Does the net learn the weights faster, or display more robust behavior with noisy inputs which aren't quite 0 or 1?
Finding the optimum network architecture or set of learning parameters for
a back propagation calculation is still something of an art. There are
many questions that we would like to answer with mathematical proofs. We
may have to settle for a body of results from simulations which suggest
certain general patterns of behavior. Sometimes these results may suggest
the need for a theory that explains them, just as experimental results
often provide the direction for theoretical analysis in other branches of
science. In some cases, as when it is clear that the internal
representation should be a binary number, it is easy to determine the
minimum number of hidden units needed. Even then, it isn't clear which is
the OPTIMUM number of hidden units for quick learning, tolerance for
``noisy'' input data, or capability to generalize (the ability to properly treat inputs which were not explicitly present in the training set).
Computer simulations give us a way to answer some of these questions by
experiment.
Training a feed-forward net - Backpropagation

(See the handout "Summary of the Generalized Delta Rule")
Define ti = target (desired) output of unit i, and
ai = the actual output. (We have been calling this Vi,
but I'm now switching to the notation used in "Explorations in PDP".)
This output (the "activation") is calculated from the net input using a sigmoid function:

ai = f(ui) = 1 / ( 1 + exp( -ui ) )     (1)
As with the single layer perceptron, we calculate the net input from the weighted sum over the units j that send connections to unit i:

ui = SUM_j Wij aj + bi

where bi is the bias of unit i.
The total sum of squared errors (``tss'') is calculated from the ``target activation'', ti:

E = SUM_i ( ti - ai )^2
The sum is over the output units, because we don't know what the
target activations should be for the hidden units.
We want to adjust the weights in a way that will minimize E. This is
curve fitting in a high dimensional space. How can we make E --> 0?
E is implicitly a function of all these weights and biases. One way to do it is to adjust the weights in the direction of the negative gradient of E, so that we make a change in each weight:

delta-Wij  proportional to  - dE/dWij
Here is an example in two dimensions for the function f(x,y).
It is like a topographical map with lines of constant height. We
want to find our way to the minimum in the center.
The gradient of f (grad f(x,y)) is a vector that is perpendicular to the
lines of constant f, headed uphill. So, to minimize f(x,y), we want to follow
the negative gradient. If we take small steps, we will follow the path
a. This technique is then called "gradient descent", or "steepest
descent". (It might seem obvious that this is the optimum way to minimize the
error. Actually, it isn't. There are better ways with names like "conjugate
gradient" and "quasi-newton" methods. But, this is a fairly good technique,
and is certainly the simplest.)
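As a toy illustration (my own example, not from the handout), here is gradient descent in Python on the bowl-shaped function f(x, y) = x^2 + 3y^2, whose gradient is (2x, 6y). With a small step size it settles into the minimum at the origin, like path a; making the step size too large makes it jump back and forth and diverge along the steep direction, like path b.

# Gradient descent on f(x, y) = x^2 + 3*y^2 (an illustrative bowl-shaped function).
# The gradient is (2x, 6y); we step in the opposite direction.

def grad_descent(x, y, lr=0.1, steps=50):
    for n in range(steps):
        gx, gy = 2.0 * x, 6.0 * y         # gradient of f at (x, y)
        x, y = x - lr * gx, y - lr * gy   # step downhill
    return x, y, x**2 + 3.0 * y**2        # final point and final value of f

print(grad_descent(2.0, 1.0, lr=0.1))     # small steps: converges toward the minimum at (0, 0)
print(grad_descent(2.0, 1.0, lr=0.4))     # step too large: oscillates and grows in the y direction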
We can use the chain rule to show that the weight change is:

delta-Wij  proportional to  2 ( ti - ai ) ai ( 1 - ai ) aj

The expression on the right hand side arises from a nice feature of the sigmoidal activation function. From Eq. (1) you should be able to verify that:

dfi / dui = fi ( 1 - fi ) = ai ( 1 - ai )
(There are some intermediate steps left out for you to fill in.)
Putting in a constant of proportionality that absorbs the factor of two, and adding another term that I will explain in a minute, we have:

delta-Wij(n+1) = eta delta_i aj + alpha delta-Wij(n)     (3)

where n is the iteration number, and

delta_i = ( ti - ai ) ai ( 1 - ai )

Except for the last term, Eq. (3) is like the perceptron learning rule, but with a different expression for delta.
Here eta is the learning rate. If it is too large, we may jump back and forth over the path along the gradient, following path b, and may not reach the minimum.
The final term in the equation above is an added variation in the
algorithm that prevents radical changes in the weights due to the use
of gradient descent. This term gives our trajectory in weight space
some "momentum" (the parameter )
in order to preserve some memory of the direction it was going. To do this,
we add some of the weight change from the nth iteration to the
weight change that we are calculating for the (n + 1)th iteration.
This is illustrated in path c.
So far, we have something that is similar to the learning rule for
single layer perceptrons, often called the "delta rule", and it works
fine for calculating the changes for the weights to the output layer.
But, it isn't back propagation, yet. We have a problem with the
hidden layers, because we don't know the target activations ti for
the hidden units. The trick, derived using the chain rule in PDP Chapter 8, is to use a different expression for the delta when unit i is a hidden unit instead of an output unit:

delta_i = ( SUM_k Wki delta_k ) ai ( 1 - ai )

The first factor in parentheses, involving the sum over k, is an approximation to (ti - ai) for the hidden layers, where we don't know ti. It makes use of the deltas that have been calculated for the layer above. Note that the sum over k is the sum over the units that receive input from the ith unit:

     O       O       unit k
      \     /
       \   /    Wki
        \ /
         O           hidden unit i
The procedure for adjusting the weights is as follows.
Now we see why it is called back propagation. We start with a forward
pass, presenting an input pattern, and calculate the activations of
each layer from those of the preceding layer, using the current values
of the weights. When we get to the output layer, we can compare the
output activations to the target values for the given pattern, and
calculate the delta values for the output layer. Now we propagate the
error backwards by using these delta values to calculate the
deltas for the preceding layer. If we are training by epoch, we
present another pattern and sum the delta-W's over the set of
patterns, updating the weights at the end of each epoch. Then we
iterate the whole procedure until the error is reduced to an
"acceptable" value. As we did with the single layer perceptron, we
modify the bias terms by treating them just like the weights from a
unit that always has an activation of 1.
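To make the whole procedure concrete, here is a minimal Python sketch of backpropagation with training by epoch on the XOR patterns, using the two-hidden-unit architecture from "Explorations in PDP" Chapter 5. The learning rate, momentum, initial weight range, and stopping criterion are my own illustrative choices, not values from the book or its simulator.

# Backpropagation sketch: 2 inputs -> 2 hidden units -> 1 output, trained on XOR.
import math, random

random.seed(1)
patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def f(u):                                   # the sigmoid activation function, Eq. (1)
    return 1.0 / (1.0 + math.exp(-u))

# Biases are handled as weights from an extra unit whose activation is always 1,
# so each unit has one more weight than it has real inputs.
W_hid = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
W_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
dW_hid = [[0.0] * 3 for _ in range(2)]      # previous weight changes, for the momentum term
dW_out = [0.0] * 3
eta, alpha = 0.5, 0.9                       # learning rate and momentum (illustrative values)

for epoch in range(10000):
    acc_hid = [[0.0] * 3 for _ in range(2)] # changes accumulated over the epoch
    acc_out = [0.0] * 3
    tss = 0.0
    for x, t in patterns:                   # forward pass for one pattern
        xin = x + [1.0]                     # append the always-on bias unit
        a_hid = [f(sum(w * v for w, v in zip(W_hid[i], xin))) for i in range(2)]
        hin = a_hid + [1.0]
        a_out = f(sum(w * v for w, v in zip(W_out, hin)))
        tss += (t - a_out) ** 2
        d_out = (t - a_out) * a_out * (1.0 - a_out)              # delta for the output unit
        d_hid = [W_out[i] * d_out * a_hid[i] * (1.0 - a_hid[i])  # deltas propagated backwards
                 for i in range(2)]
        for j in range(3):                  # accumulate, but don't apply yet (training by epoch)
            acc_out[j] += d_out * hin[j]
            for i in range(2):
                acc_hid[i][j] += d_hid[i] * xin[j]
    for j in range(3):                      # apply the weight changes at the end of the epoch
        dW_out[j] = eta * acc_out[j] + alpha * dW_out[j]
        W_out[j] += dW_out[j]
        for i in range(2):
            dW_hid[i][j] = eta * acc_hid[i][j] + alpha * dW_hid[i][j]
            W_hid[i][j] += dW_hid[i][j]
    if tss < 0.04:                          # an "acceptable" total sum of squared errors
        print("converged at epoch", epoch, "tss =", round(tss, 4))
        break
# (With an unlucky random seed the net can settle into a local minimum; re-run with another seed.)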
Next time, we will demonstrate some backpropagation simulations, using
the neural net simulation software from "Explorations in PDP".
Dave Beeman, University of Colorado
dbeeman "at" dogstar "dot" colorado "dot" edu
Thu Nov 1 16:06:15 MST 2001