Optimization of Perceptron Weights
How is it possible to apply these techniques to teach a perceptron, and specifically to train the weights to produce the right results? This is done on a case-by-case basis. Given specific inputs x(k), the corresponding desired output y(k) = t is needed; this is the target for the optimization.
In computer games, the desired output can be obtained in many ways. It can be designed by an expert (training), copied from human behavior (imitation), computed by another AI component (automated learning), or based on a previous result (bootstrapping). This process is discussed further in Chapter 35, "Designing Learning AI."
Optimization of perceptrons is also a best-guess process. In this case, the estimate is a set of weights. The output of the perceptron can be evaluated with these weights. Comparing the actual output to the desired output gives us an error. Then, given the current network, we compute a new estimate of the weights that produces a smaller error. Correcting the weights is done by the delta rule.
The Delta Rule Explained
Reducing the error requires adjusting the output toward its ideal value. Because only the weights can change, we do this with multiple small adjustments to the weights instead. The adjustments can be distributed proportionally to the contribution of each weight; the bigger the contribution to the error, the bigger the adjustment!
With the intuitive explanation covered, let's move into a more practical mode. First, the relative error E is computed from the difference between the actual output y and the desired output t, using an error function, which usually takes the form E = ½(t–y)². This is known as the least mean square (LMS) error. Using the square of the difference t–y implies that the error has the same magnitude whether the output is too high or too low. Ideally, this error should be zero, and the purpose of the training algorithm is to get it as low as possible.
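As a minimal sketch in code (the function name `lms_error` is illustrative, not from the text), the LMS error for a single training case can be computed as:

```python
def lms_error(target, output):
    """E = 1/2 * (t - y)^2: same magnitude whether y overshoots or undershoots."""
    return 0.5 * (target - output) ** 2
```

Squaring makes the error symmetric around the target, so an output of 0.6 and an output of 1.4 are penalized equally when the target is 1.0.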
It's now possible to compute the gradient of the error in each of the weights. This is denoted ∂E/∂wi, which can be read "gradient of the error E relative to weight wi." The symbol ∂ is not Greek, but is used to denote a partial derivative (that is, a derivative of a variable in terms of another). In this case, the partial derivative simplifies to the signed error t–y scaled by the input xi, but negated (see Figure 17.11):

∂E/∂wi = –xi(t–y)
Figure 17.11. Correcting the weights of the perceptron based on the output error, proportionally to the input values.
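As a quick sketch in code, this gradient can be computed directly from the input and the signed error (the function and argument names here are illustrative, not from the text):

```python
def error_gradient(x_i, target, output):
    """dE/dw_i = -x_i * (t - y): the negated input times the signed error."""
    return -x_i * (target - output)
```

Note that when an input is zero, its weight did not contribute to the error, so its gradient is zero and the weight is left alone.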
Then, to adjust the weights, we can use any gradient method (steepest descent, for example). The step Δwi, which is used to adjust each weight, is the difference between the desired output and the actual output (t–y) multiplied by the input xi:

Δwi = xi(t–y)
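A single delta-rule update for one weight can then be sketched as follows, assuming a hypothetical `eta` parameter for the learning rate (the text introduces it later as a value in [0,1]):

```python
def delta_rule_step(weight, x_i, target, output, eta=0.1):
    """Return the adjusted weight: w_i + eta * x_i * (t - y)."""
    # Stepping opposite the gradient -x_i*(t - y) moves y toward the target t.
    return weight + eta * x_i * (target - output)
```

For example, with a weight of 0.5, an input of 1.0, a target of 1.0, and an output of 0.0, the weight is nudged up toward 0.6.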
This part of the chapter provides a proof of how to compute the gradient ∂E/∂wi for each weight. Those with little interest in the proof can skip ahead to the last paragraph in this subsection. Understanding this outline opens the door to more advanced optimization techniques (found in academic papers), but isn't necessary to implement the code! The proof relies on differential calculus.
Noting that the error depends on the weights only through the net sum z, the chain rule can be applied:

∂E/∂wi = (∂E/∂z)(∂z/∂wi)
This expresses the gradient of the error in each weight as the gradient of the error relative to the net sum, ∂E/∂z, multiplied by the gradient of the net sum relative to the weight, ∂z/∂wi. Noting that z = Σi wixi, the second right-side term can be simplified greatly (the weights are independent of each other):

∂z/∂wi = xi
Now for the first right-side term. The net sum z is the output y before the activation function σ, so the gradient can be re-expressed (using the derivative of the error function):

∂E/∂z = –(t–y)σ'(z)
Because most perceptrons since the Adaline use the identity activation function σ(z) = z, we have σ'(z) = 1. This implies the following:

∂E/∂wi = –xi(t–y)
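Putting the three pieces of the derivation together in one line:

```latex
\frac{\partial E}{\partial w_i}
  = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial w_i}
  = \bigl(-(t - y)\,\sigma'(z)\bigr)\, x_i
  = -x_i\,(t - y) \qquad \text{since } \sigma'(z) = 1.
```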
There we have it. The gradient of the error for each weight is just the negated input multiplied by the signed error: –xi(t–y). This can be used to adjust the weights with one of the many gradient methods presented previously (usually stepping opposite to the gradient).
This result is surprisingly intuitive. When we want to adjust the output for a given input, all we have to play with is the weights. Each weight could be adjusted to contribute toward the target using (t–y). However, the contribution of a weight depends on the value of its input, so this step (t–y) is scaled by xi to influence only the relevant weights. Scaling the step by a learning rate in [0,1] keeps the adjustments small and allows for more stable learning.
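The whole procedure can be sketched as a short training loop. This is a minimal illustration, not code from the chapter: the names (`train_perceptron`, `eta`, `epochs`) and the sample data are assumptions, and the identity activation is used, as with the Adaline, so the output y is just the net sum z:

```python
def train_perceptron(samples, num_inputs, eta=0.1, epochs=100):
    """samples: list of (inputs, target) pairs; returns the learned weights."""
    weights = [0.0] * num_inputs
    for _ in range(epochs):
        for inputs, target in samples:
            # Forward pass: net sum z (identity activation, so y = z).
            y = sum(w * x for w, x in zip(weights, inputs))
            # Delta rule: each weight moves in proportion to its own input.
            for i in range(num_inputs):
                weights[i] += eta * inputs[i] * (target - y)
    return weights

# Hypothetical example: learn the linear target y = 2*x1 - x2 from three cases.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = train_perceptron(data, num_inputs=2)
```

With a small learning rate, the weights settle close to (2, -1), the values that reproduce all three targets exactly.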