Gradient Descent
This page discusses the theory behind backpropagation using another error function commonly used with neural networks: the sum of squared errors (SSE).
Gradient Descent with Squared Errors Math Overview
Our overall goal is to find the weights for our neural networks. These weights need to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are: the error. One common metric is the sum of squared errors, SSE:

$$E = \frac{1}{2}\sum_\mu \sum_j \left[ y_j^\mu - \hat{y}_j^\mu \right]^2$$

where:

- $\hat{y}$: prediction
- $y$: true value
- $j$: output units
- $\mu$: data points

The inside sum, over $j$, says: for each output unit, find the squared difference between the true value, $y_j^\mu$, and the predicted value from the network, $\hat{y}_j^\mu$. The outer sum, over $\mu$, says: for each data point, calculate that inner sum of squared differences across the output units.
Summing up those squared differences for each data point gives the overall error for all output predictions for all data points.
Recall that the output of a neural network, the prediction, depends on the weights:

$$\hat{y}_j^\mu = f\left( \sum_i w_{ij} x_i^\mu \right)$$

Plugging the prediction equation above into the error formula:

$$E = \frac{1}{2}\sum_\mu \sum_j \left[ y_j^\mu - f\left( \sum_i w_{ij} x_i^\mu \right) \right]^2$$
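As a sketch of this formula in code (assuming NumPy, a single output unit, and a sigmoid activation, none of which are fixed by the math above):

```python
import numpy as np

def sse(weights, X, y):
    """Sum of squared errors for a single-layer, single-output network.

    X: (n_records, n_inputs) inputs, y: (n_records,) targets,
    weights: (n_inputs,). A sigmoid activation f is assumed.
    """
    h = X @ weights                      # h^mu = sum_i w_i x_i^mu
    predictions = 1 / (1 + np.exp(-h))   # y_hat^mu = f(h^mu)
    return 0.5 * np.sum((y - predictions) ** 2)

X = np.array([[0.1, 0.3], [0.2, 0.8]])
y = np.array([1.0, 0.0])
print(sse(np.array([0.5, -0.5]), X, y))
```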
Mountain Analogy for the Gradient
Mathematically, the gradient is a derivative generalized to functions with more than one variable. The gradient of the error function can be visualized as a mountain. Since the error is a function of the weights (see the formula above), we take small steps down the mountain by iteratively improving the weights.
The following image is an example of the error of a neural network with two inputs, and accordingly, two weights. This can be read like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.
At each step, we calculate the error and the gradient. Using those, we determine how much to change each weight. Repeating this process eventually finds the weights that are close to the minimum of the error function. This is symbolized by the black dot in the middle of the image.
Note that the weights will go to wherever the gradient takes them. So, they can end up where the error is relatively low, but not the absolute lowest on the overall “mountain.” These spots are referred to as local minima.
There are methods to help avoid this, such as using momentum.
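A minimal sketch of that loop on a toy two-weight error surface (the quadratic bowl and all names here are invented for illustration):

```python
import numpy as np

def gradient_descent(grad_fn, w_init, learning_rate=0.1, n_steps=200):
    """Repeatedly step the weights opposite the gradient of the error."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        w -= learning_rate * grad_fn(w)   # small step "down the mountain"
    return w

# Toy bowl: E(w) = (w_1 - 1)^2 + (w_2 + 2)^2, minimized at (1, -2).
grad = lambda w: np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])
print(gradient_descent(grad, [5.0, 5.0]))  # approaches [1, -2]
```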
Gradient Descent Math Details
The notes in this section are based on this great lecture, courtesy of Udacity.
The following image shows a simple neural network.
Each prediction, $\hat{y}$, is calculated from the inputs and the weights:

$$\hat{y} = f\left( \sum_i w_i x_i \right)$$
To optimize the model, we can present it with data that we know to be true and set the model parameters, or “weights,” to match that data. Doing so requires some notion of how much error the network has relative to the true values.
The obvious choice is the difference between the true value, $y$, and the prediction, $\hat{y}$. We square that difference so the error is always positive and larger errors are penalized more than smaller ones:

$$E = \frac{1}{2}\left( y - \hat{y} \right)^2$$

This represents the error for a single prediction. To get the error for the entire dataset, we introduce a summation, where $\mu$ indexes the data records:

$$E = \frac{1}{2}\sum_\mu \left( y^\mu - \hat{y}^\mu \right)^2$$
This formulation is known as the “Sum of the Squared Errors,” or SSE for short. The smaller it is, the better the network is performing on the data.
The data records contain inputs, $x_i$, and target labels, $y$. The prediction, in turn, depends on the weights:

$$\hat{y}^\mu = f\left( \sum_i w_i x_i^\mu \right)$$

Substituting this into the error gives:

$$E = \frac{1}{2}\sum_\mu \left[ y^\mu - f\left( \sum_i w_i x_i^\mu \right) \right]^2$$
Now we see that the error is a function of the weights. The weights let us tune the network.
What follows is a simple depiction of the error with one weight. Our goal is to find the weight at the bottom of the bowl. Starting at some random weight, we take several steps toward the minimum. The direction of each step is the opposite (negative) of the gradient.
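The same bowl can also be seen numerically. A tiny sketch (one data point and a linear activation, both assumed purely for illustration):

```python
import numpy as np

# Toy one-weight model: y_hat = w * x, so E(w) = 1/2 (y - w x)^2.
x, y = 2.0, 4.0
for w in np.linspace(0.0, 4.0, 9):
    print(f"w = {w:4.1f}  E(w) = {0.5 * (y - w * x) ** 2:5.2f}")
# The printed errors fall and then rise again: the minimum is at w = 2.
```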
Each of these weight steps is represented in the following equations as $\Delta w_i$.

The following says that the weight step is proportional to the negative of the gradient, where the gradient is the partial derivative of the error function with respect to each weight, $w_i$:

$$\Delta w_i \propto -\frac{\partial E}{\partial w_i}$$

We set an arbitrary scaling factor called the “learning rate,” $\eta$, which lets us control the size of each step:

$$\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$$
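In code, the update itself is one line (a sketch; the gradient here is a stand-in value, since computing it is the subject of the rest of this section):

```python
import numpy as np

def weight_step(weights, gradient, learning_rate=0.5):
    """One gradient descent update: w_i <- w_i - eta * dE/dw_i."""
    return weights - learning_rate * gradient

w = np.array([0.1, -0.2])
grad = np.array([0.4, -0.6])      # stand-in for dE/dw, derived below
print(weight_step(w, grad))       # [-0.1  0.1]
```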
Calculating the gradient requires multivariable calculus, since we are taking partial derivatives.

Proceeding with the derivative requires the chain rule. The chain rule says that the derivative of a composition of two functions is the product of the derivative of the outer function with respect to the inner function and the derivative of the inner function with respect to its own variable.
As a specific example, for an outer function $p$ and an inner function $q(z)$:

$$\frac{\partial}{\partial z}\, p\big(q(z)\big) = \frac{\partial p}{\partial q} \cdot \frac{\partial q}{\partial z}$$
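For instance, with concrete functions (chosen here only to make the rule tangible):

$$p(q) = q^2, \quad q(z) = 3z \;\Rightarrow\; \frac{\partial}{\partial z}\, p\big(q(z)\big) = \underbrace{2q}_{\partial p / \partial q} \cdot \underbrace{3}_{\partial q / \partial z} = 6q = 18z$$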
Now we can plug in the specifics of the derivative we are taking:

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2}\left( y - \hat{y} \right)^2 = -\left( y - \hat{y} \right) \frac{\partial \hat{y}}{\partial w_i}$$

Recall that $\hat{y} = f(h)$, where $h = \sum_i w_i x_i$ is the linear combination of the inputs and the weights. Applying the chain rule once more:

$$\frac{\partial E}{\partial w_i} = -\left( y - \hat{y} \right) f'(h)\, \frac{\partial h}{\partial w_i}$$

Note that since $h = \sum_i w_i x_i$, every term in the sum except the one containing $w_i$ is a constant with respect to $w_i$, so:

$$\frac{\partial h}{\partial w_i} = x_i$$

It follows that:

$$\frac{\partial E}{\partial w_i} = -\left( y - \hat{y} \right) f'(h)\, x_i$$

Recalling that $\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$, the weight step becomes:

$$\Delta w_i = \eta \left( y - \hat{y} \right) f'(h)\, x_i$$
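This result is easy to sanity-check numerically. A sketch assuming a sigmoid activation (the finite-difference comparison is my own addition, not part of the lecture):

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5, 0.3])
y = 0.7

h = np.dot(w, x)
y_hat = sigmoid(h)
f_prime = y_hat * (1 - y_hat)             # sigmoid'(h) = f(h) (1 - f(h))
analytic = -(y - y_hat) * f_prime * x     # dE/dw_i = -(y - y_hat) f'(h) x_i

# Central finite differences on E(w) = 1/2 (y - f(w . x))^2 should agree.
E = lambda wv: 0.5 * (y - sigmoid(np.dot(wv, x))) ** 2
eps = 1e-6
numeric = np.array([
    (E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric))     # True
```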
For convenience, we can define an error term, $\delta$:

$$\delta = \left( y - \hat{y} \right) f'(h)$$

Then, we can write our weight update as:

$$w_i = w_i + \eta\, \delta\, x_i$$
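Putting the pieces together, one full update for a single output unit might look like this (a sketch assuming a sigmoid activation; the names are my own):

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def gradient_descent_step(w, x, y, learnrate=0.5):
    """w_i <- w_i + eta * delta * x_i for one data point."""
    h = np.dot(w, x)                            # linear combination of inputs and weights
    y_hat = sigmoid(h)                          # the network's prediction
    delta = (y - y_hat) * y_hat * (1 - y_hat)   # error term (y - y_hat) f'(h)
    return w + learnrate * delta * x

w = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
print(gradient_descent_step(w, x, y=1.0))
```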
We might also work with multiple output units. This can be thought of as stacking the architecture from the single output network, but connecting the input units to the new output units.
Now, the total network error would include the error of each output, summed together.
The gradient descent step can be extended to a network with multiple outputs by calculating an error term for each output neuron, denoted with a subscript, $j$:

$$\delta_j = \left( y_j - \hat{y}_j \right) f'(h_j)$$

The update for the weight connecting input $i$ to output $j$ is then:

$$\Delta w_{ij} = \eta\, \delta_j\, x_i$$
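As a sketch of the multi-output case (again assuming a sigmoid activation; the weight-matrix shape is my own convention):

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def gradient_descent_step_multi(W, x, y, learnrate=0.5):
    """One update for W of shape (n_inputs, n_outputs)."""
    h = x @ W                                   # h_j = sum_i w_ij x_i
    y_hat = sigmoid(h)
    delta = (y - y_hat) * y_hat * (1 - y_hat)   # one error term delta_j per output
    return W + learnrate * np.outer(x, delta)   # Delta w_ij = eta * delta_j * x_i

W = np.array([[0.5, -0.2],
              [0.1,  0.4]])   # 2 inputs, 2 outputs
x = np.array([1.0, 2.0])
y = np.array([1.0, 0.0])
print(gradient_descent_step_multi(W, x, y))
```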