Gradient Descent

This page discusses the theory behind backpropagation using another error function commonly used with neural networks. This one is called mean squared error.

Gradient Descent with Squared Errors Math Overview

Our overall goal is to find the weights for our neural networks. These weights need to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are: the error. One common metric is the sum of squared errors, SSE:

$$E=\frac{1}{2} \sum_\mu \sum_j [y_j^\mu - \hat{y}_j^\mu]^2$$


  • $\hat{y}$: prediction
  • $y$: true value
  • $j$: output units
  • $\mu$: data points

  • The inside sum, over $j$, says: for each output unit, find the difference between the true value, $y$, and the predicted value from the network $\hat{y}$.

  • The outer sum, over $\mu$, says: for each data point, calculate the inner sum. Namely, the squared differences for each output unit.

Summing up those squared differences for each data point gives the overall error for all output predictions for all data points.

Recall that the output of a neural network, the prediction, depends on the weights:

$$\hat{y}_j^\mu = f \Bigg(\sum_i w_{ij} x_i^\mu \Bigg)$$

Plugging the prediction equation above into the error formula:

$$E=\frac{1}{2} \sum_\mu \sum_j \Bigg[y_j^\mu - f \Bigg(\sum_i w_{ij} x_i^\mu \Bigg)\Bigg]^2$$

Mountain Analogy for the Gradient

Mathematically, the gradient is a derivative generalized to functions with more than one variable. The gradient of the error function can be visualized as a mountain. Since the error is a function of the weights (reference formula above), we take small steps down the mountain by iteratively improving the weights.

The following image is an example of the error of a neural network with two inputs, and accordingly, two weights. This can be read like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.

At each step, we calculate the error and the gradient. Using those, we determine how much to change each weight. Repeating this process eventually finds the weights that are close to the minimum of the error function. This is symbolized by the black dot in the middle of the image.

Note that the weights will go to wherever the gradient takes them. So, they can end up where the error is relatvely low, but not the absolute lowest on the overall “mountain.” These spots are referred to as local minima.

There are methods to help avoid this, such as using momentum.

Gradient Descent Math Details

Notes in this section taken on this great lecture courtesy of Udacity.

The following image shows a simple neural network.

Each prediction, $\hat{y}$, is given by the following.

$$\hat{y} = f \big(\sum_i w_i x_i \big)$$

To optimize the model, we can present it with data that we know to be true and set the model parameters, or “weights,” to match that data. In order to optimize the model, we need some notion of how much error the network has relative to some true values.

The obvious choice is the difference between the true values, $y$, and the predicted values, $\hat{y}$. To get rid of negative error values and to “punish” outlier, large errors more than smaller errors, we introduce a square term.

$$E = (y-\hat{y})^2$$

This represents error for a single prediction. To get the error for the entire dataset, we introduce a summation, where $\mu$ represent the data records. So, the summation now represents the summation over the entire dataset. The $\frac{1}{2}$ is introduced for convenience later.

This formulation is known as the “Sum of the Squared Errors,” or SSE for short. The smaller it is, the better the network is performing on the data.

$$E = \frac{1}{2} \sum_\mu \big( y^\mu - \hat{y}^\mu)^2$$

The data records contain input labels, $x$, and target labels, $y$. Remember that the predictions, $\hat{y}$, is the linear combination of the weights, $w_i$, and inputs, $x_i$, passed through the activation function, $f$. So, we can substitute it in here.

$$E = \frac{1}{2} \sum_\mu \big( y^\mu - f(\sum_i w_i x_i^\mu) \big)^2$$

Now we see that the error is a function of the weights. The weights let us tune the network.

What follows is a simple depiction of the error with one weight. Our goal is to find the weight at the bottom of the bowl. Starting at some random weight, we make several steps in the direction of the minimum weight. The direction of that step is called the opposite (negative) of the gradient.

Each of these weight steps is represented in the following equations as $\Delta w_i$. The formula to update the weights is given by:

$$w_i = w_i + \Delta w_i$$

The following says that the weight step is proportional to the negative of the gradient. The gradient is the partial derivative of the error function with respect to each weight, $\frac{\partial E}{\partial w_i}$.

$$\Delta w_i \propto - \frac{\partial E}{\partial w_i}$$

We set an arbitrary scaling factor called the “learning rate,” $\eta$, which allows us to set the size of each gradient descent step.

$$\Delta w_i = - \eta \frac{\partial E}{\partial w_i}$$

Calculating the gradient obviously requires multivariable calculus, since we are taking partial derivatives.

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}(y-\hat{y})^2$$

$$= \frac{\partial}{\partial w_i} \frac{1}{2}(y-\hat{y}(w_i))^2$$

Proceeding with taking the derivative requires the chain rule. The chain rule says that taking the derivative of the convolution of two functions is equal to the product of the partial derivative of the outer function with respect to the inner, and the partial derivative of the inner function with respect to the variable it is a function of.

As a specific example:

$$\frac{\partial}{\partial z}p(q(z)) = \frac{\partial p}{\partial q} \frac{\partial q}{\partial z}$$

Now we can plug in the specifics of the derivative we are taking.

$$q = (y - \hat{y}(w_i))$$

$$p = \frac{1}{2}q(w_i)^2$$

$$\frac{\partial E}{\partial w_i} = (y - \hat{y}) \frac{\partial}{\partial w_i}(y-\hat{y})$$

$$=-(y-\hat{y}) \frac{\partial \hat{y}}{\partial w_i}$$

Recall that $\hat{y} = f(h)$, where $h=\sum_i w_i x_i$.

$$\frac{\partial E}{\partial w_i} = - (y-\hat{y}) \frac{\partial \hat{y}}{\partial w_i}$$

$$=-(y-\hat{y}) f’(h) \frac{\partial}{\partial w_i} \sum w_i x_i$$

Note that since $\sum_i w_i x_i = [w_1 x_1 + w_2 x_2 + … + w_n x_n]$, $\frac{\partial}{\partial w_1}\sum_i w_i x_i = x_1 + 0 + … + 0$, so $\frac{\partial}{\partial w_i}\sum_i w_i x_i$.

It follows that

$$\frac{\partial E}{\partial w_i} = -(y-\hat{y}) f’(h) x_i$$

Recalling that $\Delta w_i = - \eta \frac{\partial E}{\partial w_i}$:

$$\Delta w_i = \eta (y - \hat{y}) f’(h) x_i$$

For convenience, we can define an error term, $\delta$.

$$\delta = (y - \hat{y})f’(h)$$

Then, we can write our weight update as

$$w_i = w_i + \eta \delta x_i$$

We might also work with multiple output units. This can be thought of as stacking the architecture from the single output network, but connecting the input units to the new output units.

Now, the total network error would include the error of each output, summed together.

The gradient descent step can be extended to a network with multiple outputs by calculating an error term for each output neuron, denoted with a subscript, $j$.