Neural Networks

This is a continuation of the notes on Perceptrons, which are the basic building blocks for neural networks.

Non-Linear Regions

In some situations, such as that shown below, a non-linear separating line between two classes of data is needed.

An error function is a function that tells the algorithm how far they are from an ideal solution.

A commonly-invoked metaphor is that the space of possible models is a mountain, and the ideal solution is at sea level. The specific model at any point in time is a location on the mountain. The error function in this metaphor is a means for measuring the altitude at that location. Iteratively improving the model means moving the model in a direction that reduces the altitude. IE, this means walking downhill or descending the gradient, hence “gradient descent.”

Continuing the analogy, it is possible that some situations will result in the model ending up in a valley that is still above sea level, a “local minima.” This commonly occurs in machine learning.

There are a few requirements for the error function in order for it to be used in gradient descent. The error function must be:

  • Continuous, and not discrete. The shape of the error function must be like a mountain and not like a staircase, and
  • Differentiable.

A continuous error function implies that we also need continuous predictions. The distinction between discrete and continuous predictions is illustrated in the following image. In the continuous predictions case, the farther the points are from the black line, the more drastic the probabilities are.

Sigmoid Function

For discrete predictions, step functions like the one shown below are used. For continuous predictions, the sigmoid function is used.

It is simple to obtain this probability space. To do so, combine the linear function $Wx + b$ with the sigmoid function. Once we apply the sigmoid function to each of these values in the plane, we obtain numbers from zero to one for each point.


The softmax function is the equivalent of the sigmoid activation function, but when the problem has 3 or more classes. Softmax is defined as follows for linear function scores $Z_1, …, Z_n$:

$$P(class\ i) = \frac{e^{Z_i}}{e^{Z_1} + … + e^{Z_n}}$$

See the following example with three classes.

Class Score ($Z_i$) Formula Probability
A 2 $\frac{e^2}{e^2 + e^1 + e^0}$ 0.67
B 1 $\frac{e^1}{e^2 + e^1 + e^0}$ 0.24
C 0 $\frac{e^0}{e^2 + e^1 + e^0}$ 0.09

As an example, assume the score is $4x_1 + 5x_2 - 9 = score$. The sigmoid function is $sigmoid(x) = \frac{1}{1+e^{-x}}$, where in this case we substitute $x=score$. So, the points the points (1,1) and (-4,5) will have a 50% probability because at those points the score evaluates to 0. The Softmax calculation can be implemented in Python as follows.

import numpy as np

def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
    return result

One-Hot Encoding

One-Hot Encoding is a method to convert categorical variables from strings to numerical representations. The more types of categories, the more columns that are needed.

Maximum Likelihood

Maximum likelihood is a method for choosing the best model given a set of labels. The best model is the one that gives the highest likelihood to the events that actually happened. In other words, we pick that model that gives the existing labels the highest probability.

Consider an example where there are two models predicting the likelihood that a student gets accepted to a university. One predicts 80% probability, the other, 55%. If the student was accepted, then the model that predicted 80% probability would be deemed the better model.

By this method, the model on the right in the image below would be the better model.

We can also derive this result in a more analytically-rigorous fashion. Suppose that the probability of these points being blue is given by the values in the following image.

We know that $P(red)=1-P(blue)$, and we can calculate the probability that each of the four points are the color that they are: $P(red)=0.1$, $P(blue)=0.6$, $P(red)=0.7$, and $P(blue)=0.2$. Assuming the colors are independent events, $P(all)$ is the product of those individual probabilities.

$$P(all) = 0.1 \times 0.6 \times 0.7 \times 0.2 = 0.0084$$

For the other model, shown below, the $P(all)$ is much higher at 0.3024.

High values of $P(all)$ indicate the model classifies most points correctly with $P(all)$ indicating how accurate the model is.

Connection Between Maximizing Probability and Minimizing Error

Maximizing the Probability is equivalent to Minimizing the Error Function.

Cross Entropy

We also will take the logarithm of the $P(all)$ function so that these products are converted to sums, which will be easier to deal with. This is accomplished by taking the logarithm, a function which has the property that $log(ab)=log(a)+log(b)$.

The negative logarithms of the probabilities at each point are effectively error values, and the larger the value the “more incorrect” that point is.

Going forward, the logarithms will be natural logarithms. The sum of the negative logs of each of the probabilities is known as the cross entropy. Note that lower probabilities correspond to higher cross entropies.

So, the goal has shifted from maximizing a probability to minimizing a cross entropy in order to create the best model.

Cross Entropy Example

Stated another way, suppose we have a bunch of events and a bunch of probabilities. If it is very likely that those events happen based on those probabilities, then we have a small cross entropy. If it is unlikely, then have a large cross entropy.

Consider an example where there are three doors, and each door has a certain probability of having a gift behind it.

If we assume the most likely outcome for each of the three doors, the overall probability for that arrangement comes out to $0.8 \times 0.7 \times 0.9 = 0.504$, or roughly 50%.

The formula for cross-entropy follows, where $y_i$ is $1$ if that outcome was true, $0$ if not.

$$Cross-Entropy = - \sum_{i=1}^m y_i ln(p_i) + (1-y_i) ln(1-p_i)$$

The corresponding cross-entropy for the most likely outcomes of the example, above, is therefore:

$$CE[(1,1,0),(0.8,0.7,0.1)]$$ $$=-ln(0.8) - ln(0.7) - ln(1-0.1)$$ $$=0.69$$

The probabilities and entropies for each of the possible outcomes for this example are shown in row two of the following table.

Note that high probability outcomes correspond to low entropies, and vice versa. The cross-entropy calculation can be implemented in Python as follows.

import numpy as np

def cross_entropy(Y, P):
  Y = np.float_(Y)
  P = np.float_(P)

  return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

The cross-entropy reduces to being a calculation about how similar or different the outcome and probability vectors are.

Multi-Class Cross Entropy

Consider an example where rather than a binary $1$ or $0$ the outcome can be one of several classes.

The formula for cross entropy in a multi-class example is:

$$Cross-Entropy= - \sum^n_{i=1} \sum^m_{j=1} y_{ij} ln(p_{ij})$$


The following outcome will have the probability and cross-entropy shown.

Same as in the binary cross entropy example, cross-entropy is inversely proportional to the total probability of an outcome.

Error Function

Binary Classification

$$Error\ Function = - \frac{1}{m} \sum_{i=1}^m (1-y_i)(ln(1-\hat{y}_i)) + y_iln(\hat{y_i})$$

Since $\hat{y} = \sigma(Wx+b)$, we can substitute as follows.

$$E(W,b) = - \frac{1}{m} \sum_{i=1}^m (1-y_i)(ln(1-\sigma(Wx^{(i)}+b))) + y_iln(\sigma(Wx^{(i)}+b))$$

In that formula, $y_i$ is the label of the point $x^{(i)}$.

Multi-Class Classification

$$Error\ Function = -\frac{1}{m} \sum_{i=1}^m \sum_{j=1}^n y_{ij} ln(\hat{y_{ij}})$$

Minimizing the Error Function

We begin with random weights, $\sigma(Wx+b)$, which has a corresponding error function, $E(W,b)$ given by the formula in the section above. The goal when minimizing the error function is to descend the gradient function. This is accomplished by moving the classification line until we arrive at a configuration with lower error. The separating line is given by $\sigma(W’x+b')$ and the error is given by $E(W',b')$.

This is accomplished with the gradient descent algorithm.

Gradient Descent

The error function, as discussed above, is a function of the weights of the linear model. Assume there are only two dimensions, $W_1$ and $W_2$, for purposes of this explanation.

Then, the gradient of $E$ is given by the vector sum of the partial derivatives of $E$ with respect to $W_1$ and $W_2$. This gradient is the direction on the error function that we would move to increase the error function, so we use the negative: $-\nabla E$. To reduce the error, we move in the direction of the negative gradient.

To calculate the gradient, start with the initial prediction, $\hat{y}$:

$$\hat{y}=\sigma(Wx+b)$$ $$\hat{y}=\sigma(W_1 x_1 + … + W_n X_n + b)$$ $$\nabla E = (\frac{\partial E}{\partial W_1}, …, \frac{\partial E}{\partial W_n}, \frac{\partial E}{\partial W_b})$$

To avoid making dramatic changes, we introduce a learning rate, $\alpha$:

$$\alpha = 0.1$$ $$W'_i \leftarrow W_i - \alpha \frac{\partial E}{\partial W_i}$$ $$b' \leftarrow b - \alpha \frac{\partial E}{\partial b}$$

To simplify the calculations, we’ll consider the error that each point produces and take the derivative. The total error is the average of the errors at all the points. The error produced by each point is:

$$E= -y ln(\hat{y}) - (1 - y) ln(1-\hat{y})$$

To calculate the derivative of that error with respect to the weights, first calculate $\frac{\partial}{\partial w_j} \hat{y}$. We know that $\hat{y}=\sigma(Wx+b)$ and the derivative of the sigmoid function is $\sigma'(x)=\sigma(x)(1-\sigma(x))$. Via a short derivation:

$$\frac{\partial}{\partial w_j} \hat{y} = \frac{\partial}{\partial w_j} \sigma (Wx+b) = … = \hat{y} {(1-\hat{y})} x_j$$

Now, calculate the derivative of the error, $E$, at the point $x$ with respect to the weight $w_j$. Another short derivation gives:

$$\frac{\partial}{\partial w_j} E = \frac{\partial}{\partial w_j} [-y log(\hat{y}) - (1-y)log(1-\hat{y})] = … = -(y-\hat{y})x_j$$

A similar derivation gives:

$$\frac{\partial}{\partial b} E = -(y-\hat{y})$$

So, for a point with coordinates $(x_1, …, x_n)$, label $y$, and prediction $\hat{y}$, the gradient at that point is $(-(y-\hat{y}x_1, …, -(y-\hat{y})x_n, -(y-\hat{y})))$.

In summary, the gradient is $\nabla E = -(y-\hat{y})(x_1,…,x_n,1)$.

This means the gradient is a scalar times the coordinates of the point, and that scalar is a multiple of the difference between the label and the prediction. This means that:

  • the closer the label is to the prediction, the smaller the gradient, and
  • the farther the label is from the prediction, the larger the gradient.


  • If a point is well classified, we will get a small gradient.
  • If it’s poorly classified, the gradient will be large.

Gradient Descent Step

Since the gradient descent step just conists in subtracting a multiple of the gradient of the error function at every point, the weights update in the following way:

$$w'_i \leftarrow w_i - \alpha[-(y-\hat{y})x_i]$$

This is equivalent to

$$w'_i \leftarrow w_i + \alpha(y+\hat{y})x_i$$

The bias updates as well:

$$b'_i \leftarrow b_i + \alpha(y+\hat{y})x_i$$

Gradient Descent Algorithm

  1. Start with random weights: $w_1,…,w_n,b$

  2. For every point ($x_1,…,x_n$):

            2.1. For $i=1 … n$

                  2.1.1. Update $w'_i \leftarrow w_i - \alpha (\hat{y} - y)x_i$

                  2.1.2. Update $b'_i \leftarrow b_i - \alpha (\hat{y} - y)x_i$

  1. Repeat until error is small.

Note that this is very similar to what the perceptron algorithm was doing.

One distinction between the Gradient Descent and Perceptron algorithms pertains to how correctly classified points influence optimizations. In the case of the perceptron algorithm, once correctly classified, the point no longer impacts subsequent optimization steps. In the case of the gradient descent algorithm, correctly classified points continue to influence the boundary line, essentially continuing to “push it away” to further improve the model.


Recall that Perceptrons operate by creating a probability function where the points in the blue region have more chance of being blue. Likewise for the red region.