Neural Networks
This is a continuation of the notes on Perceptrons, which are the basic building blocks for neural networks.
Non-Linear Regions
In some situations, such as that shown below, a non-linear separating line between two classes of data is needed.
An error function is a function that tells the algorithm how far they are from an ideal solution.
A commonly-invoked metaphor is that the space of possible models is a mountain, and the ideal solution is at sea level. The specific model at any point in time is a location on the mountain. The error function in this metaphor is a means for measuring the altitude at that location. Iteratively improving the model means moving the model in a direction that reduces the altitude. IE, this means walking downhill or descending the gradient, hence “gradient descent.”
Continuing the analogy, it is possible that some situations will result in the model ending up in a valley that is still above sea level, a “local minima.” This commonly occurs in machine learning.
There are a few requirements for the error function in order for it to be used in gradient descent. The error function must be:
- Continuous, and not discrete. The shape of the error function must be like a mountain and not like a staircase, and
- Differentiable.
A continuous error function implies that we also need continuous predictions. The distinction between discrete and continuous predictions is illustrated in the following image. In the continuous predictions case, the farther the points are from the black line, the more drastic the probabilities are.
Sigmoid Function
For discrete predictions, step functions like the one shown below are used. For continuous predictions, the sigmoid function is used.
It is simple to obtain this probability space. To do so, combine the linear function
Softmax
The softmax function is the equivalent of the sigmoid activation function, but when the problem has 3 or more classes. Softmax is defined as follows for linear function scores
See the following example with three classes.
Class | Score ( |
Formula | Probability |
---|---|---|---|
A | 2 | 0.67 | |
B | 1 | 0.24 | |
C | 0 | 0.09 |
As an example, assume the score is
import numpy as np
def softmax(L):
expL = np.exp(L)
sumExpL = sum(expL)
result = []
for i in expL:
result.append(i*1.0/sumExpL)
return result
One-Hot Encoding
One-Hot Encoding is a method to convert categorical variables from strings to numerical representations. The more types of categories, the more columns that are needed.
Maximum Likelihood
Maximum likelihood is a method for choosing the best model given a set of labels. The best model is the one that gives the highest likelihood to the events that actually happened. In other words, we pick that model that gives the existing labels the highest probability.
Consider an example where there are two models predicting the likelihood that a student gets accepted to a university. One predicts 80% probability, the other, 55%. If the student was accepted, then the model that predicted 80% probability would be deemed the better model.
By this method, the model on the right in the image below would be the better model.
We can also derive this result in a more analytically-rigorous fashion. Suppose that the probability of these points being blue is given by the values in the following image.
We know that
For the other model, shown below, the
High values of
Connection Between Maximizing Probability and Minimizing Error
Maximizing the Probability is equivalent to Minimizing the Error Function.
Cross Entropy
We also will take the logarithm of the
The negative logarithms of the probabilities at each point are effectively error values, and the larger the value the “more incorrect” that point is.
Going forward, the logarithms will be natural logarithms. The sum of the negative logs of each of the probabilities is known as the cross entropy. Note that lower probabilities correspond to higher cross entropies.
So, the goal has shifted from maximizing a probability to minimizing a cross entropy in order to create the best model.
Cross Entropy Example
Stated another way, suppose we have a bunch of events and a bunch of probabilities. If it is very likely that those events happen based on those probabilities, then we have a small cross entropy. If it is unlikely, then have a large cross entropy.
Consider an example where there are three doors, and each door has a certain probability of having a gift behind it.
If we assume the most likely outcome for each of the three doors, the overall probability for that arrangement comes out to
The formula for cross-entropy follows, where
The corresponding cross-entropy for the most likely outcomes of the example, above, is therefore:
The probabilities and entropies for each of the possible outcomes for this example are shown in row two of the following table.
Note that high probability outcomes correspond to low entropies, and vice versa. The cross-entropy calculation can be implemented in Python as follows.
import numpy as np
def cross_entropy(Y, P):
Y = np.float_(Y)
P = np.float_(P)
return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
The cross-entropy reduces to being a calculation about how similar or different the outcome and probability vectors are.
Multi-Class Cross Entropy
Consider an example where rather than a binary
The formula for cross entropy in a multi-class example is:
where:
The following outcome will have the probability and cross-entropy shown.
Same as in the binary cross entropy example, cross-entropy is inversely proportional to the total probability of an outcome.
Error Function
Binary Classification
Since
In that formula,
Multi-Class Classification
Minimizing the Error Function
We begin with random weights,
This is accomplished with the gradient descent algorithm.
Gradient Descent
The error function, as discussed above, is a function of the weights of the linear model. Assume there are only two dimensions,
Then, the gradient of
To calculate the gradient, start with the initial prediction,
To avoid making dramatic changes, we introduce a learning rate,
To simplify the calculations, we’ll consider the error that each point produces and take the derivative. The total error is the average of the errors at all the points. The error produced by each point is:
To calculate the derivative of that error with respect to the weights, first calculate
Now, calculate the derivative of the error,
A similar derivation gives:
So, for a point with coordinates
In summary, the gradient is
This means the gradient is a scalar times the coordinates of the point, and that scalar is a multiple of the difference between the label and the prediction. This means that:
- the closer the label is to the prediction, the smaller the gradient, and
- the farther the label is from the prediction, the larger the gradient.
Therefore:
- If a point is well classified, we will get a small gradient.
- If it’s poorly classified, the gradient will be large.
Gradient Descent Step
Since the gradient descent step just conists in subtracting a multiple of the gradient of the error function at every point, the weights update in the following way:
This is equivalent to
The bias updates as well:
Gradient Descent Algorithm
-
Start with random weights:
-
For every point (
):
2.1. For
2.1.1. Update
2.1.2. Update
- Repeat until error is small.
Note that this is very similar to what the perceptron algorithm was doing.
One distinction between the Gradient Descent and Perceptron algorithms pertains to how correctly classified points influence optimizations. In the case of the perceptron algorithm, once correctly classified, the point no longer impacts subsequent optimization steps. In the case of the gradient descent algorithm, correctly classified points continue to influence the boundary line, essentially continuing to “push it away” to further improve the model.
Perceptron
Recall that Perceptrons operate by creating a probability function where the points in the blue region have more chance of being blue. Likewise for the red region.