Neural Networks
This is a continuation of the notes on Perceptrons, which are the basic building blocks for neural networks.
NonLinear Regions
In some situations, such as that shown below, a nonlinear separating line between two classes of data is needed.
An error function is a function that tells the algorithm how far they are from an ideal solution.
A commonlyinvoked metaphor is that the space of possible models is a mountain, and the ideal solution is at sea level. The specific model at any point in time is a location on the mountain. The error function in this metaphor is a means for measuring the altitude at that location. Iteratively improving the model means moving the model in a direction that reduces the altitude. IE, this means walking downhill or descending the gradient, hence “gradient descent.”
Continuing the analogy, it is possible that some situations will result in the model ending up in a valley that is still above sea level, a “local minima.” This commonly occurs in machine learning.
There are a few requirements for the error function in order for it to be used in gradient descent. The error function must be:
 Continuous, and not discrete. The shape of the error function must be like a mountain and not like a staircase, and
 Differentiable.
A continuous error function implies that we also need continuous predictions. The distinction between discrete and continuous predictions is illustrated in the following image. In the continuous predictions case, the farther the points are from the black line, the more drastic the probabilities are.
Sigmoid Function
For discrete predictions, step functions like the one shown below are used. For continuous predictions, the sigmoid function is used.
It is simple to obtain this probability space. To do so, combine the linear function $Wx + b$ with the sigmoid function. Once we apply the sigmoid function to each of these values in the plane, we obtain numbers from zero to one for each point.
Softmax
The softmax function is the equivalent of the sigmoid activation function, but when the problem has 3 or more classes. Softmax is defined as follows for linear function scores $Z_1, …, Z_n$:
$$P(class\ i) = \frac{e^{Z_i}}{e^{Z_1} + … + e^{Z_n}}$$
See the following example with three classes.
Class  Score ($Z_i$)  Formula  Probability 

A  2  $\frac{e^2}{e^2 + e^1 + e^0}$  0.67 
B  1  $\frac{e^1}{e^2 + e^1 + e^0}$  0.24 
C  0  $\frac{e^0}{e^2 + e^1 + e^0}$  0.09 
As an example, assume the score is $4x_1 + 5x_2  9 = score$. The sigmoid function is $sigmoid(x) = \frac{1}{1+e^{x}}$, where in this case we substitute $x=score$. So, the points the points (1,1) and (4,5) will have a 50% probability because at those points the score evaluates to 0. The Softmax calculation can be implemented in Python as follows.
import numpy as np
def softmax(L):
expL = np.exp(L)
sumExpL = sum(expL)
result = []
for i in expL:
result.append(i*1.0/sumExpL)
return result
OneHot Encoding
OneHot Encoding is a method to convert categorical variables from strings to numerical representations. The more types of categories, the more columns that are needed.
Maximum Likelihood
Maximum likelihood is a method for choosing the best model given a set of labels. The best model is the one that gives the highest likelihood to the events that actually happened. In other words, we pick that model that gives the existing labels the highest probability.
Consider an example where there are two models predicting the likelihood that a student gets accepted to a university. One predicts 80% probability, the other, 55%. If the student was accepted, then the model that predicted 80% probability would be deemed the better model.
By this method, the model on the right in the image below would be the better model.
We can also derive this result in a more analyticallyrigorous fashion. Suppose that the probability of these points being blue is given by the values in the following image.
We know that $P(red)=1P(blue)$, and we can calculate the probability that each of the four points are the color that they are: $P(red)=0.1$, $P(blue)=0.6$, $P(red)=0.7$, and $P(blue)=0.2$. Assuming the colors are independent events, $P(all)$ is the product of those individual probabilities.
$$P(all) = 0.1 \times 0.6 \times 0.7 \times 0.2 = 0.0084$$
For the other model, shown below, the $P(all)$ is much higher at 0.3024.
High values of $P(all)$ indicate the model classifies most points correctly with $P(all)$ indicating how accurate the model is.
Connection Between Maximizing Probability and Minimizing Error
Maximizing the Probability is equivalent to Minimizing the Error Function.
Cross Entropy
We also will take the logarithm of the $P(all)$ function so that these products are converted to sums, which will be easier to deal with. This is accomplished by taking the logarithm, a function which has the property that $log(ab)=log(a)+log(b)$.
The negative logarithms of the probabilities at each point are effectively error values, and the larger the value the “more incorrect” that point is.
Going forward, the logarithms will be natural logarithms. The sum of the negative logs of each of the probabilities is known as the cross entropy. Note that lower probabilities correspond to higher cross entropies.
So, the goal has shifted from maximizing a probability to minimizing a cross entropy in order to create the best model.
Cross Entropy Example
Stated another way, suppose we have a bunch of events and a bunch of probabilities. If it is very likely that those events happen based on those probabilities, then we have a small cross entropy. If it is unlikely, then have a large cross entropy.
Consider an example where there are three doors, and each door has a certain probability of having a gift behind it.
If we assume the most likely outcome for each of the three doors, the overall probability for that arrangement comes out to $0.8 \times 0.7 \times 0.9 = 0.504$, or roughly 50%.
The formula for crossentropy follows, where $y_i$ is $1$ if that outcome was true, $0$ if not.
$$CrossEntropy =  \sum_{i=1}^m y_i ln(p_i) + (1y_i) ln(1p_i)$$
The corresponding crossentropy for the most likely outcomes of the example, above, is therefore:
$$CE[(1,1,0),(0.8,0.7,0.1)]$$ $$=ln(0.8)  ln(0.7)  ln(10.1)$$ $$=0.69$$
The probabilities and entropies for each of the possible outcomes for this example are shown in row two of the following table.
Note that high probability outcomes correspond to low entropies, and vice versa. The crossentropy calculation can be implemented in Python as follows.
import numpy as np
def cross_entropy(Y, P):
Y = np.float_(Y)
P = np.float_(P)
return np.sum(Y * np.log(P) + (1  Y) * np.log(1  P))
The crossentropy reduces to being a calculation about how similar or different the outcome and probability vectors are.
MultiClass Cross Entropy
Consider an example where rather than a binary $1$ or $0$ the outcome can be one of several classes.
The formula for cross entropy in a multiclass example is:
$$CrossEntropy=  \sum^n_{i=1} \sum^m_{j=1} y_{ij} ln(p_{ij})$$
where:
The following outcome will have the probability and crossentropy shown.
Same as in the binary cross entropy example, crossentropy is inversely proportional to the total probability of an outcome.
Error Function
Binary Classification
$$Error\ Function =  \frac{1}{m} \sum_{i=1}^m (1y_i)(ln(1\hat{y}_i)) + y_iln(\hat{y_i})$$
Since $\hat{y} = \sigma(Wx+b)$, we can substitute as follows.
$$E(W,b) =  \frac{1}{m} \sum_{i=1}^m (1y_i)(ln(1\sigma(Wx^{(i)}+b))) + y_iln(\sigma(Wx^{(i)}+b))$$
In that formula, $y_i$ is the label of the point $x^{(i)}$.
MultiClass Classification
$$Error\ Function = \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^n y_{ij} ln(\hat{y_{ij}})$$
Minimizing the Error Function
We begin with random weights, $\sigma(Wx+b)$, which has a corresponding error function, $E(W,b)$ given by the formula in the section above. The goal when minimizing the error function is to descend the gradient function. This is accomplished by moving the classification line until we arrive at a configuration with lower error. The separating line is given by $\sigma(W’x+b')$ and the error is given by $E(W',b')$.
This is accomplished with the gradient descent algorithm.
Gradient Descent
The error function, as discussed above, is a function of the weights of the linear model. Assume there are only two dimensions, $W_1$ and $W_2$, for purposes of this explanation.
Then, the gradient of $E$ is given by the vector sum of the partial derivatives of $E$ with respect to $W_1$ and $W_2$. This gradient is the direction on the error function that we would move to increase the error function, so we use the negative: $\nabla E$. To reduce the error, we move in the direction of the negative gradient.
To calculate the gradient, start with the initial prediction, $\hat{y}$:
$$\hat{y}=\sigma(Wx+b)$$ $$\hat{y}=\sigma(W_1 x_1 + … + W_n X_n + b)$$ $$\nabla E = (\frac{\partial E}{\partial W_1}, …, \frac{\partial E}{\partial W_n}, \frac{\partial E}{\partial W_b})$$
To avoid making dramatic changes, we introduce a learning rate, $\alpha$:
$$\alpha = 0.1$$ $$W'_i \leftarrow W_i  \alpha \frac{\partial E}{\partial W_i}$$ $$b' \leftarrow b  \alpha \frac{\partial E}{\partial b}$$
To simplify the calculations, we’ll consider the error that each point produces and take the derivative. The total error is the average of the errors at all the points. The error produced by each point is:
$$E= y ln(\hat{y})  (1  y) ln(1\hat{y})$$
To calculate the derivative of that error with respect to the weights, first calculate $\frac{\partial}{\partial w_j} \hat{y}$. We know that $\hat{y}=\sigma(Wx+b)$ and the derivative of the sigmoid function is $\sigma'(x)=\sigma(x)(1\sigma(x))$. Via a short derivation:
$$\frac{\partial}{\partial w_j} \hat{y} = \frac{\partial}{\partial w_j} \sigma (Wx+b) = … = \hat{y} {(1\hat{y})} x_j$$
Now, calculate the derivative of the error, $E$, at the point $x$ with respect to the weight $w_j$. Another short derivation gives:
$$\frac{\partial}{\partial w_j} E = \frac{\partial}{\partial w_j} [y log(\hat{y})  (1y)log(1\hat{y})] = … = (y\hat{y})x_j$$
A similar derivation gives:
$$\frac{\partial}{\partial b} E = (y\hat{y})$$
So, for a point with coordinates $(x_1, …, x_n)$, label $y$, and prediction $\hat{y}$, the gradient at that point is $((y\hat{y}x_1, …, (y\hat{y})x_n, (y\hat{y})))$.
In summary, the gradient is $\nabla E = (y\hat{y})(x_1,…,x_n,1)$.
This means the gradient is a scalar times the coordinates of the point, and that scalar is a multiple of the difference between the label and the prediction. This means that:
 the closer the label is to the prediction, the smaller the gradient, and
 the farther the label is from the prediction, the larger the gradient.
Therefore:
 If a point is well classified, we will get a small gradient.
 If it’s poorly classified, the gradient will be large.
Gradient Descent Step
Since the gradient descent step just conists in subtracting a multiple of the gradient of the error function at every point, the weights update in the following way:
$$w'_i \leftarrow w_i  \alpha[(y\hat{y})x_i]$$
This is equivalent to
$$w'_i \leftarrow w_i + \alpha(y+\hat{y})x_i$$
The bias updates as well:
$$b'_i \leftarrow b_i + \alpha(y+\hat{y})x_i$$
Gradient Descent Algorithm

Start with random weights: $w_1,…,w_n,b$

For every point ($x_1,…,x_n$):
2.1. For $i=1 … n$
2.1.1. Update $w'_i \leftarrow w_i  \alpha (\hat{y}  y)x_i$
2.1.2. Update $b'_i \leftarrow b_i  \alpha (\hat{y}  y)x_i$
 Repeat until error is small.
Note that this is very similar to what the perceptron algorithm was doing.
One distinction between the Gradient Descent and Perceptron algorithms pertains to how correctly classified points influence optimizations. In the case of the perceptron algorithm, once correctly classified, the point no longer impacts subsequent optimization steps. In the case of the gradient descent algorithm, correctly classified points continue to influence the boundary line, essentially continuing to “push it away” to further improve the model.
Perceptron
Recall that Perceptrons operate by creating a probability function where the points in the blue region have more chance of being blue. Likewise for the red region.