Gradient Descent Implementations
Single Update Example
From the gradient descent notes, we saw that one weight update is calculated as:

$$\Delta w_i = \eta \, \delta \, x_i$$

The error term is

$$\delta = (y - \hat{y}) \, f'\!\left(\sum_i w_i x_i\right)$$

In the error term, $y$ is the target, $\hat{y} = f(h)$ is the network output, $h = \sum_i w_i x_i$ is the linear combination of inputs and weights, and $f'(h)$ is the derivative of the activation function.
import numpy as np
def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))
learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)
# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])
Calculate one gradient descent step for each weight:
# Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)
# Calculate output of neural network
nn_output = sigmoid(h)
# Calculate error of neural network
error = y - nn_output
# Calculate the error term
error_term = error * sigmoid_prime(h)
# Calculate change in weights
del_w = learnrate * error_term * x
print('Neural Network output: {:1.3f}'.format(nn_output))
print(' Amount of Error: {:1.3f}'.format(error))
print(' Change in Weights: ' + str(del_w))
Neural Network output: 0.690
Amount of Error: -0.190
Change in Weights: [-0.02031869 -0.04063738 -0.06095608 -0.08127477]
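As a quick sanity check (a sketch that is not part of the original example), the analytic derivative returned by sigmoid_prime can be compared against a numerical finite-difference approximation:

```python
# Compare sigmoid_prime to a central finite-difference estimate at an arbitrary point
h_test = 0.7
numeric = (sigmoid(h_test + 1e-6) - sigmoid(h_test - 1e-6)) / 2e-6
print(abs(numeric - sigmoid_prime(h_test)))  # should be vanishingly small (~1e-12)
```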
Graduate School Admissions Example
From this point, this page is very similar to the Neural Network Admissions Example. It relies on the same base data and contains much of the same code.
The code in the previous section demonstrated how to update the weights for a single update:

$$\Delta w_i = \eta \, \delta \, x_i$$
The next section will demonstrate how to perform many updates to fully optimize a neural network.
Data Cleanup and Splitting
The data must be cleaned up and transformed before it can be used. This cleanup consists of two steps:
- Standardize GRE and GPA, which makes them have a zero mean and a standard deviation of 1.
- Encode the rank categorical variable (1, 2, 3, 4) using one-hot encoding: (1, 0, 0, 0), (0, 1, 0, 0), …
import pandas as pd
admissions = pd.read_csv('gradient-descent-implementations/data/data.csv')
print('raw admissions: ' + str(admissions.shape[0]) + ' rows')
admissions.head()
raw admissions: 400 rows
| | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.00 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |
# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')],
                 axis=1)
data = data.drop(columns=['rank'])

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:, field] = (data[field] - mean) / std
print('data: ' + str(data.shape[0]) + ' rows')
data.head()
data: 400 rows
| | admit | gre | gpa | rank_1 | rank_2 | rank_3 | rank_4 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | -1.798011 | 0.578348 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0.625884 | 0.736008 | 0 | 0 | 1 | 0 |
| 2 | 1 | 1.837832 | 1.603135 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0.452749 | -0.525269 | 0 | 0 | 0 | 1 |
| 4 | 0 | -0.586063 | -1.208461 | 0 | 0 | 0 | 1 |
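As a quick check (a sketch, not part of the original notebook), the standardized gre and gpa columns should now have a mean of roughly 0 and a standard deviation of roughly 1:

```python
# Verify the standardization: means should be ~0 and standard deviations ~1
print(data[['gre', 'gpa']].mean().round(6))
print(data[['gre', 'gpa']].std().round(6))
```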
# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.iloc[sample], data.drop(sample)
# Split into features and targets
features, targets = data.drop(columns=['admit']), data['admit']
features_test, targets_test = test_data.drop(columns=['admit']), test_data['admit']
print(' features: ' + str(features.shape[0]) + ' rows')
print('features_test: ' + str(features_test.shape[0]) + ' rows')
features.head()
features: 360 rows
features_test: 40 rows
| | gre | gpa | rank_1 | rank_2 | rank_3 | rank_4 |
|---|---|---|---|---|---|---|
| 209 | -0.066657 | 0.289305 | 0 | 1 | 0 | 0 |
| 280 | 0.625884 | 1.445476 | 0 | 1 | 0 | 0 |
| 33 | 1.837832 | 1.603135 | 0 | 0 | 1 | 0 |
| 210 | 1.318426 | -0.131120 | 0 | 0 | 0 | 1 |
| 93 | -0.066657 | -1.208461 | 0 | 1 | 0 | 0 |
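As a small sanity check (illustrative, not in the original), the training and test rows should not overlap and should together cover all 400 records:

```python
# The sampled (training) rows and the dropped (test) rows should be disjoint
assert set(data.index).isdisjoint(test_data.index)
assert len(data) + len(test_data) == 400
```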
Mean Squared Error
Instead of using the Sum of Squared Errors (SSE), we will use the Mean Squared Error (MSE). With larger datasets, summing up all the weight steps can lead to very large updates that make the gradient descent diverge; compensating for that would require a very small learning rate.
Instead of using such a small learning rate, we can simply divide by the number of records in the data, $m$, so the size of each update does not grow with the dataset:

$$E = \frac{1}{2m}\sum_{\mu}\left(y^{\mu} - \hat{y}^{\mu}\right)^{2}$$
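As a minimal sketch with made-up numbers (not from the original), the two error measures differ only in whether the squared errors are summed or averaged over the $m$ records:

```python
import numpy as np

y_true = np.array([1, 0, 1, 0])          # hypothetical targets
y_hat = np.array([0.8, 0.3, 0.6, 0.2])   # hypothetical network outputs

sse = 0.5 * np.sum((y_true - y_hat) ** 2)   # Sum of Squared Errors grows with m
mse = 0.5 * np.mean((y_true - y_hat) ** 2)  # Mean Squared Error = SSE / m
print(sse, mse)
```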
General algorithm for updating the weights with gradient descent
- Set the weight step to zero: $\Delta w_i = 0$
- For each record in the training data:
  - Make a forward pass through the network, calculating the output: $\hat{y} = f\!\left(\sum_i w_i x_i\right)$
  - Calculate the error term for the output unit: $\delta = (y - \hat{y}) \, f'\!\left(\sum_i w_i x_i\right)$
  - Update the weight step: $\Delta w_i = \Delta w_i + \delta x_i$
- Update the weights: $w_i = w_i + \frac{\eta \, \Delta w_i}{m}$, where $\eta$ is the learning rate and $m$ is the number of records. Averaging the weight steps reduces the effect of any large variations in the training data.
- Repeat for $e$ epochs.
An alternative approach would be to update the weights on each record instead of averaging the weight steps after going through all the records.
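A minimal sketch of that per-record variant (an illustration reusing names defined elsewhere on this page, not code from the original) would apply each weight step immediately instead of accumulating it in del_w:

```python
# Hypothetical per-record (stochastic) update: apply the step after every record
for e in range(epochs):
    for x, y in zip(features.values, targets):
        output = sigmoid(np.dot(weights, x))                # forward pass
        error_term = (y - output) * output * (1 - output)   # error term for the output unit
        weights += learnrate * error_term * x                # update the weights immediately
```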
Recall that we are using the sigmoid as the activation function,

$$f(h) = \frac{1}{1 + e^{-h}}$$

and that the gradient of the sigmoid is

$$f'(h) = f(h)\,\bigl(1 - f(h)\bigr)$$
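Because the forward pass already produces $f(h)$, the training loop below forms this gradient directly from the network output instead of calling sigmoid_prime again; in isolation the relevant lines look like this (x and y being one record's inputs and target):

```python
# Mirrors the loop in the implementation below: f'(h) is built from the output itself
output = sigmoid(np.dot(weights, x))
error_term = (y - output) * output * (1 - output)
```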
Implementation Notes
The first step in the implementation is to initialize the weights. These need to be:
- small, so they are in the linear region near zero and not at the high and low ends,
- random, so they have different starting values and diverge.
A normal distribution centered at 0 with a scale of $1/\sqrt{n}$, where $n$ is the number of input units, satisfies both of these requirements.
n_records, n_features = features.shape
print('n_records, n_features: {}, {}'.format(n_records, n_features))
weights = np.random.normal(scale=1/n_features**.5, size=n_features)
n_records, n_features: 360, 6
The input to the output layer is calculated with NumPy's dot() function, which multiplies the two arrays element-wise (1st element with 1st element, 2nd with 2nd, and so on) and then sums the products.
# input to the output layer
output_in = np.dot(weights, features.iloc[0])
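To make that multiply-then-sum behaviour concrete, a tiny illustration with made-up numbers (not part of the original page):

```python
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
# Element-wise products are summed: 1*0.5 + 2*(-1.0) + 3*2.0 = 4.5
print(np.dot(a, b), (a * b).sum())  # both give 4.5
```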
Implementing Gradient Descent
Setup
# DataFrames used to record the weights and weight steps at each epoch for plotting
weights_df, delta_weights_df = (pd.DataFrame(columns=features.columns),
                                pd.DataFrame(columns=features.columns))
def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))
# Use the same seed to make debugging easier
np.random.seed(42)
n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
# Neural Network hyperparameters
epochs = 5001
learnrate = 0.1
Gradient Descent
for e in range(epochs):
    del_w = np.zeros(weights.shape)

    # Loop through all records, x is the input, y is the target
    for x, y in zip(features.values, targets):
        output = sigmoid(np.dot(weights, x))
        error = y - output
        error_term = error * output * (1 - output)
        del_w += error_term * x
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (int(epochs / 10)) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Epoch {:4} - Train loss: {:1.5f} WARNING - Loss Increasing"
                  .format(e, loss))
        else:
            print("Epoch {:4} - Train loss: {:1.5f}".format(e, loss))
        last_loss = loss

    # Save weights and weight change for plotting later
    weights_df = (weights_df
                  .append(
                      pd.DataFrame(weights.reshape(1, -1),
                                   columns=features.columns)))
    delta_weights_df = (delta_weights_df
                        .append(
                            pd.DataFrame(del_w.reshape(1, -1),
                                         columns=features.columns)))
Epoch 0 - Train loss: 0.26390
Epoch 500 - Train loss: 0.20948
Epoch 1000 - Train loss: 0.20090
Epoch 1500 - Train loss: 0.19864
Epoch 2000 - Train loss: 0.19781
Epoch 2500 - Train loss: 0.19743
Epoch 3000 - Train loss: 0.19724
Epoch 3500 - Train loss: 0.19713
Epoch 4000 - Train loss: 0.19707
Epoch 4500 - Train loss: 0.19703
Epoch 5000 - Train loss: 0.19701
Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("\nPrediction accuracy: {:2.1%}".format(accuracy))
Prediction accuracy: 72.5%
Visualize
import matplotlib.pyplot as plt
weights_df = weights_df.reset_index(drop=True)
delta_weights_df = delta_weights_df.reset_index(drop=True)
plt.figure(figsize=[11, 8.5])
for col in weights_df.columns:
    plt.plot(weights_df.index,
             weights_df[[col]],
             label=col);
plt.title('Network Weights', fontsize=14)
plt.xlabel('Epoch', fontsize=12)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.legend(
    fancybox=True,
    loc=9,
    ncol=6,
    facecolor='white',
    edgecolor='white',
    framealpha=1,
    fontsize=12);
plt.figure(figsize=[11, 8.5])
for col in delta_weights_df.columns:
    plt.plot(delta_weights_df.index,
             delta_weights_df[[col]],
             label=col);
plt.title('Weight Change', fontsize=14)
plt.xlabel('Epoch', fontsize=12)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.legend(
    fancybox=True,
    loc=9,
    ncol=6,
    facecolor='white',
    edgecolor='white',
    framealpha=1,
    fontsize=12);