PyTorch is a framework for building and training neural networks. In many ways, PyTorch tensors behave like arrays from Numpy; Numpy arrays, after all, are just tensors. PyTorch takes these tensors and makes it simple to move them to GPUs for the faster processing needed when training neural networks. It also provides a module that automatically calculates gradients (for backpropagation!) and another module specifically for building neural networks.

Altogether, PyTorch ends up being more coherent with Python and the Numpy/Scipy stack than TensorFlow and other frameworks.

Neural Networks

Deep Learning is based on artificial neural networks which have been around in some form since the late 1950s. The networks are built from individual parts approximating neurons, typically called units or simply “neurons.” Each unit has some number of weighted inputs. These weighted inputs are summed together (a linear combination) then passed through an activation function to get the unit’s output.

Mathematically this looks like:

$$ \begin{align} y &= f(w_1 x_1 + w_2 x_2 + b) \cr y &= f\left(\sum_i w_i x_i +b \right) \end{align} $$

With vectors this is the dot/inner product of two vectors:

$$ h = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \cdot \begin{bmatrix} w_1 \cr w_2 \cr \vdots \cr w_n \end{bmatrix} $$
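To make the formula concrete, here is a small worked example in plain Python. The input, weight, and bias values are made up for illustration, and the sigmoid is just one common choice for the activation function f.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one common choice for the activation f."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values (assumed for this sketch, not taken from the text)
x = [1.0, 2.0]      # inputs x_1, x_2
w = [0.5, -0.25]    # weights w_1, w_2
b = 0.0             # bias

# Weighted sum written as an explicit sum over i, then the bias is added
z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
y = sigmoid(z)

print(z)  # 0.0
print(y)  # 0.5
```

Here the two weighted inputs cancel (0.5 and -0.5), so the unit sees a total input of zero and the sigmoid returns its midpoint, 0.5.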


It turns out neural network computations are just a bunch of linear algebra operations on tensors, a generalization of matrices. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three indices is a 3-dimensional tensor (RGB color images, for example). The fundamental data structure for neural networks is the tensor, and PyTorch (as well as pretty much every other deep learning framework) is built around tensors.
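As a quick sketch of what those dimensions look like in PyTorch, the snippet below builds one tensor of each kind and prints its number of dimensions and shape. The sizes, and the (channel, height, width) layout for the image, are assumptions chosen for illustration.

```python
import torch

vector = torch.zeros(5)          # 1-dimensional: one index
matrix = torch.zeros(3, 4)       # 2-dimensional: row and column
image = torch.zeros(3, 32, 32)   # 3-dimensional: channel, height, width (assumed layout)

for name, t in [("vector", vector), ("matrix", matrix), ("image", image)]:
    print(name, t.ndim, tuple(t.shape))
```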

With the basics covered, it’s time to explore how we can use PyTorch to build a simple neural network.

import torch  # PyTorch

def activation(x):
    """Sigmoid activation function.

    x: torch.Tensor
    """
    return 1 / (1 + torch.exp(-x))

Set the random seed so things are predictable. The features and weights are each 5 random normal variables. The bias is a single value.

torch.manual_seed(7)  # 7 is an assumed seed value; any fixed integer makes the run repeatable
features = torch.randn((1, 5))
weights = torch.randn_like(features)
bias = torch.randn((1, 1))

# Print each tensor's name, shape, and values
for name, tensor in [('features', features), ('weights', weights), ('bias', bias)]:
    print(f'{name}:\n\t{tensor.shape}\n\t{tensor}')

features:
	torch.Size([1, 5])
	tensor([[-0.1468,  0.7861,  0.9468, -1.1143,  1.6908]])
weights:
	torch.Size([1, 5])
	tensor([[-0.8948, -0.3556,  1.2324,  0.1382, -1.6822]])
bias:
	torch.Size([1, 1])

Above I generated data we can use to get the output of our simple network. This is all just random for now; going forward we’ll start using real data. Going through each relevant line:

features = torch.randn((1, 5)) creates a tensor with shape (1, 5), one row and five columns, that contains values randomly distributed according to the normal distribution with a mean of zero and standard deviation of one.

weights = torch.randn_like(features) creates another tensor with the same shape as features, again containing values from a normal distribution.

Finally, bias = torch.randn((1, 1)) creates a single value from a normal distribution.
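Here is a minimal sketch of why the seed matters: resetting it to the same value reproduces the same random draws. The seed value 0 is arbitrary and chosen only for this example.

```python
import torch

torch.manual_seed(0)                   # arbitrary seed chosen for this sketch
first = torch.randn((1, 5))

torch.manual_seed(0)                   # reset to the same seed
second = torch.randn((1, 5))

same_shape = torch.randn_like(first)   # new random values, same shape as `first`

print(torch.equal(first, second))      # True: same seed, same draws
print(same_shape.shape)                # torch.Size([1, 5])
```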

PyTorch tensors can be added, multiplied, subtracted, etc., just like Numpy arrays. In general, PyTorch tensors can be used pretty much the same way you’d use Numpy arrays. They come with some nice benefits, though, such as GPU acceleration. For now, use the generated data to calculate the output of this simple single-layer network.
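As a small illustration of both points, the sketch below does some elementwise arithmetic and then moves a tensor to a GPU if one happens to be available. The values are made up, and the fallback to the CPU is an assumption so the snippet runs on any machine.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

print(a + b)   # elementwise addition
print(a * b)   # elementwise multiplication
print(b - a)   # elementwise subtraction

# Move to a GPU only if one is available; otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a_dev = a.to(device)
print(a_dev.device)
```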

Calculate the output of this network using the weights and bias tensors

output_layer_input = (features * weights).sum() + bias
output = activation(output_layer_input)

# Print each tensor's name, shape, and values
for name, tensor in [('output_layer_input', output_layer_input), ('output', output)]:
    print(f'{name}:\n\t{tensor.shape}\n\t{tensor}')

output_layer_input:
	torch.Size([1, 1])
output:
	torch.Size([1, 1])

You can do the multiplication and sum in the same operation using a matrix multiplication. In general, it is better to use matrix multiplications since they are more efficient and accelerated using modern libraries and high-performance computing on GPUs.

Here, we want to do a matrix multiplication of the features and the weights. For this we can use torch.mm() or torch.matmul(); the latter is somewhat more complicated and supports broadcasting. If we try to do it with features and weights as they are, we’ll get an error:

>> torch.mm(features, weights)

RuntimeError                              Traceback (most recent call last)
<ipython-input-13-15d592eb5279> in <module>()
----> 1 torch.mm(features, weights)

RuntimeError: size mismatch, m1: [1 x 5], m2: [1 x 5] at ...

As you’re building neural networks in any framework, you’ll see this often. What’s happening is that our tensors aren’t the correct shapes to perform a matrix multiplication. Remember that for matrix multiplications, the number of columns in the first tensor must equal the number of rows in the second tensor. Both features and weights have the same shape, (1, 5), so the first tensor has 5 columns while the second has only 1 row. This means we need to change the shape of weights to get the matrix multiplication to work.
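The shape rule can be checked directly. In this sketch, weights_col is a hypothetical tensor created with the right shape from the start, purely to show a multiplication that works next to one that doesn't.

```python
import torch

features = torch.randn((1, 5))
weights = torch.randn((1, 5))
weights_col = torch.randn((5, 1))   # hypothetical: already has 5 rows

# (1, 5) x (5, 1): 5 columns meet 5 rows, so the result has shape (1, 1)
ok = torch.mm(features, weights_col)
print(ok.shape)

# (1, 5) x (1, 5): 5 columns meet 1 row, so PyTorch raises a RuntimeError
try:
    torch.mm(features, weights)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("RuntimeError:", err)
```

The exact wording of the error message depends on the PyTorch version, which is why the sketch prints whatever message it gets rather than matching on it.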

Note: To see the shape of a tensor called tensor, use tensor.shape. If you’re building neural networks, you’ll be using this attribute often.

There are a few options here: weights.reshape(), weights.resize_(), and weights.view().

  • weights.reshape(a, b) returns a tensor of size (a, b) with the same data as weights. Sometimes this is a view of the original data, and sometimes it is a clone, meaning the data is copied to another part of memory.

  • weights.resize_(a, b) returns the same tensor with a different shape.

    • If the new shape results in fewer elements than the original tensor, some elements will be removed from the tensor (but not from memory).
    • If the new shape results in more elements than the original tensor, new elements will be uninitialized in memory.

  • weights.view(a, b) returns a new tensor of size (a, b) that shares the same data as weights.

Here I should note that the underscore at the end of the method denotes that this method is performed in-place. Here is a great forum thread to read more about in-place operations in PyTorch.
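To see the difference between sharing and copying, here is a rough sketch with a small stand-in tensor (the values 0 through 5 are just for illustration). It also shows the trailing-underscore convention using add() and add_(), picked here only as a simple example pair.

```python
import torch

weights = torch.arange(6.0)            # tensor([0., 1., 2., 3., 4., 5.])

# view() shares storage: writing through the view changes the original
as_matrix = weights.view(2, 3)
as_matrix[0, 0] = 100.0
print(weights[0])                                    # tensor(100.)
print(as_matrix.data_ptr() == weights.data_ptr())    # True

# A transposed tensor is not contiguous, so view() cannot flatten it...
transposed = as_matrix.t()
try:
    transposed.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)                                   # True

# ...while reshape() falls back to copying the data
flat_copy = transposed.reshape(6)
print(flat_copy.data_ptr() != weights.data_ptr())    # True

# Trailing underscore: add_() modifies in place, add() returns a new tensor
x = torch.ones(3)
y = x.add(1)    # x is unchanged, y is all twos
x.add_(1)       # x itself is now all twos
print(torch.equal(x, y))                             # True
```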

output_layer_input = torch.mm(features, weights.view(5, 1)) + bias  # mm already sums the products
output = activation(output_layer_input)
print(f'output:\n\t{output.shape}\n\t{output}')

output:
	torch.Size([1, 1])

Stack them up!

That’s how you can calculate the output for a single neuron. The real power of this algorithm happens when you start stacking these individual units into layers and stacks of layers, into a network of neurons. The output of one layer of neurons becomes the input for the next layer. With multiple input units and output units, we now need to express the weights as a matrix.

The first layer holds the inputs and is, understandably, called the input layer. The middle layer is called the hidden layer, and the final layer is the output layer. We can express this network mathematically with matrices again and use matrix multiplication to get linear combinations for each unit in one operation. For example, the hidden layer ($h_1$ and $h_2$ here) can be calculated as

$$ \vec{h} = [h_1 , h_2] = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \cdot \begin{bmatrix} w_{11} & w_{12} \cr w_{21} & w_{22} \cr \vdots & \vdots \cr w_{n1} & w_{n2} \end{bmatrix} $$
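The matrix form above can be checked against the one-unit-at-a-time version. This is only a sketch: the sizes (3 inputs, 2 hidden units) and the seed are assumptions, and like the equation it computes the linear combinations before any activation is applied.

```python
import torch

torch.manual_seed(0)        # arbitrary seed for this sketch
x = torch.randn((1, 3))     # one sample with n = 3 inputs
W = torch.randn((3, 2))     # column j holds the weights into hidden unit j

h = torch.mm(x, W)          # both linear combinations at once, shape (1, 2)

# The same values computed one hidden unit at a time
h1 = (x[0] * W[:, 0]).sum()
h2 = (x[0] * W[:, 1]).sum()
per_unit = torch.stack([h1, h2]).unsqueeze(0)   # shape (1, 2)

print(h.shape)
print(torch.allclose(h, per_unit))
```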

The output for this small network is found by treating the hidden layer as inputs for the output unit. The network output is expressed simply as

$$ y = f_2 \left(f_1 \left(\vec{x} \mathbf{W_1} \right) \mathbf{W_2} \right) $$

Set the random seed so the following is predictable. The features are 3 random normal variables, the hidden layer has 2 units, and the output is a single node.

torch.manual_seed(7)  # 7 is an assumed seed value; any fixed integer makes the run repeatable
features = torch.randn((1, 3))
n_input = features.shape[1]     # Number of input units, matches number of input features
n_hidden = 2                    # Number of hidden units
n_output = 1                    # Number of output units

W1 = torch.randn(n_input, n_hidden)  # Weights for inputs to hidden layer
W2 = torch.randn(n_hidden, n_output) # Weights for hidden layer to output layer

B1 = torch.randn((1, n_hidden)) # Bias terms for hidden and output layers
B2 = torch.randn((1, n_output))

# Print each tensor's name, shape, and values
for name, tensor in [('features', features), ('W1', W1), ('W2', W2), ('B1', B1), ('B2', B2)]:
    print(f'{name}:\n\t{tensor.shape}\n\t{tensor}')

features:
	torch.Size([1, 3])
	tensor([[-0.1468,  0.7861,  0.9468]])
W1:
	torch.Size([3, 2])
	tensor([[-1.1143,  1.6908],
        [-0.8948, -0.3556],
        [ 1.2324,  0.1382]])
W2:
	torch.Size([2, 1])
	tensor([[-1.6822],
        [ 0.3177]])
B1:
	torch.Size([1, 2])
	tensor([[0.1328, 0.1373]])
B2:
	torch.Size([1, 1])

Calculate the output of the network.

hidden = activation(torch.mm(features, W1) + B1)
output = activation(torch.mm(hidden, W2) + B2)

# Print each tensor's name, shape, and values
for name, tensor in [('hidden', hidden), ('output', output)]:
    print(f'{name}:\n\t{tensor.shape}\n\t{tensor}')

hidden:
	torch.Size([1, 2])
	tensor([[0.6813, 0.4355]])
output:
	torch.Size([1, 1])

The number of hidden units is a parameter of the network, often called a hyperparameter to differentiate it from the weight and bias parameters. As you’ll see later when we discuss training a neural network, the more hidden units and layers a network has, the better able it generally is to learn from data and make accurate predictions.
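To show what changing this hyperparameter does to the network itself, here is a rough sketch. The forward helper is hypothetical (it is not part of PyTorch) and draws fresh random weights on every call, so only the shapes and parameter counts are meaningful, not the output values.

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def forward(features, n_hidden, n_output=1):
    """Hypothetical helper: one forward pass with freshly drawn random parameters."""
    n_input = features.shape[1]
    W1 = torch.randn(n_input, n_hidden)
    B1 = torch.randn((1, n_hidden))
    W2 = torch.randn(n_hidden, n_output)
    B2 = torch.randn((1, n_output))
    hidden = sigmoid(torch.mm(features, W1) + B1)
    output = sigmoid(torch.mm(hidden, W2) + B2)
    n_params = W1.numel() + B1.numel() + W2.numel() + B2.numel()
    return output, n_params

features = torch.randn((1, 3))
results = []
for n_hidden in (2, 8, 32):
    output, n_params = forward(features, n_hidden)
    results.append((n_hidden, tuple(output.shape), n_params))
    print(n_hidden, tuple(output.shape), n_params)
```

The output shape stays (1, 1) no matter how many hidden units there are; what grows is the number of weights and biases available for training to adjust (11, 41, and 161 here).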

Numpy to Torch and back

PyTorch has a great feature for converting between Numpy arrays and Torch tensors. To create a tensor from a Numpy array, use torch.from_numpy(). To convert a tensor to a Numpy array, use the .numpy() method.

import numpy as np
a = np.random.rand(4,3)
a
array([[0.97449152, 0.68652824, 0.51496251],
       [0.46857383, 0.2098441 , 0.58842778],
       [0.76876154, 0.35381711, 0.12912342],
       [0.69358891, 0.89462071, 0.86406838]])

b = torch.from_numpy(a)
b
tensor([[0.9745, 0.6865, 0.5150],
        [0.4686, 0.2098, 0.5884],
        [0.7688, 0.3538, 0.1291],
        [0.6936, 0.8946, 0.8641]], dtype=torch.float64)

b.numpy()
array([[0.97449152, 0.68652824, 0.51496251],
       [0.46857383, 0.2098441 , 0.58842778],
       [0.76876154, 0.35381711, 0.12912342],
       [0.69358891, 0.89462071, 0.86406838]])

The memory is shared between the Numpy array and Torch tensor, so if you change the values in-place of one object, the other will change as well.

# Multiply PyTorch Tensor by 2, in place
b.mul_(2)
tensor([[1.9490, 1.3731, 1.0299],
        [0.9371, 0.4197, 1.1769],
        [1.5375, 0.7076, 0.2582],
        [1.3872, 1.7892, 1.7281]], dtype=torch.float64)

# Numpy array matches new values from Tensor
a
array([[1.94898303, 1.37305649, 1.02992502],
       [0.93714766, 0.4196882 , 1.17685556],
       [1.53752308, 0.70763422, 0.25824684],
       [1.38717781, 1.78924142, 1.72813676]])
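If you don't want the two objects tied together, one option is to make a copy instead. The sketch below contrasts torch.from_numpy(), which shares memory, with torch.tensor(), which copies the data. The small array of ones is just an illustrative stand-in.

```python
import numpy as np
import torch

a = np.ones((2, 2))
shared = torch.from_numpy(a)   # shares memory with `a`
copied = torch.tensor(a)       # makes an independent copy

shared.mul_(2)                 # in place: also changes `a`
print(a)                       # every entry is now 2.0
print(copied)                  # still all ones

print(np.shares_memory(a, shared.numpy()))   # True
```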