# Essentials

This “PyTorch essentials” page has notes from most of the other sections of the Deep Learning with PyTorch section of the notes page. The other pages are modified versions of scripts provided by Udacity and can be verbose. This page strips out the commentary from those pages leaving just the essential information.

#### Background

PyTorch is an open source library for building deep learning models. It was developed by the Facebook AI Research team, was first released in early 2017, and has made a big impact on the deep learning community.

The following topics are covered below and in the linked pages:

• Tensors, the main data structure of PyTorch. The notes will cover how to create tensors, use them to do simple operations, and how tensors and NumPy interact.
• Autograd, a module PyTorch uses to calculate gradients for training neural networks. It does all the work of backpropagation.
• Network Validation, a process to test that the network is able to generalize to new data.
• Transfer Learning, a way to use pre-trained networks to improve the performance of a classifier on more complex images.
• GPU-accelerated network computations. PyTorch uses a library called CUDA for this.

On Linux, PyTorch and related libraries can be installed with Anaconda:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch

## 1. Tensors

#### Tensors

The inputs to a neural network and the weights the network contains can be conceptualized as vectors. The output of a single unit is then the dot product of those vectors plus the bias term, passed through an activation function.

Tensors are a generalization of vectors and matrices. Equivalently, vectors and matrices are specific types of tensor. Vectors are generally 1D, for example length 6. A matrix is generally 2D, for example, 6 by 4 or [6,4]. A 3D tensor (for example, 6 by 4 by 3 or [6,4,3]) could be used to model RGB color images.

Tensors are the base data structure used in PyTorch and other neural network frameworks. Tensorflow, for example, is named after tensors.
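As a small illustration of the shapes mentioned above, the following sketch builds a 1D, a 2D, and a 3D tensor. The variable names are made up for this example.

```python
import torch

vector = torch.zeros(6)            # 1D tensor of length 6
matrix = torch.zeros(6, 4)         # 2D tensor: 6 rows by 4 columns
image_like = torch.zeros(6, 4, 3)  # 3D tensor, e.g. height x width x RGB channels

for name, t in [("vector", vector), ("matrix", matrix), ("image_like", image_like)]:
    # dim() is the number of dimensions, shape lists the size of each one
    print(name, t.dim(), tuple(t.shape))
```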

#### PyTorch Basics

• The sigmoid activation function squeezes its input to a value between 0 and 1, and is useful for calculating probabilities.
• torch.randn((1, 5)) returns a vector of random, normal values with the specified shape.
• torch.randn_like(features) returns a tensor of random, normal values with the shape of the tensor provided as the input.
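A minimal sketch of how these pieces fit together for a single unit. The sigmoid is written out by hand as `activation`, and the seed, sizes, and variable names are assumptions chosen for the example.

```python
import torch

def activation(x):
    """Sigmoid activation: squeezes x into the range (0, 1)."""
    return 1 / (1 + torch.exp(-x))

torch.manual_seed(7)                  # make the random values reproducible
features = torch.randn((1, 5))        # one example with 5 features
weights = torch.randn_like(features)  # one weight per feature, same shape as features
bias = torch.randn((1, 1))            # a single bias term

# Weighted sum of the inputs plus the bias, passed through the activation
y = activation(torch.sum(features * weights) + bias)
print(y.shape)
```

The result keeps the `(1, 1)` shape of the bias because the scalar sum is broadcast against it.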

In order to calculate the output of a simple neural network, we could use a line like the following:

y = activation(torch.sum(features * weights) + bias)

y = activation((features * weights).sum() + bias)


The lines above used element-wise multiplication followed by taking the sum of those values. An alternative approach would be to use matrix multiplication instead. In general, this is the preferred way to perform multiplication because it is more efficient. The linear algebra that is being performed has been accelerated using modern libraries such as CUDA that run on GPUs.

There are two key methods for matrix multiplication:

• torch.mm() - the preferred method, because it strictly enforces matrix size rules. Recall that in matrix multiplication the first matrix must have a number of columns that’s equal to the number of rows in the second matrix. Remember that tensor.shape returns the number of rows and columns a given tensor contains.
• torch.matmul() - more complex. It also accepts 1D vectors and batched inputs and broadcasts the leading dimensions, so a shape mistake can produce an unexpected result instead of an error.
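The difference between the two methods can be sketched as follows. The tensor names and sizes are arbitrary choices for the example.

```python
import torch

a = torch.randn(1, 5)
b = torch.randn(1, 5)

# torch.mm insists on (n, m) x (m, p); two (1, 5) tensors do not line up.
try:
    torch.mm(a, b)
    mm_failed = False
except RuntimeError:
    mm_failed = True

# Transposing b gives (1, 5) x (5, 1), which is a valid product.
product = torch.mm(a, b.t())

# torch.matmul broadcasts leading (batch) dimensions:
# the single (5, 2) matrix is applied to each of the 3 matrices in the batch.
batch = torch.randn(3, 4, 5)
shared = torch.randn(5, 2)
batched = torch.matmul(batch, shared)

print(mm_failed, tuple(product.shape), tuple(batched.shape))
```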

The following calculates the output of a multi-layer network using weights W1 and W2 and biases B1 and B2. The output of the hidden layer is stored as h, and the features tensor is assumed to be available as input.

h = activation(torch.mm(features, W1) + B1)
output = activation(torch.mm(h, W2) + B2)
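For the two lines above to run, the weights and biases need compatible shapes. A minimal sketch, assuming 3 input features, 2 hidden units, 1 output unit, and `torch.sigmoid` as the activation:

```python
import torch

torch.manual_seed(7)
features = torch.randn((1, 3))  # one example with 3 input features

n_input, n_hidden, n_output = features.shape[1], 2, 1

W1 = torch.randn(n_input, n_hidden)   # input -> hidden weights
W2 = torch.randn(n_hidden, n_output)  # hidden -> output weights
B1 = torch.randn((1, n_hidden))       # hidden layer bias
B2 = torch.randn((1, n_output))       # output layer bias

activation = torch.sigmoid

h = activation(torch.mm(features, W1) + B1)  # shape (1, 2)
output = activation(torch.mm(h, W2) + B2)    # shape (1, 1)
print(tuple(h.shape), tuple(output.shape))
```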


There are three key methods for reshaping tensors:

• tensor.reshape(a, b) - sometimes returns a clone of the original tensor, which means copying the original tensor to another part of memory.
• tensor.resize_(a, b) - returns the same tensor (not a clone) with a different shape. The underscore (_) denotes that the method is performed in-place. If the new shape results in fewer or more elements than the original, then some elements will be removed or added to the tensor. If they are added, the new elements will be uninitialized in memory.
• tensor.view(a, b) - returns a new tensor with the same data as the input tensor. This is the preferred method.
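The "same data" behavior of view, and the case where reshape has to copy, can be checked directly. This is a small sketch with made-up variable names:

```python
import torch

t = torch.arange(6)  # tensor([0, 1, 2, 3, 4, 5])
v = t.view(2, 3)     # new shape, same underlying memory
v[0, 0] = 100        # writing through the view also changes t
shares_memory = t.data_ptr() == v.data_ptr()

# A transposed tensor is not contiguous in memory, so view refuses to flatten it...
nc = v.t()
try:
    nc.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# ...while reshape falls back to copying the data into new memory.
copied = nc.reshape(6)
is_copy = copied.data_ptr() != t.data_ptr()
print(t, shares_memory, view_failed, is_copy)
```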

Here’s an example of calculating the output of a network that uses the view method:

y = activation(torch.mm(features, weights.view(5,1)) + bias)


PyTorch has a great feature for converting between Numpy arrays and Torch tensors. To create a tensor from a Numpy array, use torch.from_numpy(). To convert a tensor to a Numpy array, use .numpy().
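One detail worth knowing is that the array and the tensor share memory after conversion, so an in-place change to one shows up in the other. A short sketch:

```python
import numpy as np
import torch

a = np.arange(4, dtype=np.float32)  # array([0., 1., 2., 3.])
t = torch.from_numpy(a)             # tensor backed by the same memory as a

t.mul_(2)                           # in-place multiply on the tensor...
print(a)                            # ...is visible in the NumPy array too

back = t.numpy()                    # convert back, still without copying
```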

## 2. Neural Network Architecture

#### Torchvision and MNIST

• MNIST is a famous dataset that has greyscale handwritten digits, each of which is 28x28 pixels.
• MNIST is one of many datasets provided as part of the torchvision package. This package sits alongside PyTorch and provides utilities, datasets, and models for doing computer vision problems.
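
import torch
from torchvision import datasets

# transform is a torchvision transform pipeline defined beforehand
trainset = datasets.MNIST('MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)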


The images and labels can be extracted by iterating over trainloader, as follows. The batch_size of the loader is set to 64, which is the number of images returned in one iteration. Each batch of images is therefore a tensor with size (64, 1, 28, 28): 64 images per batch, 1 color channel, 28x28 pixel images.

for images, labels in trainloader:
    # Do things with images, labels
    pass


An important application of the view method is shown below. This converts each 28x28 pixel image in a batch to a 784-element vector. Note that images.shape[0] returns the batch size, and the -1 means that view infers what the second dimension needs to be. So, this is a quick way of “flattening” a tensor.

inputs = images.view(images.shape[0], -1)
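To try the flattening step without downloading MNIST, a random tensor with the same shape as one batch can stand in for the images. The random data is an assumption made purely for illustration.

```python
import torch

# Stand-in for one MNIST batch: 64 images, 1 channel, 28x28 pixels
images = torch.randn(64, 1, 28, 28)

# Keep the batch dimension, let view infer the rest (1 * 28 * 28 = 784)
inputs = images.view(images.shape[0], -1)
print(tuple(inputs.shape))
```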


#### Softmax

We want the output of our network to be a probability distribution over our classes for each image. For this, we use the softmax function, which converts each output of the network to a value in the range (0,1). It also normalizes the values so that the probabilities it returns for one image sum to one.

The function to return the softmax is defined below.

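def softmax(x):
    # Exponentiate, then divide each row by its sum so that every row adds up to one
    return torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)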
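A quick way to check a softmax implementation is to confirm that every row sums to one and that it agrees with PyTorch's built-in `torch.softmax`. The sketch below uses random scores as a stand-in for network outputs and writes the function with `keepdim=True`, which is one possible formulation.

```python
import torch

def softmax(x):
    """Row-wise softmax for a (batch, classes) tensor."""
    exp_x = torch.exp(x)
    # keepdim=True keeps the sum as a column so it divides each row correctly
    return exp_x / exp_x.sum(dim=1, keepdim=True)

torch.manual_seed(0)
scores = torch.randn(64, 10)         # stand-in for the raw outputs of a network
probabilities = softmax(scores)
row_sums = probabilities.sum(dim=1)  # one value per example
print(tuple(probabilities.shape), row_sums[:3])
```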


#### nn

PyTorch has a module, nn, that has built in classes, methods, and functions to build large neural networks in a very efficient way. An example is shown below.

from torch import nn

class Network(nn.Module):
    def __init__(self):
        super().__init__()

        # Inputs to hidden layer linear transformation
        self.hidden = nn.Linear(784, 256)
        # Output layer, 10 units - one for each digit
        self.output = nn.Linear(256, 10)

        # Define sigmoid activation and softmax output
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden(x)
        x = self.sigmoid(x)
        x = self.output(x)
        x = self.softmax(x)

        return x


An alternative way of defining the network is to use the torch.nn.functional module, commonly imported as F as shown below.

import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs to hidden layer linear transformation
        self.hidden = nn.Linear(784, 256)
        # Output layer, 10 units - one for each digit
        self.output = nn.Linear(256, 10)

    def forward(self, x):
        # Hidden layer with sigmoid activation
        x = F.sigmoid(self.hidden(x))
        # Output layer with softmax activation
        x = F.softmax(self.output(x), dim=1)

        return x


In general, any function can be used as an activation function. The only requirement is that for a network to approximate a non-linear function, the activation functions must be non-linear. A few examples of common activation functions are Sigmoid, TanH (hyperbolic tangent), and ReLU (rectified linear unit).

In practice, the ReLU function is used almost exclusively as the activation function for hidden layers. ReLU is the simplest nonlinear function that can be used and networks tend to train much more quickly when using it compared to sigmoid and hyperbolic tangent.

## 3. Training Neural Networks

Training a neural network requires a loss function, sometimes also referred to as a “cost.” The loss function is a measure of how far the network’s actual prediction is from the correct prediction. An example of a loss function is the mean square error, which is commonly used in regression.

The network’s weights are iteratively adjusted to reduce the network’s error via a method called gradient descent. The gradient is the slope of the loss function with respect to our parameters. In multilayered neural networks, we use backpropagation to calculate it. Backpropagation is an application of the chain rule from calculus that enables us to make adjustments throughout the entire network. The amount of change during each update step is controlled by a learning rate.
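The update rule itself is simple enough to show on a toy problem. The sketch below minimizes a made-up one-parameter loss, (w - 3)^2, using its analytic gradient; the loss, starting point, and learning rate are all assumptions for illustration, not part of the course material.

```python
def loss_fn(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess for the parameter
learning_rate = 0.1  # controls the size of each update step
history = []

for step in range(100):
    history.append(loss_fn(w))
    # Step against the gradient, scaled by the learning rate
    w = w - learning_rate * gradient(w)

print(w, history[0], history[-1])
```

Each step moves w a fraction of the way toward the minimum, so the recorded loss shrinks on every iteration.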

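In practice, the loss function used when doing classification is usually the cross-entropy loss: nn.CrossEntropyLoss. By convention, it is assigned to a variable named criterion. Note that nn.CrossEntropyLoss expects the raw network outputs (logits) rather than softmax probabilities, because it applies log-softmax internally.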

from torch import nn

# Build a feed-forward network
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10))

# Define the loss
criterion = nn.CrossEntropyLoss()

# Get our data
images, labels = next(iter(trainloader))
# Flatten images
images = images.view(images.shape[0], -1)

# Forward pass, get our logits
logits = model(images)
# Calculate the loss with the logits and the labels
loss = criterion(logits, labels)


In the instructor’s experience, it is more convenient to build the model with a log-softmax output using nn.LogSoftmax or F.log_softmax. The actual probabilities can be calculated by taking the exponential torch.exp(output). With a log-softmax output, use the negative log likelihood loss, nn.NLLLoss.

An example of this setup is shown below. The dim=1 parameter passed to nn.LogSoftmax means the softmax is calculated across the class scores of each example (each row of the output), not down the batch dimension.

# Build a feed-forward network using LogSoftmax
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

# Define the loss
criterion = nn.NLLLoss()

# Get our data
images, labels = next(iter(trainloader))
# Flatten images
images = images.view(images.shape[0], -1)

# Forward pass, get our log-probabilities
logps = model(images)
# Calculate the loss with the log-probabilities and the labels
loss = criterion(logps, labels)
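The two setups are numerically equivalent: cross-entropy on raw logits gives the same loss as log-softmax followed by the negative log likelihood loss. A small sketch with random logits and labels (stand-ins for real network outputs) shows this:

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(64, 10)          # stand-in for raw network outputs
labels = torch.randint(0, 10, (64,))  # stand-in for class labels

# Option 1: cross-entropy directly on the logits
ce_loss = nn.CrossEntropyLoss()(logits, labels)

# Option 2: log-softmax output followed by negative log likelihood loss
log_ps = nn.LogSoftmax(dim=1)(logits)
nll_loss = nn.NLLLoss()(log_ps, labels)

# Exponentiating the log-probabilities recovers the actual probabilities
ps = torch.exp(log_ps)
print(ce_loss.item(), nll_loss.item())
```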


PyTorch has a module called autograd that is used for automatically calculating the gradients of tensors. To use autograd, pass the requires_grad=True parameter during tensor initialization. This tells PyTorch to keep track of operations on a tensor so that the gradient can be efficiently calculated.

x = torch.zeros(1, requires_grad=True)


Gradients can also be turned off using the torch.no_grad() context:

with torch.no_grad():
    # Something with gradients turned off
    pass


The gradients of the loss are calculated when we call the backward pass with the following command. Once calculated, we can use these gradients in gradient descent to train the network.

loss.backward()
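To see what backward() produces, here is a small sketch on a tensor where the gradient is easy to verify by hand. The function y = mean(x^2) is an arbitrary choice; its derivative with respect to each element is 2x/3 for a three-element tensor.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).mean()  # autograd records these operations
y.backward()         # computes dy/dx and stores it in x.grad

expected = 2 * x.detach() / 3  # analytic gradient for comparison
print(x.grad, expected)

# Inside no_grad, operations are not tracked
with torch.no_grad():
    z = x * 2
print(z.requires_grad)
```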


The next necessary piece before training is to see how to use those gradients to actually update our weights. For that, we use optimizers. These come from PyTorch’s optim package.

from torch import optim

# Optimizers require the parameters to optimize and a learning rate
optimizer = optim.SGD(model.parameters(), lr=0.01)


Note that PyTorch, by default, accumulates gradients. This means that if you do multiple forward and backward passes, it will keep summing up those gradients. They must be cleared with optimizer.zero_grad() before each new backward pass.
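The accumulation is easy to observe on a tiny model. In this sketch (a made-up linear layer and random inputs), calling backward twice without clearing doubles the stored gradient:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(4, 3)

model(inputs).sum().backward()
first = model.weight.grad.clone()

# Second backward pass without zero_grad: the new gradient is added to the old one
model(inputs).sum().backward()
second = model.weight.grad.clone()

# Clearing resets the gradients (to None or to zeros, depending on the PyTorch version)
optimizer.zero_grad()
cleared = model.weight.grad
print(first, second, cleared)
```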

The general process for a single learning step with PyTorch:

1. Make a forward pass through the network
2. Use the network output to calculate the loss
3. Perform a backward pass through the network with loss.backward() to calculate the gradients
4. Take a step with the optimizer to update the weights

The code itself follows.

# Clear the gradients, do this because gradients are accumulated
optimizer.zero_grad()

# Forward pass, then backward pass, then update weights
output = model(images)
loss = criterion(output, labels)
loss.backward()
# Take an update step
optimizer.step()
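The single learning step can be exercised end to end without a dataset. The sketch below repeats it on one fixed batch of random data (an assumption for illustration; a real run would pull batches from a dataloader) and records the loss after each step.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
images = torch.randn(64, 784)         # stand-in for a flattened MNIST batch
labels = torch.randint(0, 10, (64,))  # stand-in for the digit labels

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

losses = []
for step in range(20):
    optimizer.zero_grad()             # clear accumulated gradients
    output = model(images)            # forward pass
    loss = criterion(output, labels)  # calculate the loss
    loss.backward()                   # backward pass
    optimizer.step()                  # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Because the same batch is reused, the loss should fall as the network fits it; this only checks that the mechanics work, not that the model generalizes.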


Training the network entails looping through the training data, pulling a new batch, and then performing the steps outlined above. Each pass through the dataset is referred to as an epoch. Training that is working correctly will have a steadily decreasing running loss.

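epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)

        # Clear the gradients accumulated from the previous batch
        optimizer.zero_grad()

        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    else:
        # Runs once the inner loop has gone through every batch
        print(f"Training loss: {running_loss/len(trainloader)}")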


## 4. Classifying Fashion MNIST

The MNIST dataset of handwritten digits is a fairly trivial dataset these days. It is easy to get really high accuracy on it with a neural network. A more complex alternative is the Fashion-MNIST dataset. It contains 28x28 greyscale images of clothes, so it is a drop-in replacement for MNIST.

The dataset is loaded through torchvision:

import torch
from torchvision import datasets, transforms
import helper

# Define a transform to normalize the data
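transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
trainset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Download and load the test data
testset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)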


A possible class architecture is shown below: three hidden layers and one output layer. Note that the input tensor is flattened in the forward pass, so it is not necessary to flatten it in the training loop.

from torch import nn, optim
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)

        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.log_softmax(self.fc4(x), dim=1)

        return x


Now, create the network and train it.

• The criterion is the NLLLoss function since the output is log_softmax.
• The Adam optimizer is used, which is like stochastic gradient descent except:
  • it uses momentum, which speeds up the fitting process, and
  • it adjusts the learning rate for each of the individual parameters in your model.

Note that if you pass the images into model as shown below, it will run the forward method.

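model = Classifier()
criterion = nn.NLLLoss()
# Adam with its default learning rate; pass lr=... to tune it
optimizer = optim.Adam(model.parameters())

epochs = 5

for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Clear the gradients accumulated from the previous batch
        optimizer.zero_grad()

        log_ps = model(images)
        loss = criterion(log_ps, labels)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")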


## 5. Inference and Validation

### Inference

Inference is a term used to describe using a model to make predictions about new data. Neural Networks, like most machine learning algorithms, have a tendency to overfit the data they have been trained on. This means the network finds correlations and patterns that are in the training set but not in the overall dataset.

To test for overfitting, we measure the performance of the network on data that isn’t in the training set. This data is called the validation set or test set.

• To get the training set, pass the parameter train=True when downloading the dataset from torchvision.
• To get the test set, pass the parameter train=False when downloading the dataset from torchvision.

The actual definition of performance on the validation set is up to the developer. Examples include:

• Accuracy: percent predicted correctly.
• Precision
• Recall
• Top-5 Error Rate

To get the predictions, we use the model to perform a forward pass on the test set. Then, convert to probabilities using torch.exp and find the most likely class using ps.topk(1). The prediction is whichever class has the highest probability. This one highest class is the value the network is predicting, and it is contained in top_class in the snippet below.

images, labels = next(iter(testloader))
# Flatten MNIST images into a 784 long vector
images = images.view(images.shape[0], -1)
# Get the class probabilities
ps = torch.exp(model(images))
# Find the most likely class
top_p, top_class = ps.topk(1, dim=1)


### Validation

Now, we can compare this top_class against the true labels using an equality. The view method below is used to ensure that the labels have the same shape as the top_class.

equals = top_class == labels.view(*top_class.shape)
accuracy = torch.mean(equals.type(torch.FloatTensor))


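equals is now a tensor with the same shape as top_class, (64, 1) for a batch of 64, containing True for correct predictions and False for incorrect predictions (1 and 0 in older PyTorch versions, where it is a byte tensor). The accuracy, then, is just the mean of the values in equals. Since torch.mean is not defined for boolean or byte tensors, .type(torch.FloatTensor) first converts equals to a float tensor.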
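The accuracy calculation can be traced by hand on a tiny batch. The probabilities and labels below are made up; three of the four predictions are correct. The last lines show why the view matters: without it, the comparison broadcasts to a 4x4 tensor instead of comparing element by element.

```python
import torch

# Made-up class probabilities for 4 examples and 3 classes
ps = torch.tensor([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.2, 0.5, 0.3]])
labels = torch.tensor([1, 0, 1, 1])

top_p, top_class = ps.topk(1, dim=1)  # top_class has shape (4, 1)
equals = top_class == labels.view(*top_class.shape)
accuracy = torch.mean(equals.type(torch.FloatTensor))
print(top_class.flatten().tolist(), accuracy.item())

# Pitfall: comparing (4, 1) against (4,) broadcasts to (4, 4)
wrong = top_class == labels
print(tuple(wrong.shape))
```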

The following is the actual validation loop. In the validation pass, we are not updating the network parameters, so we can speed up the code by turning off gradients with torch.no_grad().

epochs, steps = 30, 0

for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        optimizer.zero_grad()

        log_ps = model(images)
        loss = criterion(log_ps, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    else:
        test_loss, accuracy = 0, 0

        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for images, labels in testloader:
                log_ps = model(images)
                test_loss += criterion(log_ps, labels).item()

                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

        # Epoch, training loss, test loss, test accuracy (averaged over batches)
        print("\t{:5}\t{:10.3}\t{:9.3}\t{:8.3}".format(e+1,
              running_loss/len(trainloader),
              test_loss/len(testloader),
              accuracy/len(testloader)))


### Regularization

We also use regularization techniques such as dropout to prevent overfitting in the first place. Dropout involves randomly dropping input units between the layers. This forces the network to share information between the weights and spread the learning more evenly throughout the network. The effect of this is to help the network to generalize to new data. The following snippet shows how to add dropout to a model.

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)

        # Dropout module with 0.2 drop probability
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)

        # Now with dropout
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.dropout(F.relu(self.fc3(x)))

        # output so no dropout here
        x = F.log_softmax(self.fc4(x), dim=1)

        return x


Note that when making predictions with the network (“doing inference”), we need to have all of the nodes available, so we need to turn off dropout. This is accomplished with model.eval():

# turn off gradients
with torch.no_grad():
    # set model to evaluation mode
    model.eval()
    # validation pass goes here

# set model back to train mode
model.train()
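The effect of the two modes can be seen on a standalone dropout layer. In this sketch (input of ones, chosen so the output is easy to read), training mode zeroes roughly 20% of the values and scales the rest by 1 / (1 - 0.2) = 1.25, while evaluation mode passes the input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.2)
x = torch.ones(1000)

dropout.train()          # training mode: dropout is active
train_out = dropout(x)
dropped_fraction = (train_out == 0).float().mean().item()

dropout.eval()           # evaluation mode: dropout does nothing
eval_out = dropout(x)

print(dropped_fraction, torch.equal(eval_out, x))
```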


The validation code looks similar except that the model.eval() and model.train() lines are added.

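epochs, steps = 30, 0

for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        optimizer.zero_grad()

        log_ps = model(images)
        loss = criterion(log_ps, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    else:
        test_loss, accuracy = 0, 0

        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            # Evaluation mode turns dropout off
            model.eval()
            for images, labels in testloader:
                log_ps = model(images)
                test_loss += criterion(log_ps, labels).item()

                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

        # Back to training mode so dropout is active again
        model.train()
        # Epoch, training loss, test loss, test accuracy (averaged over batches)
        print("\t{:5}\t{:10.3}\t{:9.3}\t{:8.3}".format(e+1,
              running_loss/len(trainloader),
              test_loss/len(testloader),
              accuracy/len(testloader)))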


## 6. Saving and Loading Models

Saving and loading models is useful when you want to train a model and come back to it later, either to make predictions or to continue training.

The way we save a model is by saving the state_dict. This is a dictionary that contains all of the parameters for the model (weights, bias tensors, etc). In the example below, the state dict keys are printed out.

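import torch
from torch import nn

# Network is the custom fully-connected model class used in these notes (defined elsewhere)
model = Network(784, 10, [512, 256, 128])
criterion = nn.NLLLoss()

# Assume training completed

print(model)
for key in list(model.state_dict().keys()):
    print(key)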

Network(
(hidden_layers): ModuleList(
(0): Linear(in_features=784, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=128, bias=True)
)
(output): Linear(in_features=128, out_features=10, bias=True)
(dropout): Dropout(p=0.5, inplace=False)
)
hidden_layers.0.weight
hidden_layers.0.bias
hidden_layers.1.weight
hidden_layers.1.bias
hidden_layers.2.weight
hidden_layers.2.bias
output.weight
output.bias


The simplest way to save the model is to save the state dict with torch.save. For example, we can save it to a file checkpoint.pth. .pth is the typical extension for PyTorch checkpoint files.

torch.save(model.state_dict(), 'checkpoint.pth')


Load the state dict with torch.load, then load it into the model with model.load_state_dict:

state_dict = torch.load('checkpoint.pth')
model.load_state_dict(state_dict)
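A self-contained round trip might look like the sketch below. It uses a single `nn.Linear` layer as a stand-in model and a temporary directory instead of a fixed path, both of which are assumptions for the example. The last part shows what happens when the architecture does not match the saved parameters.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.pth')
    torch.save(model.state_dict(), path)
    state_dict = torch.load(path)

# A fresh model with the same architecture accepts the saved parameters
restored = nn.Linear(4, 2)
restored.load_state_dict(state_dict)

# A model with a different shape does not
try:
    nn.Linear(4, 3).load_state_dict(state_dict)
    mismatch_failed = False
except RuntimeError:
    mismatch_failed = True

print(list(state_dict.keys()), mismatch_failed)
```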


The model is now ready to be used.

Note that in order to use model.load_state_dict, the model must be set up with the exact parameter setup contained in checkpoint.pth. We need to rebuild the model exactly as it was when it was initially trained. A workaround is to save information about the model architecture in the checkpoint along with the state dict. The following snippets show an example of how to accomplish this.

checkpoint = {'input_size': 784,
              'output_size': 10,
              'hidden_layers': [each.out_features for each in model.hidden_layers],
              'state_dict': model.state_dict()}
torch.save(checkpoint, 'saving-models/checkpoint.pth')


Note that this load_checkpoint function is going to be based on the architecture of the model that you write. So, the function below will not work for every model that we create. This function may need to be customized.

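def load_checkpoint(filepath):
    # Read the dictionary saved with torch.save
    checkpoint = torch.load(filepath)
    # Rebuild the model with the saved architecture
    model = Network(checkpoint['input_size'],
                    checkpoint['output_size'],
                    checkpoint['hidden_layers'])
    # Restore the trained parameters
    model.load_state_dict(checkpoint['state_dict'])

    return model

model = load_checkpoint('saving-models/checkpoint.pth')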
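The Network class itself is not shown in these notes. The sketch below is a hypothetical reconstruction that matches the printed layer names (hidden_layers, output, dropout), so that the checkpoint round trip can be run end to end. The class body, the smaller layer sizes, and the temporary file path are all assumptions.

```python
import os
import tempfile

import torch
import torch.nn.functional as F
from torch import nn

class Network(nn.Module):
    """Hypothetical fully-connected network with configurable hidden layers."""
    def __init__(self, input_size, output_size, hidden_layers, drop_p=0.5):
        super().__init__()
        sizes = [input_size] + list(hidden_layers)
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])])
        self.output = nn.Linear(sizes[-1], output_size)
        self.dropout = nn.Dropout(p=drop_p)

    def forward(self, x):
        for layer in self.hidden_layers:
            x = self.dropout(F.relu(layer(x)))
        return F.log_softmax(self.output(x), dim=1)

def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model = Network(checkpoint['input_size'],
                    checkpoint['output_size'],
                    checkpoint['hidden_layers'])
    model.load_state_dict(checkpoint['state_dict'])
    return model

original = Network(20, 4, [16, 8])  # small sizes keep the example fast
checkpoint = {'input_size': 20,
              'output_size': 4,
              'hidden_layers': [each.out_features for each in original.hidden_layers],
              'state_dict': original.state_dict()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.pth')
    torch.save(checkpoint, path)
    rebuilt = load_checkpoint(path)

print(rebuilt)
```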


## 7. Loading Image Data

The techniques in this section are useful for working with image data that comes from the real world, unlike the carefully preprocessed MNIST and Fashion-MNIST datasets. A dataset of real-world cat and dog photos from Kaggle will be used.

The datasets.ImageFolder class is the simplest way to load such a dataset. It takes a path to a directory of photos that it expects to be organized with each class (in this case, cat and dog) having its own directory, as shown below. It also takes a transform (usually a pipeline of several transforms) that is applied to the data.

root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png


The transforms must be combined into a pipeline of transforms with transforms.Compose(). There are many transforms available. Read through the documentation.

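• Resize: resizes the image. With a single integer, the shorter side is scaled to that size and the aspect ratio is kept.
• CenterCrop: crops the center of the image to the given size, trimming the outside.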
• ToTensor: converts the image to a tensor.

Once the images are opened and transformed, they can be passed to a dataloader. Setting shuffle to True is the recommended practice for training data. Note that the dataloader is an iterable, so to get data out of it you need to either:

• loop through it with a for loop or
• convert it to an iterator: next(iter(dataloader))

# Looping through it, get a batch on each loop
for images, labels in dataloader:
    pass

# Get one batch
images, labels = next(iter(dataloader))
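Both access patterns can be tried without any image files by wrapping random tensors in a `TensorDataset`. The dataset contents and sizes here are stand-ins chosen for the example; with 100 items and a batch size of 32, the last batch holds the remaining 4.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 fake single-channel 28x28 images with binary labels
dataset = TensorDataset(torch.randn(100, 1, 28, 28),
                        torch.randint(0, 2, (100,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Looping through it, get a batch on each loop
batch_sizes = [images.shape[0] for images, labels in dataloader]

# Get one batch
images, labels = next(iter(dataloader))
print(batch_sizes, tuple(images.shape), tuple(labels.shape))
```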


A common strategy is to also use data augmentation on the training set. Data augmentation is a means of introducing randomness into the data itself. This helps the network generalize because it’s seeing the same images but in different locations, sizes, orientations, etc. This ultimately leads to better accuracy on the validation tests.

Note that data augmentation should not be applied to the test set. The test set should just be resized and center cropped. This is so that the validation images will be similar to the images that are expected when the model is in production.

• RandomRotation: randomly rotates the image by up to X degrees.
• RandomResizedCrop: crops a random region of the image and resizes it to X pixels square.
• RandomHorizontalFlip: randomly flips the image horizontally.

data_dir = 'path/to/data/Cat_Dog_data'

train_transforms = transforms.Compose([transforms.RandomRotation(30),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize([0.5, 0.5, 0.5],
[0.5, 0.5, 0.5])])
test_transforms = transforms.Compose([transforms.Resize(255),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.5, 0.5, 0.5],
[0.5, 0.5, 0.5])])

train_data = datasets.ImageFolder(data_dir + '/train',
transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test',
transform=test_transforms)

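trainloader = torch.utils.data.DataLoader(train_data, batch_size=32)
testloader = torch.utils.data.DataLoader(test_data, batch_size=32)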


## 8. Transfer Learning

Transfer learning involves adapting a pre-trained neural network to a new, different data set.

Udacity provides a set of heuristics to help determine the best way to approach transfer learning based on the size of the new dataset and how similar it is to the original dataset.

The models that will be used for transfer learning in this section are trained on ImageNet. The models themselves are available from torchvision. ImageNet is a massive dataset with over 1 million labeled images in 1000 categories.

The networks trained on ImageNet are deep neural networks that have an architecture that includes convolutional layers. Convolutional networks exploit patterns and regularities in images.

The available models are listed in the torchvision documentation and usually have the number of layers they contain included as part of their name. Deeper models tend to be both more accurate and slower to both predict and to train, so the tradeoffs between accuracy and speed must be carefully considered. These models' depth also means they work very well as feature detectors on images they were not trained on.

To include these models, use the following import.

from torchvision import models


Most of these pretrained models require the input to be 224x224 images. Additionally, we need to match the color normalization used when the models were trained. Each color channel was normalized separately: the means are [0.485, 0.456, 0.406] and the standard deviations are [0.229, 0.224, 0.225].

data_dir = 'path/to/data/Cat_Dog_data'

train_transforms = transforms.Compose([transforms.RandomRotation(30),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406],
[0.229, 0.224, 0.225])])

test_transforms = transforms.Compose([transforms.Resize(255),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406],
[0.229, 0.224, 0.225])])

# Pass transforms in here, then run the next cell to see how the transforms look
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)



For this example, DenseNet121 will be used:

model = models.densenet121(pretrained=True)


The model is made from two main parts: the features and the classifier. The features will be frozen and not touched, and the classifier will be replaced.

# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False

from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 500)),
    ('relu', nn.ReLU()),
    ('fc2', nn.Linear(500, 2)),
    ('output', nn.LogSoftmax(dim=1))
]))

model.classifier = classifier
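The freeze-then-replace pattern can be checked on a small model without downloading pretrained weights. In this sketch, `TinyPretrained` is a hypothetical stand-in with the same features/classifier split; after freezing and swapping the classifier, only the new classifier's parameters remain trainable, and only those are handed to the optimizer.

```python
import torch
from torch import nn, optim

class TinyPretrained(nn.Module):
    """Hypothetical stand-in for a pretrained model with features + classifier."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.classifier = nn.Linear(16, 4)

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyPretrained()

# Freeze every existing parameter
for param in model.parameters():
    param.requires_grad = False

# New modules are created with requires_grad=True, so the replacement is trainable
model.classifier = nn.Sequential(nn.Linear(16, 2), nn.LogSoftmax(dim=1))

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Only the classifier parameters are passed to the optimizer
optimizer = optim.Adam(model.classifier.parameters(), lr=0.003)
output = model(torch.randn(5, 8))
```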


Now, the network consists of the features from DenseNet121 and an untrained classifier. Since the model is so deep, training on it will take a very long time if run on the CPU. Training the network on a GPU, instead, can lead to 100x increased training speeds.

In PyTorch, it is easy to move a model to the GPU using model.cuda(). It is important to make sure that the images are also on the GPU if the models are. This can be done with images.cuda(). Move them to local memory and the CPU with model.cpu() and images.cpu().

The following is device-agnostic code that checks whether the computer being used has a GPU and uses it if one is available.

# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
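A runnable version of the same pattern, using a small `nn.Linear` layer and random inputs as stand-ins for a real model and data, might look like this. It runs on the GPU when one is present and on the CPU otherwise.

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)     # stand-in model, moved to the chosen device
inputs = torch.randn(3, 4).to(device)  # the data must live on the same device

output = model(inputs)
print(device, output.device, tuple(output.shape))
```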


For a complete example of training the deep model, see my detailed Transfer Learning notes.

## Deep Learning Tips and Tricks from Udacity

#### Watch those shapes

In general, you’ll want to check that the tensors going through your model and other code are the correct shapes. Make use of the .shape attribute during development and debugging.

#### A few things to check if your network isn’t training appropriately

Check that you’re clearing the gradients in the training loop with optimizer.zero_grad(). When running a validation loop, be sure to set the network to evaluation mode with model.eval(), then back to training mode with model.train().

#### CUDA errors

Sometimes you’ll see this error:

RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #1 ‘mat1’

The second type is torch.cuda.FloatTensor, which means it’s a tensor that has been moved to the GPU. The operation is expecting a tensor with type torch.FloatTensor (no .cuda there), which means a tensor on the CPU. PyTorch can only perform operations on tensors that are on the same device. If you’re trying to run your network on the GPU, check that you’ve moved the model and all necessary tensors to the GPU with .to(device), where device is either "cuda" or "cpu".