9 – PyTorch V2 Part 3 Solution V2

Hi and welcome back. Here’s my solution for this model that uses a LogSoftmax output. It’s pretty similar to what I built before with nn.Sequential. So, we just use a linear transformation, ReLU, linear transformation, ReLU, another linear transformation for the output, and then we pass this to our LogSoftmax module. What I’m doing here is making sure I set the dimension to one for LogSoftmax, and this makes it so that it calculates the function across the columns instead of the rows. If you remember, the rows correspond to our examples. We have a batch of examples that we’re passing to our network and each row is one of those examples. So, we want to make sure that we’re computing the softmax function across each of our examples and not across each individual feature in our batches.
Here, I’m just defining our loss, or criterion, as the negative log likelihood loss, and again we get our images and labels from our train loader, flatten the images, and pass them through our model to get the output. This is actually not the logits anymore, it’s the log probabilities, so we call it logps, and then we pass that and the labels into our criterion. There you go. You see we get our nice loss.
Now, we know how to calculate a loss, but how do we actually use it to perform backpropagation? PyTorch actually has this really great module called autograd that automatically calculates the gradients of our tensors. The way it works is that PyTorch will keep track of all the operations you do on a tensor, and then you can tell it to do a backward pass, to go backwards through each of those operations and calculate the gradients with respect to the input parameters. In general, you need to tell PyTorch that you want to use autograd on a specific tensor. So, in this case, you would create some tensor like x = torch.zeros(1, requires_grad=True), with a size of one just to make it a scalar.
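As a minimal sketch of that flag (the names y and w are only for illustration and are not from the notebook), this creates one tracked and one untracked tensor and checks what autograd records:

```python
import torch

# A tensor created with requires_grad=True is tracked by autograd.
x = torch.zeros(1, requires_grad=True)

# The result of an operation on a tracked tensor is tracked too,
# and it remembers the operation that produced it in grad_fn.
y = x * 2

# Without the flag, a freshly created tensor is not tracked.
w = torch.zeros(1)

print(x.requires_grad, y.requires_grad, w.requires_grad)
print(y.grad_fn)
```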
So, this tells PyTorch to track the operations on this tensor x, so that if you want to get the gradient then it will calculate it for you. In general, if you’re creating a tensor and you don’t want to calculate the gradient for it, you want to make sure this is set to False. You can also use the context torch.no_grad() to make sure all the gradients are shut off for all of the operations that you’re doing while you’re in this context. Then, you can also turn gradients on or off globally with torch.set_grad_enabled and give it True or False, depending on what you want to do.
So, the way this works in PyTorch is that you basically create your tensor, and again, you set requires_grad=True, and then you just perform some operations on it. Then, once you are done with those operations, you call .backward(). So, if you use this tensor x to calculate some other tensor z, then if you do z.backward(), it’ll go backwards through your operations and calculate the total gradient for x.
So, for example, I can create this random two-by-two tensor and then square it like this. You can actually see what autograd does if you look at y, which is our squared tensor. If you look at y.grad_fn, it shows us that this grad function is a power. So, PyTorch tracked this, and it knows that the last operation done was a power operation. Now, we can take the mean of y and get another tensor z. This is just a scalar tensor: y is a two-by-two array, and we take the mean of it to get z.
The gradients for a tensor show up in its grad attribute, so we can actually look at what the gradient of our tensor x is right now. We’ve only done the forward pass, we haven’t actually calculated the gradient yet, so it’s just None. So, now if we do z.backward(), it’s going to go backwards through this tiny little set of operations that we’ve done.
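The small example being walked through here can be sketched roughly as follows. The seed and the names grad_before and untracked are my own additions for illustration. Since z is the mean of the four squared entries, each entry of the gradient should be x divided by two, which the sketch checks after the backward pass.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable

# Random 2x2 tensor that autograd should track.
x = torch.randn(2, 2, requires_grad=True)

y = x ** 2            # squared tensor
print(y.grad_fn)      # the node autograd recorded for the power operation

z = y.mean()          # reduce the 2x2 array to a scalar

grad_before = x.grad  # only a forward pass so far, so nothing is stored yet
print(grad_before)

z.backward()          # walk back through mean and power

# z = (1/4) * sum(x_i ** 2), so dz/dx_i = x_i / 2
print(x.grad)
print(x / 2)

# Inside no_grad, operations are not tracked.
with torch.no_grad():
    untracked = x * 2
print(untracked.requires_grad)
```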
So, we did a power and then a mean, and this goes backwards through those and calculates the gradient for x. If you actually work out the math, you find that the gradient of z with respect to x should be x over two, and if we look at the gradient and also look at x divided by two, they are the same. So, our gradient is equal to what it should be mathematically, and this is the general process for working with gradients and autograd in PyTorch.
Why this is useful is because we can use it to get our gradients when we calculate the loss. If you remember, our loss depends on our weight and bias parameters. We need the gradients of our weights to do gradient descent. So, what we can do is set up our weights as tensors that require gradients and then do a forward pass to calculate our loss. With the loss, you do a backward pass, which calculates the gradients for your weights, and then with those gradients, you can do your gradient descent step.
Now, I’ll show you how that looks in code. Here, I’m defining our model like I did before with a LogSoftmax output, then using the negative log likelihood loss. We get our images and labels from our train loader, flatten the images, and then we can get our log probabilities from our model and pass them into our criterion, which gives us the actual loss. Now, if we look at our model’s weights, model[0] gives us the parameters for the first linear transformation. So, we can look at the weight and then at its gradient, then we do our backward pass starting from the loss, and then we can look at the weight gradients again. We see that before the backward pass we don’t have any, because we haven’t actually calculated them yet, but after the backward pass we have calculated our gradients. So, we can use these gradients in gradient descent to train our network.
All right. So, now you know how to calculate losses and you know how to use those losses to calculate gradients.
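Here is a rough sketch of that before-and-after check. The transcript does not give the layer sizes or the batch, so the sizes 784, 128, 64 and 10 and the random tensors standing in for a batch from the train loader are assumptions made only so the snippet runs on its own.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed layer sizes; the transcript only names the layer types.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
    nn.LogSoftmax(dim=1),  # normalize across the columns of each row
)
criterion = nn.NLLLoss()

# Random stand-ins for one batch of images and labels.
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

# Flatten each image into a vector of 784 values.
images = images.view(images.shape[0], -1)

logps = model(images)            # log probabilities, not logits
loss = criterion(logps, labels)

grad_before = model[0].weight.grad   # nothing computed yet
loss.backward()
grad_after = model[0].weight.grad    # same shape as the weight

print("Before backward pass:", grad_before)
print("After backward pass:", grad_after.shape)
```

Because the output is a log probability, exponentiating each row and summing should give a value close to one for every example.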
So, there’s one piece left before we can start training. You need to see how to use those gradients to actually update our weights, and for that we use optimizers, and these come from PyTorch’s optim package. So, for example, we can use stochastic gradient descent with optim.SGD. The way this is defined is we import the module optim from PyTorch and then we say optim.SGD and give it our model parameters. These are the parameters that we want this optimizer to actually update. Then we give it a learning rate, and this creates our optimizer for us.
So, the training pass consists of four different steps. First, we’re going to make a forward pass through the network. Then we’re going to use that network output to calculate the loss. Then we’ll perform a backward pass through the network with loss.backward(), and this will calculate the gradients. Then, we’ll make a step with our optimizer that updates the weights. I’ll show you how this works with one training step, and then you’re going to write it up for real in a loop that is going to train a network.
So, first, we’re going to start by getting our images and labels like we normally do from our train loader, and then we’re going to flatten the images. Next, what we want to do is actually clear the gradients. PyTorch by default accumulates gradients. That means that if you do multiple forward passes and multiple backward passes, and you keep calculating your gradient, it’s going to keep summing up those gradients. So, if you don’t clear the gradients, then you’re going to be getting gradients from the previous training step in your current training step, and you’re going to end up with a network that is just not training properly. So, in general, you’re going to be calling zero_grad before every training pass.
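To make the accumulation concrete, here is a small sketch that is not from the notebook. It assumes a tiny hypothetical nn.Linear model, random inputs, and an arbitrary learning rate, and it never takes an optimizer step, so the weights stay fixed and the gradients from each pass are directly comparable.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

model = nn.Linear(4, 2)                             # tiny hypothetical model
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr is an arbitrary choice
inputs = torch.randn(8, 4)

# First backward pass: gradients are stored on the parameters.
model(inputs).sum().backward()
first = model.weight.grad.clone()

# Second backward pass without clearing: the new gradients are added on top.
model(inputs).sum().backward()
accumulated = model.weight.grad.clone()

# Clear the gradients, then run the same pass again.
optimizer.zero_grad()
model(inputs).sum().backward()
fresh = model.weight.grad.clone()

print(torch.allclose(accumulated, 2 * first))  # summed over two passes
print(torch.allclose(fresh, first))            # back to a single pass
```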
So, you just say optimizer.zero_grad() and this will clean out all the gradients on all the parameters in your optimizer, and it’ll allow you to train appropriately. This is one of the things in PyTorch that is easy to forget, but it’s really important. So, try your hardest to remember to do this part, and then we do our forward pass, backward pass, then update the weights. So, we do a forward pass through our model with our images to get our output, then we calculate the loss using the output of the model and our labels, then we do a backward pass, and then finally we take an optimizer step. If we look at our initial weights, they look like this, and then we can calculate our gradient, and the gradient looks like this. Then, if we take an optimizer step and update our weights, our weights have changed.
So, in general, the way this works is you’re going to be looping through your training set, and for each batch out of your training set, you’ll do the same training pass. You’ll get your data, then clear the gradients, pass those images or your input through your network to get your output, calculate your loss from that and the labels, and then do a backward pass on the loss and update your weights.
So, now, it’s your turn to implement the training loop for this model. The idea here is that we’re going to be looping through our dataset, grabbing images and labels from the train loader, and then on each of those batches, you’ll be doing the training pass. So, you’ll do this pass where you calculate the output of the network, calculate the loss, do a backward pass on the loss, and then update your weights. Each pass through the entire training set is called an epoch, and here I just have it set for five epochs. You can change this number if you want to run more or fewer. Once you calculate the loss, we can accumulate it to keep track of it, so we’re going to be looking at a loss.
So, this is the running loss, and we’ll be printing out the training loss as it’s going along. If it’s working, you should see the loss start dropping as you’re going through the data. Try this out yourself, and if you need some help, be sure to check out my solution. Cheers.
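For orientation only, the overall shape of such a loop might look like the sketch below. This is not the solution from the notebook. It assumes a synthetic random dataset in place of the real train loader, assumed layer sizes, and an arbitrary learning rate of 0.003, so the printed losses say nothing about how the real model trains.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the real training data (an assumption).
features = torch.randn(512, 1, 28, 28)
targets = torch.randint(0, 10, (512,))
trainloader = DataLoader(TensorDataset(features, targets),
                         batch_size=64, shuffle=True)

# Assumed layer sizes; the transcript only names the layer types.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10), nn.LogSoftmax(dim=1),
)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)  # lr is an assumption

epochs = 5
epoch_losses = []
for e in range(epochs):
    running_loss = 0.0
    for images, labels in trainloader:
        # Flatten each image into a vector of 784 values.
        images = images.view(images.shape[0], -1)

        optimizer.zero_grad()             # clear gradients from the last step
        output = model(images)            # forward pass
        loss = criterion(output, labels)  # calculate the loss
        loss.backward()                   # backward pass
        optimizer.step()                  # update the weights

        running_loss += loss.item()

    # Average loss per batch for this epoch.
    epoch_losses.append(running_loss / len(trainloader))
    print(f"Training loss: {epoch_losses[-1]}")
```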