Hello, everyone, and welcome back. In this video and in this notebook, I’ll be showing you how to actually train neural networks in PyTorch. Previously, we saw how to define neural networks in PyTorch using the nn module, but now we’re going to see how we actually take one of these networks that we defined and train it. What I mean by training is that we’re going to use our neural network as a universal function approximator. What that means is that, for basically any function, we have some desired input, for example an image of the number four, and then we have some desired output of this function, in this case a probability distribution that tells us the probabilities of the various digits. So, if we pass in an image of a four, we want to get out a probability distribution where most of the probability is on the digit four. The cool thing about neural networks is that if you use non-linear activations and you have a dataset of these images labeled with the correct digits, then you pass in an image and the correct output, the correct label or class, and eventually your neural network will learn to approximate this function that converts these images into this probability distribution, and that’s our goal here. So, basically we want to see how, in PyTorch, we can build a neural network, give it the inputs and outputs, and then adjust the weights of that network so that it approximates this function. The first thing that we need for that is what is called a loss function. It’s sometimes also called the cost, and it’s a measure of our prediction error. So, if we pass in the image of a four and our network predicts something else, that’s an error. We want to measure how far away our network’s prediction is from the correct label, and we do that using a loss function. In this case, it’s the mean squared error.
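To make the idea of a loss as a measure of prediction error a bit more concrete, here is a minimal sketch of the mean squared error in plain Python. The predicted probabilities are made-up numbers for illustration, not the output of any real network, and the helper name `mean_squared_error` is just a name chosen for this example.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical network output for an image of a four:
# one probability per digit, 0 through 9 (made-up values that sum to 1).
predicted = [0.05, 0.05, 0.05, 0.05, 0.40, 0.10, 0.10, 0.05, 0.10, 0.05]

# Desired output: all of the probability on the digit four.
target = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

loss = mean_squared_error(predicted, target)       # positive: the prediction is off
perfect_loss = mean_squared_error(target, target)  # zero: the prediction is exact
print(loss, perfect_loss)
```

The further the predicted distribution is from the target, the larger this number gets, and it only reaches zero when the prediction matches the label exactly.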
So, a lot of times you’ll use this in regression problems, but you’ll use other loss functions in classification problems like this one here. The loss depends on the output of our network, the predictions our network is making, and the output of the network depends on the weights, so the network parameters. That means we can adjust our weights such that this loss is minimized, and once the loss is minimized, we know that our network is making as good predictions as it can. So, this is the whole goal, to adjust our network parameters to minimize our loss, and we do this by using a process called gradient descent. The gradient is the slope of the loss function with respect to our parameters. The gradient always points in the direction of fastest increase. So, for example, if you have a mountain, the gradient is always going to point up the mountain. You can imagine our loss function being like this mountain, where we have a high loss up here and a low loss down here. We want to get to the minimum of our loss, so we want to go downwards. The gradient points upwards, so we just go in the opposite direction, the direction of the negative gradient, and if we keep following this down, then eventually we get to the bottom of this mountain, the lowest loss. With multilayered neural networks, we use an algorithm called backpropagation to do this. Backpropagation is really just an application of the chain rule from calculus. If you think about it, when we pass some data, some input, into our network, it goes through this forward pass through the network to calculate our loss.
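Walking down the mountain can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the loss is a made-up one-parameter bowl, `(w - 3)^2`, whose gradient we can write down by hand, and `step_size` is a hypothetical name for how far we move on each step.

```python
def loss(w):
    """A toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Slope of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, step_size, n_steps):
    """Repeatedly step in the direction of the negative gradient."""
    w = w_start
    history = [loss(w)]
    for _ in range(n_steps):
        w = w - step_size * gradient(w)  # go opposite to the gradient
        history.append(loss(w))
    return w, history


w_final, losses = gradient_descent(w_start=0.0, step_size=0.1, n_steps=50)
print(w_final, losses[0], losses[-1])
```

Each step moves `w` a little closer to the bottom of the bowl, so the recorded losses shrink from step to step.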
So, we pass in some data, some feature input x, and it goes through this linear transformation, which depends on our weights and biases, then through some activation function like a sigmoid, then through another linear transformation with some more weights and biases, and from that we can calculate our loss. So, if we make a small change in our weights here, W1, it’s going to propagate through the network and end up resulting in a small change in our loss. You can think of this as a chain of changes. If we change this here, then this is going to change, and that’s going to propagate through here, and through here, and through here. With backpropagation, we actually use these same changes, but we go in the opposite direction. For each of these operations, like the loss, the linear transformations, and the sigmoid activation function, there’s always going to be some derivative, some gradient between the outputs and inputs. So, what we do is take each of the gradients for these operations and pass them backwards through the network. At each step we multiply the incoming gradient by the gradient of the operation itself. So, for example, starting at the end with the loss, we take the gradient of the loss, dℓ/dL2. This is the gradient of the loss with respect to the second linear transformation. Then we pass that backwards again and multiply it by the gradient of this L2, so the gradient of the linear transformation with respect to the outputs of our activation function. That gives us the gradient for this operation. If you multiply this gradient by the gradient coming from the loss, then we get the total gradient for both of these parts, and this gradient can be passed back to the sigmoid function.
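Here is a small worked version of that chain of gradients, written in plain Python so every multiplication is visible. It is a sketch under simplifying assumptions: all quantities are single numbers rather than tensors, the loss is a squared error, and the names `w1`, `b1`, `w2`, `b2` and the input values are made up for illustration. The result is checked against a numerical estimate of the slope.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, y, w1, b1, w2, b2):
    """Forward pass: linear -> sigmoid -> linear -> squared-error loss."""
    l1 = w1 * x + b1      # first linear transformation
    s = sigmoid(l1)       # activation
    l2 = w2 * s + b2      # second linear transformation
    loss = (l2 - y) ** 2  # how far the output is from the target
    return l1, s, l2, loss


def backward_w1(x, y, w1, b1, w2, b2):
    """Gradient of the loss with respect to w1 via the chain rule."""
    l1, s, l2, _ = forward(x, y, w1, b1, w2, b2)
    dloss_dl2 = 2.0 * (l2 - y)  # gradient of the loss operation
    dl2_ds = w2                 # gradient of the second linear transformation
    ds_dl1 = s * (1.0 - s)      # gradient of the sigmoid
    dl1_dw1 = x                 # gradient of the first linear transformation
    # Pass the gradient backwards, multiplying at each operation.
    return dloss_dl2 * dl2_ds * ds_dl1 * dl1_dw1


# Made-up values for illustration.
x, y = 0.5, 1.0
w1, b1, w2, b2 = 0.8, 0.1, -1.2, 0.3

analytic = backward_w1(x, y, w1, b1, w2, b2)

# Numerical check: nudge w1 slightly and see how much the loss changes.
eps = 1e-6
loss_plus = forward(x, y, w1 + eps, b1, w2, b2)[3]
loss_minus = forward(x, y, w1 - eps, b1, w2, b2)[3]
numeric = (loss_plus - loss_minus) / (2.0 * eps)

print(analytic, numeric)
```

The two numbers should agree closely, which is a useful sanity check that the product of the per-operation gradients really is the gradient of the loss with respect to `w1`.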
So, that’s the general process for backpropagation. We take our gradient, pass it backwards to the previous operation, multiply it by the gradient there, and then pass that total gradient backwards. We just keep doing that through each of the operations in our network, and eventually we’ll get back to our weights. What this does is allow us to calculate the gradient of the loss with respect to these weights. Like I was saying before, the gradient points in the direction of fastest increase in our loss, so the direction that maximizes it. So, if we want to minimize our loss, we can subtract the gradient from our weights, and this will give us a new set of weights that will in general result in a smaller loss. So, the way this works with backpropagation is that we make a forward pass through the network and calculate the loss, and then once we have the loss, we can go backwards through our network and calculate the gradient for our weights. Then we update the weights, do another forward pass, calculate the loss, do another backward pass, update the weights, and so on and so on, until we get a sufficiently minimized loss. So, once we have the gradient, like I was saying before, we can subtract it from our weights, but we also use this term alpha, which is called the learning rate. This is basically just a way to scale our gradients so that we’re not taking steps that are too large in this iterative process. What can happen if your update steps are too large is that you bounce around in this trough around the minimum and never actually settle in the minimum of the loss. So, let’s see how we can actually calculate losses in PyTorch. Again using the nn module, PyTorch provides us with a lot of different losses, including the cross-entropy loss. This loss is what we’ll typically use when we’re doing classification problems. In PyTorch, the convention is to assign our loss to the variable criterion.
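Coming back to the learning rate for a moment, the bouncing around the minimum is easy to see on a toy problem. This is a plain-Python sketch, assuming a made-up one-parameter loss `w^2` with its minimum at zero; the function name `train` and the specific learning rates are chosen only for illustration.

```python
def gradient(w):
    """Gradient of the toy loss w**2, which has its minimum at w = 0."""
    return 2.0 * w


def train(w_start, learning_rate, n_steps):
    """Apply the update w <- w - learning_rate * gradient, n_steps times."""
    w = w_start
    trajectory = [w]
    for _ in range(n_steps):
        w = w - learning_rate * gradient(w)
        trajectory.append(w)
    return w, trajectory


# A small learning rate shrinks w towards the minimum on every step.
w_small, path_small = train(w_start=2.0, learning_rate=0.1, n_steps=20)

# A learning rate that is too large overshoots: each step jumps clear across
# the trough to the mirror-image point, so w flips between 2 and -2 forever.
w_large, path_large = train(w_start=2.0, learning_rate=1.0, n_steps=20)

print(w_small, path_large[:4])
```

With the smaller learning rate the weight settles close to zero, while with the larger one it keeps jumping from one side of the trough to the other without ever getting closer.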
So, if we wanted to use cross-entropy, we just say criterion equals nn.CrossEntropyLoss and create an instance of that class. One thing to note is that, if you look at the documentation for cross-entropy loss, you’ll see that it actually wants the scores, the logits of our network, as the input to the cross-entropy loss. Typically you’d be using this with an output such as softmax, which gives us this nice probability distribution. But for computational reasons, it’s generally better to use the logits, which are the input to the softmax function, as the input to this loss. So, the input is expected to be the scores for each class and not the probabilities themselves. First I’m going to import the necessary modules here, download our data, and load it, like you’ve seen before, into a trainloader, so we can get our data out of here. Here I’m defining a model. I’m using nn.Sequential, and if you haven’t seen this, check out the end of the previous notebook. The end of part two will show you how to use nn.Sequential. It’s just a somewhat more concise way to define simple feed-forward networks, and you’ll notice here that I’m actually only returning the logits, the scores from our output layer, and not the softmax output itself. Then here we can define our loss, so criterion is equal to nn.CrossEntropyLoss. We get our data with images and labels, flatten it, pass it through our model to get the logits, and then we can get the actual loss by passing in our logits and the true labels, and again, we get the labels from our trainloader. If we do this, we see we have calculated the loss. In my experience, it’s more convenient to build your model using a log-softmax output instead of just a normal softmax. With a log-softmax output, to get the actual probabilities, you just pass it through torch.exp, so the exponential. With a log-softmax output, you’ll want to use the negative log-likelihood loss, or nn.NLLLoss.
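The steps just described can be sketched roughly as follows. This is not the notebook’s exact code: to keep it self-contained, it uses random tensors as a stand-in for one batch from the trainloader, and the layer sizes and batch size are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for one batch from a trainloader (assumed shapes):
# 64 single-channel 28x28 images and one digit label per image.
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

# A simple feed-forward network that returns logits, not softmax output.
# The hidden layer sizes are assumptions for this sketch.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# By convention, the loss is assigned to a variable called criterion.
criterion = nn.CrossEntropyLoss()

# Flatten each image into a vector of 784 values.
flat_images = images.view(images.shape[0], -1)

# Forward pass to get the logits, then compare them with the true labels.
logits = model(flat_images)
loss = criterion(logits, labels)

print(loss)
```

Note that the raw scores go straight into the criterion. There is no softmax layer in the model, because `nn.CrossEntropyLoss` expects the logits.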
So, what I want you to do here is build a model that returns the log-softmax as the output, and calculate the loss using the negative log-likelihood loss. When you’re using log-softmax, make sure you pay attention to the dim keyword argument. You want to make sure you set it right so that the output is what you want. So, go and try this, and if you’re having problems, feel free to check out my solution. It’s in the notebook and also in the next video. Cheers.
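As a hint about the dim keyword argument, and not a full solution to the exercise, here is a small sketch on made-up scores. It assumes the usual layout where each row is one example and each column is one class, so the softmax should be taken across the columns with `dim=1`.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

# Hypothetical scores: 4 examples (rows), 10 classes (columns).
logits = torch.randn(4, 10)
labels = torch.tensor([4, 0, 9, 1])

# dim=1 normalizes across the classes of each example.
log_probs = F.log_softmax(logits, dim=1)
probs = torch.exp(log_probs)  # back to actual probabilities
row_sums = probs.sum(dim=1)   # each example's probabilities sum to 1

# dim=0 would instead normalize across the batch: every column sums to 1,
# and the rows generally do not, which is not what we want here.
across_batch = torch.exp(F.log_softmax(logits, dim=0))
column_sums = across_batch.sum(dim=0)

# Negative log-likelihood on log-softmax output gives the same value
# as cross-entropy on the raw logits.
nll = nn.NLLLoss()(log_probs, labels)
cross_entropy = nn.CrossEntropyLoss()(logits, labels)

print(row_sums, nll, cross_entropy)
```

If the rows of the exponentiated output do not sum to one, that is usually a sign that dim is set to the wrong axis.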