7 – PyTorch V2 Part 2 Solution 2 V1

Welcome back. Here is my solution for the softmax function. In the numerator, we know we want to take the exponential, so it’s pretty straightforward with torch.exp. So we’re going to use the exponential of x, which is our input tensor. In the denominator, we again take the exponentials with torch.exp, and then take the sum across all those values. One thing we need to remember is that we want the sum across one single row, so across each of the columns in one single row, for each example. So, for one example, we want to sum up those values. Here in torch.sum, we’re going to use dim=1, which takes the sum across the columns.
What torch.sum gives us here is a tensor that is just a vector of 64 elements. The problem with this is that, if the numerator is 64 by 10 and the denominator is just a 64-long vector, PyTorch is going to try to line up those 64 values with the 10 columns of the numerator instead of with its rows, and the shapes don’t match. That’s not what we want. We want each row divided by its own sum, and our output to be 64 by 10. So, what you actually need to do is reshape the denominator to have 64 rows, but only one value for each of those rows. Then, for each row in the numerator, PyTorch looks at the equivalent row in the denominator. Since each row in the denominator only has one value, it divides every exponential in that row by that one value. This can be really tricky, but it’s also super important to understand how broadcasting works in PyTorch, and how to actually fit all these tensors together with the correct shape and the correct operations to get everything out right.
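As a minimal sketch of the solution described here: the function below exponentiates the input, sums across the columns with dim=1, and reshapes the sums to one value per row so the division broadcasts row by row. The random tensor `out` is only an assumed stand-in for the 64 by 10 network output from the earlier exercise.

```python
import torch

def softmax(x):
    """Row-wise softmax for a 2-D tensor of shape (batch, classes)."""
    exp_x = torch.exp(x)
    # Sum over the columns (dim=1): one value per example, shape (batch,).
    row_sums = torch.sum(exp_x, dim=1)
    # Reshape to (batch, 1) so each row is divided by its own sum.
    return exp_x / row_sums.view(-1, 1)

torch.manual_seed(7)
out = torch.randn(64, 10)  # assumed stand-in for the network output

probabilities = softmax(out)
print(probabilities.shape)       # torch.Size([64, 10])
print(probabilities.sum(dim=1))  # every entry should be very close to 1
```

Passing keepdim=True to torch.sum is another way to keep the (batch, 1) shape without a separate reshape.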
So, if we do this, look what we have: we pass our output through the softmax function, and we get our probabilities. We can look at the shape, and it is 64 by 10, and if you take the sum across each of the rows, it adds up to one, like it should with a proper probability distribution.
So, now, we’re going to look at how you use the nn module to build neural networks. You’ll find that it’s in a lot of ways simpler and more powerful. You’ll be able to build larger and larger neural networks using the same framework. The way this works in general is that we’re going to create a new class. You can call it Network, you can call it whatever you want, you can call it Classifier, you can call it MNIST. It doesn’t really matter so much what you call it, but you need to subclass it from nn.Module. Then, in the __init__ method, you need to call super and run the __init__ method of nn.Module. You need to do this because then PyTorch will know to register all the different layers and operations that you’re going to be putting into this network. If you don’t do this part, it won’t be able to track the things that you’re adding to your network, and it just won’t work.
So, here, we can create our hidden layer using nn.Linear. What this does is create an operation for the linear transformation. When we take our inputs x, multiply them by weights, and add the bias terms, that’s a linear transformation. Calling nn.Linear creates an object that itself has created parameters for the weights and parameters for the bias, and then, when you pass a tensor through this hidden layer, this object automatically calculates the linear transformation for you. So, all you really need to do is tell it the size of the inputs and the size of the output. So, 784 inputs and 256 outputs for this one.
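To make the nn.Linear description concrete, here is a small sketch that creates the 784-to-256 layer on its own, inspects the parameters it creates, and checks the result against the same linear transformation written out by hand. The batch of 64 random inputs is an assumption for illustration.

```python
import torch
from torch import nn

# A linear layer with 784 inputs and 256 outputs.
hidden = nn.Linear(784, 256)

# The layer creates its own weight and bias parameters.
print(hidden.weight.shape)  # torch.Size([256, 784])
print(hidden.bias.shape)    # torch.Size([256])

# Assumed input: a batch of 64 flattened 28x28 images.
x = torch.randn(64, 784)
h = hidden(x)
print(h.shape)              # torch.Size([64, 256])

# The same transformation by hand: inputs times weights, plus bias.
manual = torch.mm(x, hidden.weight.t()) + hidden.bias
print(torch.allclose(h, manual, atol=1e-4))
```

Note that the weight is stored with shape (outputs, inputs), which is why it is transposed in the hand-written version.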
So, it’s kind of rebuilding the network that we saw before. Similarly, we want another linear transformation between our hidden units and our output. Again, we have 256 hidden units and 10 output units, so we’re going to create an output layer called self.output with this linear transformation operation. We also want to create a sigmoid operation for the activation, and then softmax for the output, so we get this probability distribution.
Now, we’re going to create a forward method. Forward is basically what happens as we pass a tensor into the network: it goes through all these operations and eventually gives us our output. Here, x, the argument, is going to be the input tensor, and we pass it through our hidden layer, which is the linear transformation that we defined up here. Then it goes through the sigmoid activation, then through our output layer, the output linear transformation we have here, then through the softmax function, and finally we return the output of our softmax. So we can create this. Then, if we look at it, it’ll print out the operations that we have defined for this network, not necessarily in the order they are applied, but at least it tells us what they are.
You can also use functional definitions for things like sigmoid and softmax, and it makes the way you write the class a little bit cleaner. We can get those from torch.nn.functional. Most of the time, you’ll see it as import torch.nn.functional as capital F, so there’s that convention in PyTorch code. Again, we define our linear transformations, self.hidden and self.output, but now the activations go in our forward method. We call self.hidden to get our values for the hidden layer, then pass them through the sigmoid function, F.sigmoid, and do the same thing with the output layer.
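The class-based network walked through so far, with sigmoid and softmax created as module objects in __init__, might look like the sketch below. The class name `Network`, the attribute names `self.sigmoid` and `self.softmax`, and the random input batch are assumptions; the layer sizes come from the video.

```python
import torch
from torch import nn

class Network(nn.Module):
    def __init__(self):
        # Run nn.Module's __init__ so PyTorch can register the layers below.
        super().__init__()
        self.hidden = nn.Linear(784, 256)  # inputs -> hidden units
        self.output = nn.Linear(256, 10)   # hidden units -> output units
        self.sigmoid = nn.Sigmoid()        # hidden-layer activation
        self.softmax = nn.Softmax(dim=1)   # probabilities across each row

    def forward(self, x):
        # Pass the input tensor through each operation in turn.
        x = self.hidden(x)
        x = self.sigmoid(x)
        x = self.output(x)
        x = self.softmax(x)
        return x

model = Network()
print(model)  # lists the operations registered on the network

probs = model(torch.randn(64, 784))  # assumed batch of 64 flattened images
print(probs.shape)                   # torch.Size([64, 10])
```

Printing the model lists the registered modules in the order they were assigned in __init__, which is not necessarily the order in which forward applies them.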
So, we have our output linear transformation, and we pass its output through the softmax operation. The reason we can do this is that, when we create these linear transformations, each one creates its weight and bias matrices on its own. But sigmoid and softmax are just operations applied to the values, so they don’t have to create any extra parameters or extra matrices, and we can have them be purely functional without creating any objects or classes. However, they are equivalent. This way to build the network is equivalent to the way up here, but it’s a little bit more succinct when you’re doing it with this kind of functional pattern.
So far, we’ve only been using the sigmoid function as an activation function, but there are, of course, a lot of different ones you might want to use. Really, the only requirement is that these activation functions should typically be non-linear. If you want your network to be able to learn non-linear correlations and patterns, and you want the output to be non-linear, then you need to use non-linear activation functions in your hidden layers. A sigmoid is one example. The hyperbolic tangent is another. One that is used pretty much all the time, almost exclusively, as the activation function in hidden layers is the ReLU, the rectified linear unit. This is basically the simplest non-linear function that you can use, and it turns out that networks tend to train a lot faster when using ReLU as compared to sigmoid and hyperbolic tangent, so ReLU is what we typically use.
Okay. So, here, you’re going to build your own neural network that’s larger. This time, it’s going to have two hidden layers, and you’ll be using the ReLU activation function on your hidden layers.
So, using this object-oriented class method with nn.Module, go ahead and build a network that looks like this, with 784 input units, 128 units in the first hidden layer, 64 units in the second hidden layer, and then 10 output units. All right. Cheers.
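For reference, the functional pattern from earlier in the video can be sketched as follows for the 784-256-10 network, followed by the three activation functions mentioned above applied to a few sample values. The class name `FunctionalNetwork` and the sample inputs are assumptions. The sketch uses torch.sigmoid, which computes the same function as the F.sigmoid named in the video.

```python
import torch
import torch.nn.functional as F
from torch import nn

class FunctionalNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Only the layers with parameters are created as objects.
        self.hidden = nn.Linear(784, 256)
        self.output = nn.Linear(256, 10)

    def forward(self, x):
        # Activations are plain function calls; they hold no parameters.
        x = torch.sigmoid(self.hidden(x))
        return F.softmax(self.output(x), dim=1)

model = FunctionalNetwork()
probs = model(torch.randn(64, 784))  # assumed batch of 64 flattened images
print(probs.shape)                   # torch.Size([64, 10])

# The activation functions mentioned in the video, on sample values.
z = torch.tensor([-2.0, 0.0, 2.0])
print(torch.sigmoid(z))  # squashes values into (0, 1)
print(torch.tanh(z))     # squashes values into (-1, 1)
print(F.relu(z))         # tensor([0., 0., 2.])
```

ReLU simply replaces negative values with zero and leaves positive values unchanged, which is why it is described as about the simplest non-linear function available.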