Here is my solution for this multi-layer neural network for classifying handwritten digits from the MNIST dataset. So, here I’ve defined our activation function like before, so, again this is the sigmoid function and here I’m flattening the images. So, remember how to reshape your tensors. So, here I’m using.view. So, I’m just grabbing the batch size. So, images.shape. The first element zero here, gives you the number of batches in your images tensor. So, I want to keep the number of batches the same, but I want to flatten the rest of the dimensions. So, to do this, you actually can just put in negative one. So, I could type in 784 here but a kind of a shortcut way to do this is to put in negative one. So, basically what this does is it takes 64 as your batch size here and then when you put a negative one it sees this and then it just chooses the appropriate size to get the total number of elements. So, it’ll work out on its own that it needs to make the second dimension, 784 so that the number of elements after reshaping matches the number elements before reshaping. So, this is just a kind of quick way to flatten a tensor without having to know what the second dimension used to be. Then here I’m just creating our weight and bias parameters. So, we know that we want an input of 784 units and we want 256 hidden units. So, our first weight matrix is going to be 784 by 256. Then, we need a bias term for each of our hidden units. So we have 256 bias terms here in b1. Then, for our second weight’s going from the hidden layer to the output layer we want 256 inputs to 10 outputs. Then again 10 elements in our bias. Before we can do a matrix multiplication of our inputs with the first set of weights, our first weight parameters, add in the bias terms and passes through our activation functions so that gives us the output of our hidden layer. Then we can use that as the input to our output layer, and again, a matrix multiplication with a second set of weights and the second set of bias terms. This gives us the output of our network. All right. So, if we look at the output of this network, we see that we get those 64. So, first let me print the shape just to make sure we did that right. So, 64 rows for one of each of our sort of input examples and then 10 values, so, basically it’s a value that’s trying to say this image belongs to this class like this digit. So, we can inspect our output tensor and see what’s going on here. So, we see these values are just sort of all over the place. So, you got like six and negative 11 and so on. But we really want is we want our network to kind of tell us the probability of our different classes given some image. So, kind of we want to pass in an image to our network and then the output should be a probability distribution that tells us which are the most likely classes or digits that belong to this image. So, if it’s the image of a six, then we want a probability distribution where most of the probability is in the sixth class. So, it’s telling us that it’s a number six. So, we want it to look something like this. This is like a class probability. So, it’s telling us the probabilities of the different classes given this image that we’re passing in. So, you can see that the probability for each of these different classes is roughly the same, and so it’s a uniform distribution. This represents an untrained network, so it’s a uniform probability distribution. It’s because it hasn’t seen any data yet, so it hasn’t learned anything about these images. So, whenever you give an image to it, it doesn’t know what it is so it’s just going to give an equal probability to each class, regardless of the image that you pass in. So, what we want is we want the output of our network to be a probability distribution that gives us the probability that the image belongs to any one of our classes. So for this, we use the softmax function. So what this looks like is the exponential. So,you pass in your 10 values. So, for each of those values, we calculate the exponential of that value divided by the sum of exponentials of all the values. So, what this does is it actually kind of squishes each of the input values x between zero and one, and then also normalizes all the values so that the sum of each of the probabilities is one. So, the entire thing sums up to one. So, this gives you a proper probability distribution. What I want you to do here is actually implement a function called softmax that performs this calculation. So, what you’re going to be doing is taking the output from this simple neural network and has shaped 64 by 10 and pass it through a dysfunction softmax and make sure it calculates the probability distribution for each of the different examples that we passed in. Right? Good luck.