So now we’re finally ready to get our hands dirty training a neural network. Let’s quickly recall feedforward. We have our perceptron with a point coming in labeled positive, and our equation w1x1 + w2x2 + b, where w1 and w2 are the weights and b is the bias. What the perceptron does is plot the point and return the probability that the point is blue, which in this case is small, since the point is in the red area. Thus, this is a bad perceptron, because it predicts that the point is red when the point is really blue.

Now let’s recall what we did in the gradient descent algorithm. We did this thing called backpropagation: we went in the opposite direction. We asked the point, “What do you want the model to do for you?” And the point says, “Well, I’m misclassified, so I want this boundary to come closer to me.” And we saw that the line got closer to it by updating the weights. Namely, in this case, let’s say it tells the weight w1 to go lower and the weight w2 to go higher. (This is just an illustration; it’s not meant to be exact.) So we obtain new weights, w1′ and w2′, which define a new line that is closer to the point.

So what we’re doing is like descending from Mt. Errorest. The height is the error function E(W), and we calculate the gradient of the error function, which is exactly like asking the point what it wants the model to do. As we take a step in the direction of the negative of the gradient, we decrease the error and come down the mountain. This gives us a new model W′ with a smaller error E(W′), which means we get a new line closer to the point. We continue this process in order to minimize the error.

So that was for a single perceptron. Now, what do we do for multi-layer perceptrons? Well, we still do the same process of reducing the error by descending from the mountain, except now, since the error function is more complicated, it’s not Mt. Errorest anymore; it’s Mt. Kilimanjerror. But it’s the same thing: we calculate the error function and its gradient, then walk in the direction of the negative of the gradient in order to find a new model W′ with a smaller error E(W′), which will give us a better prediction. And we continue doing this process in order to minimize the error.

So let’s look again at what feedforward does in a multi-layer perceptron. The point comes in with coordinates (x1, x2) and label y = 1. It gets plotted in the linear models corresponding to the hidden layer. Then, as this layer gets combined, the point gets plotted in the resulting non-linear model in the output layer, and the probability that the point is blue is obtained from the position of the point in this final model.

Now, pay close attention, because this is the key to training neural networks: backpropagation. We’ll do as before and check the error. This model is not good, because it predicts that the point is red when in reality the point is blue. So we’ll ask the point, “What do you want this model to do in order to classify you better?” And the point says, “I kind of want this blue region to come closer to me.” Now, what does it mean for the region to come closer to it? Well, let’s look at the two linear models in the hidden layer. Which of these two models is doing better? It seems like the top one is badly misclassifying the point, whereas the bottom one is classifying it correctly. So we want to listen to the bottom one more and to the top one less. What we want to do is reduce the weight coming from the top model and increase the weight coming from the bottom model, so that our final model looks a lot more like the bottom model than like the top model. But we can do even more.
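Before going further, the feedforward pass just described can be sketched in code. This is a minimal illustration, not a real training library: the point, the two hidden linear models, and the output weights are made-up values, chosen so that the top model misclassifies the point and the bottom one classifies it correctly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W_hidden, b_hidden, w_out, b_out):
    """Probability that the point x is blue, per the combined non-linear model."""
    h = sigmoid(W_hidden @ x + b_hidden)   # outputs of the two hidden linear models
    return sigmoid(np.dot(w_out, h) + b_out)

x = np.array([1.0, 1.0])                   # the point (x1, x2); its label is y = 1 (blue)

# Hypothetical weights: the top row misclassifies x, the bottom row gets it right.
W_hidden = np.array([[ 1.0, -2.0],
                     [-1.0,  2.0]])
b_hidden = np.array([0.0, 0.0])
w_out, b_out = np.array([2.0, 0.5]), -1.0  # output layer currently trusts the top model more

p = feedforward(x, W_hidden, b_hidden, w_out, b_out)            # below 0.5: predicts red

# Listening to the bottom model more and the top model less:
w_better = np.array([0.5, 2.0])
p_better = feedforward(x, W_hidden, b_hidden, w_better, b_out)  # above 0.5: predicts blue
```

Shifting output weight from the top model to the bottom one raises the predicted probability of blue, which is exactly the "listen to the bottom model more" update just described.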
We can actually go to the linear models themselves and ask the point, “What can these models do to classify you better?” And the point will say, “Well, the top model is misclassifying me, so I kind of want its line to move closer to me. And the bottom model is correctly classifying me, so I want its line to move farther away from me.” This change in the models is what actually updates the weights; let’s say it increases these two and decreases these two. So after we update all the weights, we have better predictions at all the models in the hidden layer and also a better prediction at the model in the output layer. Notice that in this video we intentionally left the bias unit out for clarity. In reality, when we update the weights, we’re also updating the bias unit. And if you’re the kind of person who likes formality, don’t worry: we’ll calculate these gradients in detail soon.
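Taken together, these updates are one step of backpropagation. Here is a hedged sketch of that step for the same kind of small network, assuming sigmoid activations and the cross-entropy error; the weight values, names, and learning rate are illustrative, and the detailed gradient calculation comes later. Note that the biases are updated along with the weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, w2, b2, lr=0.5):
    """One gradient-descent update for a single point x with label y (0 or 1)."""
    # Feedforward through the hidden layer and the output layer
    h = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(np.dot(w2, h) + b2)
    # Error signal at the output (cross-entropy + sigmoid): y_hat - y
    delta_out = y_hat - y
    # Propagate the error back through the output weights to the hidden layer
    delta_hidden = delta_out * w2 * h * (1.0 - h)
    # Step in the direction of the negative gradient, biases included
    w2 = w2 - lr * delta_out * h
    b2 = b2 - lr * delta_out
    W1 = W1 - lr * np.outer(delta_hidden, x)
    b1 = b1 - lr * delta_hidden
    return W1, b1, w2, b2

# A misclassified blue point and illustrative starting weights:
x, y = np.array([1.0, 1.0]), 1
W1 = np.array([[1.0, -2.0], [-1.0, 2.0]])
b1 = np.array([0.0, 0.0])
w2, b2 = np.array([1.0, 1.0]), -1.0

def predict(W1, b1, w2, b2):
    return sigmoid(np.dot(w2, sigmoid(W1 @ x + b1)) + b2)

p_before = predict(W1, b1, w2, b2)
W1, b1, w2, b2 = backprop_step(x, y, W1, b1, w2, b2)
p_after = predict(W1, b1, w2, b2)
# After one step, the predicted probability of blue for this point increases:
# all the hidden-layer weights moved, not just the output weights.
```

One step updates every weight and bias in both layers at once, which is the “ask the point at every layer” picture: repeating the step drives the error down the mountain.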