11 – DL 46 Calculating The Gradient 2 V2 (2)

So, let us go back to our neural network with our weights and our input. Recall that the weights with superscript 1 belong to the first layer, and the weights with superscript 2 belong to the second layer. Also, recall that the bias is no longer called b; it is now called W31, W32, etc. for convenience, so that we can have everything in matrix notation. Now, what happens with the input? Let us do the feedforward process. In the first layer, we take the input and multiply it by the weights, and that gives us h1, which is a linear function of the input and the weights. The same goes for h2, given by this formula over here. In the second layer, we take h1 and h2, apply the sigmoid function, and then apply a linear function by multiplying by the weights and adding the new bias to get a value h. Finally, in the third layer, we just take the sigmoid of h to get our prediction, a probability between 0 and 1, which is ŷ. We can write this in more condensed notation by saying that the matrix corresponding to the first layer is W superscript 1 and the matrix corresponding to the second layer is W superscript 2; then the prediction is simply the sigmoid of W superscript 2 applied to the sigmoid of W superscript 1 applied to the input x. That is feedforward.

Now, we are going to develop backpropagation, which is precisely the reverse of feedforward. We are going to calculate the derivative of the error function with respect to each of the weights in all the layers by using the chain rule. Recall that our error function is this formula over here, which is a function of the prediction ŷ. But since the prediction is a function of all the weights wij, the error function can be seen as a function of all the wij. Therefore, the gradient is simply the vector formed by all the partial derivatives of the error function E with respect to each of the weights.

So, let us calculate one of these derivatives, say the derivative of E with respect to W11 superscript 1. Since the prediction is simply a composition of functions, by the chain rule, the derivative with respect to this weight is the product of all the partial derivatives along the way. In this case, the derivative of E with respect to W11 is the derivative of E with respect to ŷ, times the derivative of ŷ with respect to h, times the derivative of h with respect to h1, times the derivative of h1 with respect to W11. This may seem complicated, but the fact that we can calculate the derivative of such a complicated composition of functions by just multiplying four partial derivatives is remarkable. Now, we have already calculated the first one, the derivative of E with respect to ŷ, and if you remember, we got ŷ minus y.

So, let us calculate the other ones. Let us zoom in a bit and look at just one piece of our multi-layer perceptron. The inputs are some values h1 and h2, which are values coming in from before. Once we apply the sigmoid and a linear function to h1 and h2 and to the 1 corresponding to the bias unit, we get a result h. So, what is the derivative of h with respect to h1? Well, h is a sum of three things, and only one of them contains h1. The second and the third summands just give a derivative of 0. The first summand gives us W11 superscript 2, because that is a constant, times the derivative of the sigmoid function with respect to h1.
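To make the walkthrough concrete, here is a minimal NumPy sketch of this feedforward pass and of the four-factor product for the single weight W11 superscript 1. The network sizes, the example numbers, and all names (sigmoid, feedforward, W1, W2, grad, and so on) are assumptions for illustration only, not code from the course; the error function is taken to be the cross-entropy mentioned in earlier lessons.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed example: 2 inputs, 2 hidden units, 1 output; the bias is folded
# into the last row of each weight matrix (the W31, W32 convention above).
x  = np.array([0.5, -0.2, 1.0])              # [x1, x2, 1]; the 1 feeds the bias weights
W1 = np.array([[0.1, -0.3],
               [0.4,  0.2],
               [0.05, 0.1]])                 # W1[i-1, j-1] = Wij^(1); last row is the bias
W2 = np.array([0.3, -0.1, 0.2])              # [W11^(2), W21^(2), W31^(2)]; last is the bias
y  = 1.0                                     # target label

def feedforward(W1):
    h1, h2 = x @ W1                                       # first layer: linear
    h = W2 @ np.array([sigmoid(h1), sigmoid(h2), 1.0])    # second layer: sigmoid, then linear
    return h1, h2, h, sigmoid(h)                          # y_hat = sigmoid(h)

def error(y_hat):                            # cross-entropy error
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

h1, h2, h, y_hat = feedforward(W1)

# Backpropagation for one weight, dE/dW11^(1), as a product of four factors.
dE_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
dyhat_dh = y_hat * (1 - y_hat)               # sigmoid derivative at h
dh_dh1   = W2[0] * sigmoid(h1) * (1 - sigmoid(h1))
dh1_dW11 = x[0]                              # h1 is linear in W11^(1)
grad = dE_dyhat * dyhat_dh * dh_dh1 * dh1_dW11
# Note: the product dE_dyhat * dyhat_dh simplifies to (y_hat - y).

# Sanity check against a numerical derivative.
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps                        # nudge W11^(1)
numeric = (error(feedforward(W1_shift)[3]) - error(y_hat)) / eps
print(grad, numeric)                         # the two values should agree closely
```

Running the sanity check at the end confirms that multiplying the four partial derivatives really does reproduce the derivative of the full composition, which is the whole point of the chain rule here.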
This is something we calculate below in the instructor comments: the sigmoid function has a beautiful derivative, namely, the derivative of sigmoid of h is precisely sigmoid of h times 1 minus sigmoid of h. Again, you can see this derivation underneath in the instructor comments. You also have the chance to code this in the quiz because, at the end of the day, we just code these formulas and then use them forever, and that is it. That is how you train a neural network.
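Since the section leans on this fact, here is that short derivation written out. It is the standard calculation, not quoted from the instructor comments:

```latex
\sigma(h) = \frac{1}{1 + e^{-h}}
\quad\Longrightarrow\quad
\sigma'(h) = \frac{e^{-h}}{\bigl(1 + e^{-h}\bigr)^{2}}
           = \frac{1}{1 + e^{-h}} \cdot \frac{e^{-h}}{1 + e^{-h}}
           = \sigma(h)\,\bigl(1 - \sigma(h)\bigr)
```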
