We now need to calculate the gradient, and we will do that one step at a time. In our example we have only one hidden layer, so the backpropagation process will have two steps. Let's be more precise now and say that the gradient calculated for each element ij in the matrix is called delta ij. In step number one, we will calculate the gradient with respect to the weight vector W2, from the output to the hidden layer. And in step two, we will calculate the gradient with respect to the weight matrix W1, from the hidden layer to the input.

Okay, so let's start with step number one. We already calculated y, and we will now find its partial derivative with respect to the weight vector W2. Given that y is a linear summation over terms, you will find that the gradient with respect to each weight is simply the value of the corresponding activation hi, since the derivatives of all the other terms are zero. This holds for gradient one, gradient two, and gradient three. You probably noticed that we have only one index for delta here, and that's because we have a single output. Previously, we saw that the incremental value delta Wij equals the learning rate alpha multiplied by d minus y and multiplied by the gradient. So at the second layer, the incremental value delta Wi equals alpha multiplied by d minus y and by hi.

In our second step, we want to update the weights of layer one by calculating the partial derivative of y with respect to the weight matrix W1. Here is where things get a little more interesting. When calculating the gradient with respect to the weight matrix W1, we need to use the chain rule. The approach: obtain the partial derivative of y with respect to h and multiply it by the partial derivative of h with respect to the corresponding elements of W1. In this example, we only have three neurons in the single hidden layer, so this will be a linear combination of three elements. Let's calculate each derivative separately.
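Before diving into step two, the step-one update we just derived can be sketched in code. This is a minimal sketch, assuming a single scalar output y, three hidden activations, and the update rule alpha times d minus y times hi from the text; all concrete numbers are invented for illustration.

```python
import numpy as np

alpha = 0.1                      # learning rate
h = np.array([0.5, 0.2, 0.9])    # hidden-layer activations h1..h3 (illustrative values)
W2 = np.array([0.3, -0.1, 0.4])  # weights from hidden layer to the output
d = 1.0                          # desired (target) output

y = W2 @ h                       # linear output: y = sum_i W2_i * h_i

# The gradient of y with respect to each W2_i is simply h_i,
# so the incremental update is alpha * (d - y) * h_i for every i.
delta_W2 = alpha * (d - y) * h
W2 = W2 + delta_W2
```

Note that the whole update is a single vector operation: the gradient with respect to W2 is just the activation vector h itself.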
Since y is a linear combination of h and its corresponding weights, its partial derivative with respect to h will be the weight elements of vector W2. Now, what is the partial derivative of each element of vector h with respect to its corresponding weights in matrix W1? Let's see. We've already considered each element of vector h separately: here is h1, and we also have h2 and h3. If we generalize this, each element j is an activation function of a corresponding linear combination. Finding its partial derivative means finding the partial derivative of the activation function and multiplying it by the partial derivative of the linear combination, all, of course, with respect to the correct elements of the weight matrix W1. Feel free to pause anytime you need to catch up on your notes.

As I said before, there are various activation functions, so let's just call the partial derivative of the activation function f prime. Each neuron j will have its own values of f and f prime, according to the activation function you use. The partial derivative of the linear combination with respect to Wij is simply xi, since the derivatives of all the other components are zero. So the partial derivative of hj with respect to the element Wij of the weight matrix W1 is simply f prime at neuron j multiplied by xi.

We now have all the pieces required for step number two, giving us the gradient delta ij. We know that the gradient of the output y with respect to each of the elements in matrix W1 is the product of the two partial derivatives we just calculated. Since in this example we have two inputs and three hidden neurons, we will have six gradients to calculate: delta one one, delta one two, delta one three, delta two one, delta two two, and finally, delta two three. After finding the gradient in step two, finding the incremental value of Wij is immediate. Again, in the case of the loss function we are using here, it is simply the gradient multiplied by alpha and by d minus y.
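The step-two chain rule can be sketched the same way. This assumes, purely for illustration, a sigmoid activation (any differentiable activation works, with its own f prime), two inputs, three hidden neurons, and the same scalar output as before; all six gradients delta ij come out of one outer product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.1
x = np.array([0.5, -1.0])                 # inputs x1, x2 (illustrative values)
W1 = np.array([[0.2, -0.3, 0.5],
               [0.7,  0.1, -0.4]])        # W1[i, j]: input i -> hidden neuron j
W2 = np.array([0.3, -0.1, 0.4])           # hidden layer -> output
d = 1.0

z = x @ W1          # the linear combination feeding each hidden neuron
h = sigmoid(z)      # hidden activations h1..h3
y = W2 @ h          # scalar output

f_prime = h * (1.0 - h)   # sigmoid derivative at each neuron j

# Chain rule: delta_ij = (dy/dh_j) * (dh_j/dW1_ij) = W2_j * f'_j * x_i.
# np.outer builds all six gradients (2 inputs x 3 neurons) at once.
grad = np.outer(x, W2 * f_prime)

# The incremental update is immediate: multiply by alpha and (d - y).
delta_W1 = alpha * (d - y) * grad
W1 = W1 + delta_W1
```

The outer product is just a compact way of writing the six scalar products delta one one through delta two three from the text.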
At the end of the backpropagation part, each element in the weight matrices can be updated by the incremental values we calculated using these two steps. If we have more layers, which is usually the case, we will have more steps, and you can imagine that the process becomes more complicated. Luckily, we have programming tools for that. In all of these calculations we did not emphasize the bias input, as it does not change any of the concepts we covered. As I mentioned before, simply consider the bias as a constant input that is also connected to each of the neurons of the hidden layers by a weight. The only difference between the bias and the other inputs is that it remains the same while each of the other inputs changes.

In this example, for each new input we updated the weights after each calculation of the output. It is often beneficial to update the weights only once every n steps instead. This is called mini-batch training, and it involves averaging the changes to the weights over multiple steps before actually updating the weights. There are two primary reasons for using mini-batch training. The first is to reduce the complexity of the training process, since fewer computations are required. The second, and more important, is that when we average multiple, possibly noisy, changes to the weights, we end up with a less noisy correction. This means that the learning process may actually converge faster and more accurately.
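The mini-batch idea can be sketched as follows: accumulate the per-sample weight changes and apply their average once every n steps. This is an illustrative sketch using the single-output update rule alpha times d minus y times h from earlier; the batch size, activations, and targets are invented example data.

```python
import numpy as np

alpha = 0.1
n = 4                            # mini-batch size: update once every n steps
W2 = np.array([0.3, -0.1, 0.4])  # hidden layer -> output weights

rng = np.random.default_rng(0)
batch_h = rng.random((n, 3))     # n hidden-activation vectors (invented data)
batch_d = rng.random(n)          # n desired outputs (invented data)

accumulated = np.zeros_like(W2)
for h, d in zip(batch_h, batch_d):
    y = W2 @ h                           # output for this sample
    accumulated += alpha * (d - y) * h   # per-sample change, not yet applied

W2 = W2 + accumulated / n        # one averaged update per mini-batch
```

Averaging the n noisy per-sample changes into a single correction is exactly the smoothing effect described above, and the weights are touched only once per batch instead of once per sample.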