Remember our feedforward illustration? We had n inputs, three neurons in the hidden layer, and two outputs. For this example, we will need to simplify things even more and look at a model with two inputs, x_1 and x_2, and a single output, y. We will have a weight matrix, W_1, from the input to the hidden layer; the elements of the matrix will be W_ij, just as before. We will also have a weight vector, W_2, from the hidden layer to the output. Notice that it's a vector and not a matrix, as we only have one output.

Don't forget your pencil and notes. Pause the video any time you need and make sure you're following the math here. We will begin with a feedforward pass of the inputs across the network, then calculate the output, and, based on the error, use backpropagation to calculate the partial derivatives.

Calculating the values of the activations at each of the three hidden neurons is simple. We have a linear combination of the inputs with the corresponding weight elements of the matrix W_1, followed by an activation function. The output is the dot product of the previous layer's activation vector, H, with the weight vector W_2.

As I mentioned before, the aim of the backpropagation process is to minimize the loss function, which in our case is the squared error. To do that, we need to calculate the partial derivatives of the squared error with respect to each of the weights. Since we just found the output, we can now minimize the error by finding the update values, delta W_ij, and of course we need to do so for every layer k. This value equals the negative of alpha multiplied by the partial derivative of the loss function, E, with respect to that weight. Since the error is a polynomial, finding its derivative is immediate. Using basic calculus, we can see that this incremental value is simply the learning rate, alpha, multiplied by d minus y, and by the partial derivative of y with respect to each of the weights. If you're asking yourselves what happened to the desired output, d: well, that was a constant value, so its partial derivative was simply zero. Notice that I used the chain rule here? At the beginning of this video, I mentioned that backpropagation is actually stochastic gradient descent with the use of the chain rule.

Well, now we have our gradient. The symbol that we use for the gradient is usually a lowercase delta. This partial derivative of the calculated output defines the gradient, and we will find it by using the chain rule. So let's pause for a minute and remind ourselves what the chain rule is and how it's used. When you're feeling comfortable with this concept, we will continue with the actual calculation.
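Before we pause, it helps to see the equations we just walked through written out in one place. This is a compact sketch assuming a generic activation function f on the hidden layer, a linear output, and the common one-half convention on the squared error; that one-half factor is an assumption here, and it only rescales the learning rate.

```latex
% Feedforward pass for the 2-3-1 network
h_j = f\Big( \sum_{i=1}^{2} (W_1)_{ij}\, x_i \Big), \qquad j = 1, 2, 3
\qquad
y = \sum_{j=1}^{3} (W_2)_j\, h_j = H \cdot W_2

% Squared-error loss and the weight update (1/2 convention assumed)
E = \tfrac{1}{2}\,(d - y)^2
\qquad
\Delta W_{ij} = -\alpha\,\frac{\partial E}{\partial W_{ij}}
             = -\alpha\,(d - y)\,\frac{\partial (d - y)}{\partial W_{ij}}
             = \alpha\,(d - y)\,\frac{\partial y}{\partial W_{ij}}

% The desired output d is a constant, so \partial d / \partial W_{ij} = 0,
% which is why only the \partial y / \partial W_{ij} term survives.
```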
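The same pass can also be sketched as code. The snippet below is a minimal NumPy illustration of the 2-3-1 network, the forward pass, and the update for the output weights W_2; the sigmoid activation, the random initial weights, the index convention W1[j, i], and the one-half squared-error convention are assumptions made for the example, not something fixed by the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed 2-3-1 architecture; the numeric values are placeholders for illustration.
x  = np.array([0.5, -1.0])          # inputs x_1, x_2
W1 = np.random.randn(3, 2) * 0.1    # hidden weights; W1[j, i] links input i to hidden neuron j
W2 = np.random.randn(3) * 0.1       # output weights (a vector, since there is one output)
d  = 1.0                            # desired output
alpha = 0.1                         # learning rate

# Feedforward pass: linear combination, activation, then a dot product at the output.
H = sigmoid(W1 @ x)                 # hidden activations, shape (3,)
y = W2 @ H                          # single (linear) output

# Squared error (1/2 convention assumed) and its gradient via the chain rule.
E = 0.5 * (d - y) ** 2
dE_dy  = -(d - y)                   # d is constant, so its derivative vanishes
dy_dW2 = H                          # partial of y with respect to each output weight
delta_W2 = -alpha * dE_dy * dy_dW2  # = alpha * (d - y) * H

W2 = W2 + delta_W2                  # gradient-descent update for W_2
print(E, delta_W2)
```

The update for the hidden-layer weights W_1 needs one more application of the chain rule, through the activation function, which is exactly the calculation we continue with after the chain-rule reminder.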