Now that we've completed a feedforward pass, received an output, and calculated the error, we are ready to go backwards in order to change our weights, with the goal of decreasing the network error. Going backwards from the output to the input while changing the weights is a process we call backpropagation, which is essentially stochastic gradient descent with the gradients computed using the chain rule. If you're not familiar or comfortable with backpropagation yet, this section will help you out.

We will now be a little more mathematical. I find it fascinating to see how math comes to life, how mathematical calculations eventually lead us to implementing, in this case, a neural network, which is the main building block of AI. To implement a basic neural network, one doesn't really need a deep mathematical understanding, since we now have open source tools. But to really understand how it works, and to optimize our application, it's always important to know the math. This is where I want to ask you again to use those old-school techniques and write notes as I derive the math. I really believe that as you write your own notes, you will feel more confident with the math, since it will be in your own handwriting. For me, doing this has always been helpful.

Our goal is to find a set of weights that minimizes the network error. So how do we do that? Well, we use an iterative process, presenting the network with one input at a time from our training set. As I mentioned before, during the feedforward pass we calculate the network's error for each input. We can then use this error to slightly change the weights in the correct direction, each time reducing the error by just a bit. We continue to do so until we determine that the error is small enough. So how small is small enough? How do we know if we have a good enough mapping from the inputs to the outputs? Well, there's no simple answer to that. You will find practical solutions to some of these questions in our next section, on overfitting.

Imagine a network with only one weight, W. Assume that at a certain point in the training process the weight has the value W_A, and the network error at point A is E(A). To reduce the error, we need to increase the weight. The gradient, or in other words the derivative, is the slope of the error curve, and at point A it is negative, since the curve points down. So taking a step in the negative direction of the gradient correctly increases the value of W_A. If, on the other hand, the weight has the value W_B, with the network error being E(B), then to reduce the error we need to decrease the weight. If you look at the gradient at point B, you'll see that it's positive. In this case, too, taking a step in the negative direction of the gradient means we are correctly decreasing the value of the weight.

The case we looked at, with a single weight, is an oversimplification of the more practical case, where the neural network has many weights. We can summarize the weight update process with the equation W ← W − α · ∂E/∂W, where alpha (α) is the learning rate, or step size. We can also see that the weight W, for the reasons I just mentioned, changes in the opposite direction of the partial derivative of the error with respect to W. You may ask yourselves, "Why are we looking at partial derivatives?" The answer is simply that the error is a function of many variables, and the partial derivative lets us measure how the error is impacted by each weight separately.
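To make the single-weight picture concrete, here is a minimal sketch in Python. The toy error curve E(W) = (W − 3)² and the two starting points are my own illustrative assumptions, not values from the lesson; point A starts to the left of the minimum (negative slope) and point B to the right of it (positive slope).

```python
# Gradient descent on a single weight W, with an assumed toy error
# curve E(W) = (W - 3)^2 whose minimum sits at W = 3.
def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)  # dE/dW, the slope of the error curve

alpha = 0.1  # learning rate (step size)

for w0, label in [(1.0, "point A"), (5.0, "point B")]:
    w = w0
    for _ in range(50):
        w -= alpha * gradient(w)  # step opposite the gradient
    print(f"{label}: W went from {w0} to {w:.3f}, "
          f"error from {error(w0):.3f} to {error(w):.6f}")
```

Notice that at point A the gradient is negative, so subtracting it increases W, while at point B the gradient is positive, so subtracting it decreases W. Both starting points slide toward the minimum, exactly as described above.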
You can find a good resource on how to tune the learning rate at the end of this video. You will also find a full hyperparameter section following this online lesson.

Since many weights determine the network's output, we collect the partial derivatives of the network error, each taken with respect to a different weight, into a vector. That inverted-triangle symbol, ∇, if you haven't seen it before, denotes the gradient. The gradient is the vector of partial derivatives of the error with respect to each of the weights.

For the purpose of proper notation, we will have quite a few indices here. In this illustration, which should be familiar by now, we are focusing on the connections between layer k and layer k+1. The weight w_ij connects neuron i in layer k to neuron j in layer k+1, just as we've seen before. Let's call the amount by which we change, or update, the weight w_ij: Δw_ij^k. The superscript k indicates that the weight connects layer k to layer k+1, or in other words, that it originates from layer k. Calculating Δw_ij^k is straightforward. It equals the learning rate multiplied by the partial derivative of the error with respect to the weight w_ij, and we take the negative of that term, for the reasons I just mentioned: Δw_ij^k = −α · ∂E/∂w_ij^k. Basically, backpropagation boils down to calculating the partial derivative of the error, E, with respect to each of the weights, and then adjusting each weight by its calculated Δw_ij^k. These calculations are done for each layer.

Let's look at our system one more time. For the error, we will use the loss function E = ½ · (d − y)², which is simply the desired output d minus the network output y, all squared. We also divide this error term by two for a calculation simplicity you will see later: the one-half cancels the factor of two that appears when we differentiate the square. Okay, now that we have all of our equations defined, we can dive into the math with an example.
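Before that worked example, here is a minimal sketch in Python of one such weight update for a single output neuron, using the loss E = ½ · (d − y)² defined above. The sigmoid activation and the specific input, weight, target, and learning-rate values are illustrative assumptions of mine, not values from the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2, 0.1])  # activations of layer k (assumed)
w = np.array([0.4, 0.3, -0.6])  # weights w_ij into one output neuron (assumed)
d = 1.0                          # desired output (assumed)
alpha = 0.5                      # learning rate (assumed)

# Feedforward pass through this one neuron.
z = w @ x
y = sigmoid(z)

# Backward pass: chain rule for dE/dw_ij.
#   dE/dy    = -(d - y)     (the 1/2 cancels the 2 from the square)
#   dy/dz    = y * (1 - y)  (derivative of the sigmoid)
#   dz/dw_ij = x_i
dE_dw = -(d - y) * y * (1.0 - y) * x

# Weight update: delta_w = -alpha * dE/dw, a step against the gradient.
delta_w = -alpha * dE_dw
w = w + delta_w
print("error before update:", 0.5 * (d - y) ** 2)
print("updated weights:", w)
```

Running this feedforward-plus-update loop repeatedly would keep shrinking the error, which is exactly the iterative process this section describes.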