We finished step one, and will now start with step two: finding the output y using the values of h we just calculated. Since we have more than one output, y will be a vector as well. We have our inputs h, and want to find the values of the output y. Mathematically, the idea is identical to what we just saw in step one; we simply have different inputs, which we call h, and a different weight matrix, which we call W2. The output will be the vector y. Notice that the weight matrix has three rows, as we have three neurons in the hidden layer, and two columns, as we have only two outputs. And again, we have a vector-by-matrix multiplication: the vector h, multiplied by the weight matrix W2, gives us the output vector y. We can put it in a simple equation, y = h · W2.

Once we have the outputs, we don't necessarily need an activation function. In some applications it can be beneficial to use, for example, a softmax function, which we denote σ(x), if we want the output values to be between zero and one. You can find more information on this topic in the text after this video.

To get a good approximation of the output y, we need more than one hidden layer, maybe even 10 or more. In this picture, I use the general number P. The number of neurons in each layer can change from one layer to the next and, as I mentioned before, can even be in the thousands. Essentially, you can look at these neurons as building blocks, or Lego pieces, that can be stacked. So how is this done mathematically? Just as before: a simple vector-by-matrix multiplication followed by an activation function, where the vector holds the inputs and the matrix holds the weights connecting one layer to the next. To generalize this model, let's look at an arbitrary layer k. The weight W_ij, from layer k to layer k+1, connects the i-th input to the j-th output.
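The step-two calculation above can be sketched in Python with NumPy. The specific values of h and W2 below are invented for illustration; only the shapes follow the example (three hidden neurons, two outputs). The softmax is shown as the optional final squashing mentioned above.

```python
import numpy as np

# Hidden-layer outputs from step one (values invented for illustration):
h = np.array([0.3, 0.7, 0.5])          # 3 hidden neurons

# Weight matrix W2: 3 rows (one per hidden neuron), 2 columns (one per output).
W2 = np.array([[0.2, -0.1],
               [0.4,  0.3],
               [-0.5,  0.6]])

# Step two: vector-by-matrix multiplication, y = h · W2.
y = h @ W2

# Optional: a softmax squashes the outputs into values between 0 and 1
# that sum to one.
def softmax(z):
    e = np.exp(z - z.max())            # shift by the max for numerical stability
    return e / e.sum()

y_soft = softmax(y)
```

Because h has 3 entries and W2 has 3 rows and 2 columns, the product y has 2 entries, one per output neuron, exactly as in the picture.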
You might want to get a pencil for this one, as these are important mathematical derivations. As you follow the math, pause the video and write the derivations in your notes. Try to do so throughout the entire lesson.

Let's treat layer k as the inputs x, and layer k+1 as the outputs of the hidden layer, h. We will have n inputs and m outputs. By the way, if we are dealing not with the inputs but with the outputs of the previous hidden layer, the only thing that changes is the letter we choose to use; the calculations are the same. So to keep the notation simple, let's just stay with x.

h_1 is the output of an activation function applied to a sum, where the sum is the multiplication of each input x_i by its corresponding weight component W_i1. In the same way, h_m is the output of an activation function applied to a sum, where the sum is the multiplication of each input x_i by its corresponding weight component W_im. For example, if we have three inputs and want to calculate h_1, it will be the output of an activation function applied to the following linear combination: h_1 = f(x_1·W_11 + x_2·W_21 + x_3·W_31). These single-element calculations will be helpful in understanding backpropagation, which is why we want to understand them as well. But as before, we can also view these calculations as a vector-by-matrix multiplication.

By the way, you probably noticed that I didn't emphasize the bias input here. The bias does not change any of these calculations. Simply consider it as a constant input, usually one, that is also connected to each of the neurons of the hidden layer by a weight. The only difference between the bias and any other input is that it stays the same while the other inputs change. And just like all the other inputs, the weights connecting it to the next layer are updated as well.

So let's stop for a minute. What is our goal again? Our goal is to find the best set of weights, the ones that will eventually give us the desired output. Right?
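As a sketch of the per-layer computation just described, here is the hidden-layer step in NumPy. The input values, the weight matrix, and the choice of a sigmoid as the activation function are all assumptions for illustration (the lesson leaves the activation f unspecified); the bias is handled exactly as described, as a constant extra input of one with its own row of weights.

```python
import numpy as np

def sigmoid(z):
    # One common choice of activation function (an assumption here).
    return 1.0 / (1.0 + np.exp(-z))

# Three inputs plus the constant bias input of 1 (values made up):
x = np.array([0.5, -1.0, 2.0, 1.0])    # last entry is the bias input

# Weights: 4 rows (3 inputs + bias row), m = 2 hidden neurons.
W = np.array([[ 0.1,  -0.3],
              [ 0.8,   0.2],
              [-0.5,   0.4],
              [ 0.05, -0.1]])           # last row: the bias weights

# Element by element: h_j = f(sum_i x_i * W_ij).
# As a whole: a vector-by-matrix multiplication followed by the activation.
h = sigmoid(x @ W)
```

Note that updating the bias weights during training needs no special handling: the bias row of W is just another row of the weight matrix.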
In the end, we want a system that gives us the correct output for a specific input. In the training phase, we actually know the desired output for a given input. We calculate the output of the system in order to adjust the weights. We do that by finding the error and trying to minimize it. Each iteration of the training phase decreases the error a little, until we eventually decide that the error is small enough. Let's focus on an intuitive error calculation, which is simply the difference between the calculated and the desired output. This is our basic error. For our backpropagation calculations, we will use the squared error, which is also called the loss function.
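The error step can be sketched with made-up numbers: the basic error is the difference between the desired and calculated outputs, and the loss used for backpropagation is its square, summed over the outputs (some texts also add a factor of 1/2 for convenience; that is a convention, not something the lesson specifies).

```python
import numpy as np

y_calculated = np.array([0.8, 0.3])   # outputs produced by the network (made up)
y_desired    = np.array([1.0, 0.0])   # known training targets (made up)

# Basic error: the difference between desired and calculated outputs.
error = y_desired - y_calculated

# Squared error: the loss we try to minimize during training.
loss = np.sum(error ** 2)
```

Each training iteration would nudge the weights so that this loss shrinks a little, which is exactly the "decrease the error just a bit" loop described above.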