So now that we have defined what neural networks are, we need to learn how to train them. Training them really means finding what parameters they should have on the edges in order to model our data well. To learn how to train them, we need to look carefully at how they process the input to obtain an output. So let’s look at our simplest neural network, a perceptron. This perceptron receives a data point of the form (x1, x2) where the label is y = 1. This means that the point is blue. Now the perceptron is defined by a linear equation, say w1x1 + w2x2 + b, where w1 and w2 are the weights on the edges and b is the bias in the node. Here, w1 is bigger than w2, so we’ll denote that by drawing the edge labelled w1 much thicker than the edge labelled w2. Now, what the perceptron does is plot the point (x1, x2) and output the probability that the point is blue. Here the point is in the red area, so the output is a small number, since the point is not very likely to be blue. This process is known as feedforward. We can see that this is a bad model because the point is actually blue, given that the third coordinate, y, is one.
Now if we have a more complicated neural network, the process is the same. Here, we have thick edges corresponding to large weights and thin edges corresponding to small weights, and the neural network plots the point in the top graph and also in the bottom graph. The output coming out of the top model will be a small number, since the point lies in the red area, which means it has a small probability of being blue. The output of the second model will be a large number, since the point lies in the blue area, which means it has a large probability of being blue. Now the two models get combined into this nonlinear model, and the output layer just plots the point and tells us the probability that the point is blue. As you can see, this is a bad model because it puts the point in the red area and the point is blue.
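To make the perceptron step concrete, here is a minimal sketch in Python. It assumes the perceptron turns its linear score into a probability with a sigmoid function. The weights, bias, and data point are made-up illustrative values, chosen only so that w1 is bigger than w2 and the point lands on the red side.

```python
import math

def sigmoid(z):
    """Squash any real number into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_forward(x1, x2, w1, w2, b):
    """Feedforward for a single perceptron: linear score, then sigmoid."""
    score = w1 * x1 + w2 * x2 + b
    return sigmoid(score)

# Hypothetical parameters (not from the lecture): w1 is larger than w2.
w1, w2, b = 4.0, 1.0, -6.0

# Hypothetical data point with label y = 1 (blue).
x1, x2, y = 0.5, 1.0, 1

y_hat = perceptron_forward(x1, x2, w1, w2, b)
print(f"predicted probability of blue: {y_hat:.4f}, true label: {y}")
```

With these assumed numbers the linear score is negative, so the predicted probability of blue is well below one half even though the label is 1. That is the situation described above: a bad model for this point.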
Again, this process is called feedforward, and we’ll look at it more carefully. Here, we have our neural network in the other notation, where the bias is on the outside. Now we have a matrix of weights: the matrix W superscript (1), denoting the first layer, whose entries are the weights W11 up to W32. Notice that the biases have now been written as W31 and W32; this is just for convenience. Now in the next layer, we also have a matrix; this one is W superscript (2), for the second layer. This layer contains the weights that tell us how to combine the linear models in the first layer to obtain the nonlinear model in the second layer. Now what happens is some math. We have the input in the form (x1, x2, 1), where the one comes from the bias unit. Now we multiply it by the matrix W(1) to get these outputs. Then, we apply the sigmoid function to turn the outputs into values between zero and one. Then the vector formed by these values gets a one attached for the bias unit and is multiplied by the second matrix. This returns an output that now gets thrown into a sigmoid function to obtain the final output, which is y-hat. Y-hat is the prediction, or the probability that the point is labeled blue. So this is what neural networks do. They take the input vector and then apply a sequence of linear models and sigmoid functions. These maps, when combined, become a highly nonlinear map. And the final formula is simply y-hat equals sigmoid of W(2) applied to sigmoid of W(1) applied to x.
Just for redundancy, we do this again on a multi-layer perceptron or neural network. To calculate our prediction y-hat, we start with the input vector x, then we apply the first matrix and a sigmoid function to get the values in the second layer. Then, we apply the second matrix and another sigmoid function to get the values in the third layer, and so on and so forth until we get our final prediction, y-hat.
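Written out, the two-layer computation described above looks roughly like this. The entry names for the second matrix and the intermediate symbols z and h are assumptions made here for illustration; the lecture only names the entries of the first matrix.

```latex
\begin{aligned}
\begin{pmatrix} z_1 & z_2 \end{pmatrix}
  &= \begin{pmatrix} x_1 & x_2 & 1 \end{pmatrix}
     \begin{pmatrix}
       W^{(1)}_{11} & W^{(1)}_{12} \\
       W^{(1)}_{21} & W^{(1)}_{22} \\
       W^{(1)}_{31} & W^{(1)}_{32}
     \end{pmatrix} \\[4pt]
h_j &= \sigma(z_j), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\[4pt]
\hat{y}
  &= \sigma\!\left(
     \begin{pmatrix} h_1 & h_2 & 1 \end{pmatrix}
     \begin{pmatrix}
       W^{(2)}_{11} \\ W^{(2)}_{21} \\ W^{(2)}_{31}
     \end{pmatrix}
     \right)
\end{aligned}
```

The last row of each matrix holds the biases, which is why a 1 is appended to the vector before each multiplication.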
And this is the feedforward process that the neural networks use to obtain the prediction from the input vector.
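As a rough sketch of the whole process, the loop below repeats the same three steps for every layer: attach a 1 for the bias unit, multiply by that layer's weight matrix, and apply the sigmoid. The function name `feedforward`, the list-of-lists layout with the biases in the last row, and all the numbers are assumptions chosen for illustration, not values from the lecture.

```python
import math

def sigmoid(z):
    """Squash any real number into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(x, weight_matrices):
    """Run the input vector through every layer in turn.

    Each matrix is a list of rows. Row i holds the weights leaving input i,
    and the last row holds the biases (the weights on the bias unit).
    """
    values = list(x)
    for W in weight_matrices:
        augmented = values + [1.0]  # attach the bias unit
        if len(W) != len(augmented):
            raise ValueError("matrix needs one row per input plus a bias row")
        n_out = len(W[0])
        # Row vector times matrix, then sigmoid on every entry.
        values = [
            sigmoid(sum(a * W[i][j] for i, a in enumerate(augmented)))
            for j in range(n_out)
        ]
    return values

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[4.0, -2.0],
      [1.0, 3.0],
      [-6.0, 1.0]]   # last row: biases of the first layer
W2 = [[2.0],
      [1.5],
      [-1.0]]        # last row: bias of the second layer

y_hat = feedforward([0.5, 1.0], [W1, W2])[0]
print(f"y_hat = {y_hat:.4f}")
```

Adding more layers only means adding more matrices to the list; the loop itself does not change.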