Let’s look at the feedforward part first. To make our computations easier, let’s decide to have n inputs, three neurons in a single hidden layer, and two outputs. By the way, in practice, we can have thousands of neurons in a single hidden layer. We will use W_1 as the set of weights from x to h, and W_2 as the set of weights from h to y. Since we have only one hidden layer, we will have only two steps in each feedforward cycle: step one, finding h from a given input and the set of weights W_1, and step two, finding the output y from the calculated h and the set of weights W_2. You will find that, other than the use of nonlinear activation functions, which I will talk about soon, all of the calculations involve linear combinations of inputs and weights. In other words, we will use matrix multiplications.

Let’s start with step number one, finding h. Notice that if we have more than one neuron in the hidden layer, which is usually the case, h is actually a vector. We have our initial inputs x, also a vector, and we want to find the values of the hidden neurons, h. Each input is connected to each neuron in the hidden layer. For simplicity, we will use the following indices: W_11 connects x_1 to h_1, W_13 connects x_1 to h_3, W_21 connects x_2 to h_1, W_n3 connects x_n to h_3, and so on. The vector of inputs x_1, x_2, all the way up to x_n, is multiplied by the weight matrix W_1 to give us the hidden neurons. So the vector h′ equals the vector x multiplied by the weight matrix W_1. In this case, we have a weight matrix with n rows, as we have n inputs, and three columns, as we have three neurons in the hidden layer. If you multiply the input vector by the weight matrix, you will get a simple linear combination for each neuron in the hidden layer, giving us the vector h′. So for example, h′_1 will be x_1 times W_11, plus x_2 times W_21, and so on. But we are not done with calculating the hidden layer yet.
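Step one can be sketched in a few lines of NumPy. This is my own minimal sketch, not from the lecture: the concrete values of x and the choice of n = 4 are arbitrary assumptions for illustration, and the weights are just random.

```python
import numpy as np

n = 4  # the n inputs from the example (arbitrary choice here)

x = np.array([0.5, -1.0, 0.25, 2.0])  # input vector, shape (n,)
W1 = np.random.randn(n, 3) * 0.1      # weight matrix: n rows, 3 columns

# Each hidden value is a linear combination of inputs and weights,
# e.g. h'_1 = x_1*W_11 + x_2*W_21 + ... + x_n*W_n1.
h_prime = x @ W1                      # shape (3,): one value per hidden neuron
```

Note that `x @ W1` computes all three linear combinations at once, which is exactly why the feedforward step reduces to a matrix multiplication.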
Notice the prime symbol I’ve been using? I used it to remind us that we are not done with finding h yet. We’re almost there, but not quite. To make sure that the values of h do not explode or grow too large, we need to apply an activation function, usually denoted by the Greek letter phi. We can use a hyperbolic tangent; this function ensures that our outputs are between negative one and one. We can also use a sigmoid; this function ensures that our outputs are between zero and one. Or we can use a rectified linear unit, in short a ReLU, where the negative values are nulled and the positive values remain as they are. Each activation function has its advantages and disadvantages. What they all share is that they allow the network to represent nonlinear relationships between its inputs and its outputs, and this is very important, since most real-world data is nonlinear. Mathematically, the linear combination and activation function can simply be written as h equals phi of the input vector multiplied by the corresponding weight matrix: h = φ(x W_1). Using these functions can be a bit tricky, as they contribute to the vanishing gradient problem that we mentioned before. But more on this later.
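The three activation functions, and the full two-step feedforward cycle h = φ(x W_1), y = φ(h W_2), can be sketched as follows. Again this is a sketch under assumptions: the function names are mine, the weights are random, and I’ve picked the sigmoid as φ and three inputs just to make the shapes concrete.

```python
import numpy as np

def tanh(z):      # outputs between -1 and 1
    return np.tanh(z)

def sigmoid(z):   # outputs between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):      # negative values nulled, positive values unchanged
    return np.maximum(0.0, z)

x = np.array([0.5, -1.0, 0.25])    # 3 inputs (arbitrary example values)
W1 = np.random.randn(3, 3) * 0.1   # inputs -> 3 hidden neurons
W2 = np.random.randn(3, 2) * 0.1   # 3 hidden neurons -> 2 outputs

h = sigmoid(x @ W1)   # step one: linear combination, then phi
y = sigmoid(h @ W2)   # step two: same pattern with W_2
```

Swapping `sigmoid` for `tanh` or `relu` changes only the range of the hidden values, not the structure of the two steps.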