The Long Short-Term Memory cell, or LSTM cell, was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. The goal of the cell is to overcome the vanishing gradient problem. You will see that it allows certain inputs to be latched, or stored, for long periods of time without being forgotten, as would happen in plain RNNs. When calculating the gradient using backpropagation through time, the gradients stemming from many timesteps back can become negligible, and for the same reason the partial derivative of the error can become negligible as well. This is the fundamental problem in RNNs. You will see that in LSTMs we avoid this loss of information, the vanishing gradient problem, by intentionally latching on to some information over many timesteps, which makes remembering information over long periods of time much easier. Many real-world applications, such as Google's language translation tools or Amazon's Alexa, are powered by LSTMs. In fact, most of the RNN applications we mentioned are moving towards implementations using LSTMs.

To understand the difference between LSTMs and RNNs, let's look at the RNN system, zoom in on one neuron in a hidden layer, and recall how we calculated s(t+1). Zooming in a bit more will help us remember how the calculations are done. As you may recall, the next state was calculated through a simple activation function, say a hyperbolic tangent, applied to a linear combination of the different inputs and their corresponding weight matrices. The output was calculated as a simple linear combination as well. Using LSTMs, we no longer have these basic computations in a single neuron. Zooming out again, our system will have a very similar layout: the neurons of the hidden states are replaced with LSTM cells, which can be stacked like Lego pieces, just as before. If we zoom in on one cell, we will find that we no longer have a single calculation, but four separate ones.
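The four calculations inside one cell can be sketched in code. This is a minimal illustration using NumPy; the weight names (W_f, W_i, W_c, W_o), the sizes, and the random initialization are assumptions for demonstration, not the exact notation used in the lectures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # illustrative sizes

# One weight matrix and bias per computation; each sees [h_prev, x] concatenated.
W_f, b_f = rng.normal(size=(n_hid, n_hid + n_in)), np.zeros(n_hid)  # forget gate
W_i, b_i = rng.normal(size=(n_hid, n_hid + n_in)), np.zeros(n_hid)  # input gate
W_c, b_c = rng.normal(size=(n_hid, n_hid + n_in)), np.zeros(n_hid)  # candidate state
W_o, b_o = rng.normal(size=(n_hid, n_hid + n_in)), np.zeros(n_hid)  # output gate

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # 1) which parts of the old state to forget
    i = sigmoid(W_i @ z + b_i)        # 2) which new information to let in
    c_tilde = np.tanh(W_c @ z + b_c)  # 3) candidate new information
    c = f * c_prev + i * c_tilde      # latch: old state is kept wherever f is near 1
    o = sigmoid(W_o @ z + b_o)        # 4) which parts of the state to output
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
x = rng.normal(size=n_in)
h, c = lstm_step(x, h, c)
```

Note that the cell state c is updated only by elementwise multiplication and addition, which is what lets information flow across many timesteps without vanishing.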
This is our LSTM cell, with its first, second, third, and fourth calculations. The LSTM network allows a recurrent system to learn over many timesteps while training with the same backpropagation principles. In this case we are no longer limited to a few timesteps back, as we were with RNNs, but can consider well over 1,000. The cell is fully differentiable, meaning that all of its functions have a gradient, or derivative, that we can calculate. These functions are a sigmoid, a hyperbolic tangent, multiplication, and addition. This allows us to use backpropagation, or stochastic gradient descent, when updating the weights. The main idea behind LSTM cells is that they can decide which information to remove or forget, which information to store, and when to use it. The cell can also help decide when to move the previous state's information to the next.

We just saw that the LSTM cell has three sigmoids. The output of each sigmoid is between zero and one. Having the data flow through a sigmoid intuitively answers the following question: do we let all the data flow through, which happens when the output of the sigmoid is one or close to it, or do we force the output to zero, so that none of the data flows through, which happens when the output of the sigmoid is zero or close to it? These three sigmoids act as a mechanism to filter what goes into the cell, if anything, what is retained within the cell, and what passes through to its output. The key idea in LSTMs is that these three gating functions are also trained through backpropagation by adjusting the weights that feed into them. Our next set of videos will help you understand LSTMs further.
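The gating behavior of the sigmoids can be seen directly with a small numerical sketch. The specific input values here are arbitrary assumptions chosen only to push the sigmoid toward its two extremes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

data = np.array([2.0, -1.5, 0.5])  # some values flowing through the cell

# Large positive pre-activations drive the sigmoid toward 1: the gate is open.
open_gate = sigmoid(np.array([10.0, 10.0, 10.0]))
# Large negative pre-activations drive the sigmoid toward 0: the gate is closed.
closed_gate = sigmoid(np.array([-10.0, -10.0, -10.0]))

gated_open = open_gate * data      # close to the original data
gated_closed = closed_gate * data  # close to all zeros
```

Because the gate multiplies the data elementwise, a value near one passes the data through almost unchanged, while a value near zero blocks it; during training, backpropagation adjusts the weights feeding each sigmoid so the cell learns when to open and close each gate.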