Okay so, let’s say we have a regular neural network which recognizes images and we fitted this image. And the neural neural network guesses that the image is most likely a dog with a small chance of being a wolf and an even smaller chance of being a goldfish. But, what if this image is actually a wolf? How would the neural network know? So, let’s say we’re watching a TV show about nature and the previous image before the wolf was a bear and the previous one was a fox. So, in this case, we want to use this information to hint to us that the last image is a wolf and not a dog. So, what we do is analyze each image with the same copy of a neural network. But, we use the output of the neural network as a part of the input of the next one. And, that will actually improve our results. Mathematically, this is simple. We just combine the vectors in a linear function, which will then be squished with an activation function, which could be sigmoid or hyperbolic tan. This way, we can use previous information and the final neural network will know that the show is about wild animals in the forest and actually use this information to correctly predict that the image is of a wolf and not a dog. And, this is basically how recurrent neural networks work. However, this has some drawbacks. Let’s say the bear appeared a while ago and the two recent images are a tree and a squirrel. Based on those two, we don’t really know if the new image is a dog or a wolf. Since trees and squirrels are just as associated to domestic animals as they are with forest animals. So, the information about being in the forest comes all the way back from the bear. But, as we’ve already experienced, information coming in gets repeatedly squished by sigmoid functions and even worse than that, training a network using back propagation all the way back, will lead to problems such as the vanishing gradient problem etc. So, by this point pretty much all the bear information has been lost. That’s a problem with recurring neural networks; that the memory that is stored is normally short term memory. RNNs, have a hard time storing long term memory and this is where LSTMs or long short term memory networks will come to the rescue. So, as a small summary, an RNN works as follows; memory comes in and merges with a current event and the output comes out as a prediction of what the input is. And also, as part of the input for the next iteration of the neural network. And in a similar way, an LSTM works as follows; it keeps track not just of memory but of long term memory, which comes in and comes out. And also, short term memory, which also comes in and comes out. And in every stage, the long and short term memory in the event get merged. And from there, we get a new long term memory, short term memory and a prediction. In here, we protect old information more. If we deem it necessary, the network can remember things from long time ago. So, in the next few videos, I will show you the architecture of LSTMs and how they work.