So let’s recap. We have the following problem: we are watching a TV show, and we have a long-term memory, which is that the show is about nature and science and that lots of forest animals have appeared. We also have a short-term memory, which is what we have recently seen: squirrels and trees. And we have a current event, which is what we just saw: the image of a dog, which could also be a wolf. We want these three things to combine to form a prediction of what our image is. In this case, the long-term memory, which says that the show is about forest animals, gives us a hint that the picture is of a wolf and not a dog.

We also want the three pieces of information, long-term memory, short-term memory, and the event, to help us update the long-term memory. So let’s say we keep the fact that the show is about nature and we forget that it’s about science. And we also remember that the show is about forest animals and trees, since we recently saw a tree. So we add a bit to the long-term memory and remove a bit from it. And finally, we also want to use these three pieces of information to help us update the short-term memory. So let’s say that in our short-term memory we want to forget that the show has trees and remember that it has wolves, since the trees happened a few images ago and we just saw a wolf.

So basically we have an architecture like this, and we use even more animals to represent our stages of memory. The long-term memory is represented by an elephant, since elephants are known for their long memories. The short-term memory is represented by a forgetful fish, and the event is still represented by the wolf we just saw. So an LSTM works as follows: the three pieces of information go inside the node, some math happens, and then the updated pieces of information come out: a new long-term memory, a new short-term memory, and the prediction of the event. More specifically, the architecture of the LSTM contains a few gates.
It contains a forget gate, a learn gate, a remember gate, and a use gate. And here’s basically how they work. The long-term memory goes to the forget gate, where it forgets everything that it doesn’t consider useful. The short-term memory and the event are joined together in the learn gate, which holds the information that we’ve recently learned and removes any unnecessary information. Now the long-term memory that we haven’t forgotten yet plus the new information that we’ve learned get joined together in the remember gate. This gate puts these two together and, since it’s called the remember gate, it outputs an updated long-term memory. So this is what we’ll remember for the future. And finally, the use gate is the one that decides what information we use from what we previously know plus what we just learned to make a prediction. So it also takes those inputs, the long-term memory and the new information, joins them, and decides what to output. The output becomes both the prediction and the new short-term memory.

And so the big unfolded picture that we have is as follows: the long-term memory and the short-term memory, which we call LTM and STM, come into the node, an event comes in, and an output comes out, along with the updated LTM and STM. These then pass to the next node, and so on and so forth. So in general, at time t we label everything with a subscript t, and as we can see, information passes from time t - 1 to time t.
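To make the flow through the four gates a bit more concrete, here is a minimal sketch in plain Python. The talk only says that "some math happens" inside the node, so the math below is an assumption: it follows the commonly used LSTM equations (sigmoid gates, tanh candidates), with the use gate reading the updated long-term memory. The names `init_params`, `lstm_step`, `run_lstm`, and the `"ignore"` layer inside the learn gate are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, inputs):
    """Return weights @ inputs + bias using plain Python lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def init_params(event_size, memory_size, seed=0):
    """Small random weights for the four layers (hypothetical layout)."""
    rng = random.Random(seed)
    in_size = memory_size + event_size

    def layer():
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_size)]
                   for _ in range(memory_size)]
        return weights, [0.0] * memory_size

    return {name: layer() for name in ("forget", "learn", "ignore", "use")}


def lstm_step(params, ltm_prev, stm_prev, event):
    """One node: (LTM, STM, event) at t-1 -> (LTM, STM) at t."""
    joined = stm_prev + event  # join short-term memory and the event

    # Forget gate: keep only a fraction of each long-term memory entry.
    forget = [sigmoid(z) for z in affine(*params["forget"], joined)]
    kept = [l * f for l, f in zip(ltm_prev, forget)]

    # Learn gate: propose new information, then drop the unnecessary part.
    candidate = [math.tanh(z) for z in affine(*params["learn"], joined)]
    ignore = [sigmoid(z) for z in affine(*params["ignore"], joined)]
    learned = [c * i for c, i in zip(candidate, ignore)]

    # Remember gate: what was kept plus what was learned is the new LTM.
    ltm = [k + n for k, n in zip(kept, learned)]

    # Use gate: decide what to output; this is also the new STM.
    use = [sigmoid(z) for z in affine(*params["use"], joined)]
    stm = [u * math.tanh(l) for u, l in zip(use, ltm)]
    return ltm, stm


def run_lstm(params, events, memory_size):
    """Unfolded picture: pass LTM and STM from time t-1 to time t."""
    ltm = [0.0] * memory_size
    stm = [0.0] * memory_size
    outputs = []
    for event in events:
        ltm, stm = lstm_step(params, ltm, stm, event)
        outputs.append(stm)  # the output at each step is the new STM
    return outputs, ltm, stm
```

For example, calling `run_lstm(init_params(3, 2), events, 2)` with a list of three-element events returns one output per event, plus the final LTM and STM. In a real network the weights would be learned during training rather than drawn at random.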