Two main choices we need to make when we want to build a recurrent neural network are the cell type (a long short-term memory cell, a vanilla RNN cell, or a gated recurrent unit cell) and how deep the model is: how many layers will we stack? And since we’ll need word embeddings if our inputs are words, we’ll also look at embedding dimensionality.

In practice, LSTMs and GRUs perform better than vanilla RNNs. That much is clear. So which of the two should you use? While LSTMs seem to be more commonly used, it really depends on the task and the dataset. Multiple research papers comparing the two did not announce a clear winner.

Let’s take an example task, character-level language modeling. A great paper titled Visualizing and Understanding Recurrent Networks compares the two on two different datasets. On the first dataset, GRUs did better than LSTMs, better here meaning lower cross-entropy loss when comparing them at different sizes. On a different dataset, the two are tied, each scoring better at different sizes. In the text below the video we elaborate more on this comparison. But the recommendation here is to try both on your dataset and task and compare. Note that you don’t have to test this on your entire dataset. You can try it on a random subset of your data.

Regarding the number of layers, results for character-level language modeling show that a depth of at least two is beneficial, but increasing it to three gives mixed results. Other tasks, like advanced speech recognition, can show improvements with five and even seven layers, often without LSTM cells. In the text below the video, we have provided a number of example architectures for different tasks along with the sizes and number of layers for each example.

If your RNN will be using words as inputs, then you’ll need to choose a size for the embedding vector. How do we go about choosing this number?
Experimental results reported in a paper titled How to Generate a Good Word Embedding show that the performance of some tasks improves the larger we make the embedding, at least until a size of 200. In other tests, however, only marginal improvements are realized beyond a size of 50.
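
To get a feel for how cell type, depth, and embedding size trade off against each other, it can help to count parameters before training anything. The sketch below is only a rough illustration, not taken from the papers mentioned: it assumes the textbook formulation in which each weight block has an input matrix, a recurrent matrix, and a single bias vector (library implementations may count biases differently), and the function names and example sizes are hypothetical.

```python
# Rough parameter counts for an embedding layer followed by stacked
# recurrent layers. Assumption: one bias vector per weight block, as in
# the textbook equations; real libraries may differ slightly.

# Number of (input matrix, recurrent matrix, bias) blocks per cell type:
# a vanilla RNN has one, a GRU has three, an LSTM has four.
WEIGHT_BLOCKS = {"rnn": 1, "gru": 3, "lstm": 4}


def cell_params(cell_type, input_size, hidden_size):
    """Parameters in one recurrent layer of the given cell type."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return WEIGHT_BLOCKS[cell_type] * per_block


def model_params(cell_type, vocab_size, embedding_dim, hidden_size, num_layers):
    """Embedding table plus num_layers stacked recurrent layers."""
    total = vocab_size * embedding_dim  # embedding lookup table
    input_size = embedding_dim          # first layer reads the embeddings
    for _ in range(num_layers):
        total += cell_params(cell_type, input_size, hidden_size)
        input_size = hidden_size        # deeper layers read the layer below
    return total


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    for cell in ("rnn", "gru", "lstm"):
        for emb in (50, 200):
            n = model_params(cell, vocab_size=10_000, embedding_dim=emb,
                             hidden_size=128, num_layers=2)
            print(f"{cell:4s} emb={emb:3d} layers=2 -> {n:,} parameters")
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, and with a large vocabulary the embedding table can easily outweigh the recurrent layers, which is one reason the embedding size is worth choosing deliberately.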