We wanted to define a character RNN with a two-layer LSTM. Here in my solution, I'm running this code on GPU, and here's my code for defining our character-level RNN. First, I defined an LSTM layer, self.lstm. This takes in an input size, which is going to be the length of a one-hot encoded input character, and that's just the number of all of my unique characters. Then, it takes a hidden dimension, a number of layers, and a dropout probability that we've specified. Remember that this will create a dropout layer in between multiple LSTM layers, and all of these are parameters that are going to be passed in as input to our RNN when it's constructed. Then, I've set batch_first to true because when we created our batched data, the first dimension is the batch size, rather than the sequence length.

Next, I've defined a dropout layer to go in between my LSTM and a final linear layer. Then, I have self.fc, my final fully-connected linear layer. This takes in our LSTM outputs, which have our hidden dimension as their size, and it's going to output our character class scores for the most likely next character. So, these are the class scores for each possible next character, and this output size is the same size as our input, the length of our character vocabulary.

Then, I move to the forward function. I'm passing my input x and a hidden state to my LSTM layer here. This produces my LSTM output and a new hidden state. I'm going to pass the LSTM output through the dropout layer that I defined here to get a new output. Then, I'm making sure to reshape this output so that the last dimension is our hidden dimension. This negative one basically means I'm going to be stacking up the outputs of the LSTM. Finally, I'm passing this reshaped output to the final fully-connected layer. Then, I'm returning this final output and the hidden state that was generated by our LSTM. These two functions, in addition to the init_hidden function, complete my model.

Next, it's time to train, so let's take a look at the training loop that was provided. This function takes in a model to train, some data, the number of epochs to train for, and a batch size and sequence length that define our mini-batches. It also takes in a few more training parameters. First, in here, I've defined my optimizer and my loss function. The optimizer is a standard Adam optimizer with a learning rate set to the passed-in learning rate up here. The loss function is cross-entropy loss, which is useful when we're outputting character class scores.

Here, you'll see some details about creating some validation data and moving our model to GPU if it's available. Here, you can see the start of our epoch loop. At the start of each epoch, I'm initializing the hidden state of our LSTM. Recall that this takes in the batch size of our data to define the size of the hidden state, and it returns a hidden and a cell state that are all zeros. Then, inside this epoch loop, I have my batch loop. This is getting our x and y mini-batches from our get_batches generator. Remember that this function basically iterates through our encoded data and returns batches of inputs x and targets y. I'm then converting the input into a one-hot encoded representation, and I'm converting both x and y, our inputs and targets, into tensors that can be seen by our model. If a GPU is available, I'm moving those inputs and targets to our GPU device. The next thing that you see is making sure that we detach any passed-in hidden state from its history.
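Before going through the rest of the batch loop, here's a minimal sketch pulling the model definition together. The class name CharRNN, the constructor argument names, and the default values are my assumptions based on this walkthrough, not necessarily the notebook's verbatim code:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, tokens, n_hidden=512, n_layers=2, drop_prob=0.5):
        super().__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.chars = tokens  # the unique characters in the text

        # LSTM: input size is the length of a one-hot encoded character;
        # dropout here is applied in between the stacked LSTM layers
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers,
                            dropout=drop_prob, batch_first=True)
        # dropout layer between the LSTM and the final linear layer
        self.dropout = nn.Dropout(drop_prob)
        # fully-connected layer producing class scores for each character
        self.fc = nn.Linear(n_hidden, len(self.chars))

    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)   # LSTM output and new hidden state
        out = self.dropout(out)
        # stack up the LSTM outputs: shape (batch_size * seq_length, n_hidden)
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)                   # character class scores
        return out, hidden

    def init_hidden(self, batch_size):
        # hidden state and cell state, both all zeros
        weight = next(self.parameters()).data
        return (weight.new_zeros(self.n_layers, batch_size, self.n_hidden),
                weight.new_zeros(self.n_layers, batch_size, self.n_hidden))
```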
Back in the batch loop: recall that the hidden state of an LSTM layer is a tuple, and so here, we detach it by taking the data of each tensor in that tuple. Then, we proceed with backpropagation as usual. We zero out any accumulated gradients and pass our input tensors to our model, along with the latest hidden state. This returns a final output and a new hidden state. Then, we calculate the loss by comparing the predicted output and the targets. Recall that in the forward function of our model, I stacked the batch size and sequence length of our LSTM outputs into one dimension, and so I'm doing the same thing for our targets here. Then, we're performing backpropagation and moving one step in the right direction, updating the weights of our network.

Now, before the optimization step, I've added one line of code that may look unfamiliar: I'm calling clip_grad_norm. This kind of LSTM model has one main problem with gradients: they can explode and get really, really big. So, what we can do is clip the gradients. We just set some clip threshold, and then, if the gradient is larger than that threshold, we set it to that clip threshold. In code, we do this by just passing in the parameters and the value that we want to clip the gradients at. In this case, this value is passed into our train function as the value 5. Okay. So, we take a backward step, then we clip our gradients, and we perform an optimization step. At the end here, I'm doing something very similar for processing our validation data, except not performing the backpropagation step. Then, I'm printing out some statistics about our loss.

Now, with this train function defined, I can go about instantiating and training a model. In the exercise notebook, I've left these hyperparameters for you to define. I've set our hidden dimension to a value of 512 and the number of layers to 2, which we talked about before. Then, I've instantiated our model and printed it out, and we can see that we have 83 unique characters as input, 512 as a hidden dimension, and two layers in our LSTM. For our dropout layer, we have the default dropout value of 0.5, and for our last fully-connected layer, we have our input features, which is the same as the hidden dimension, and our output features, the number of characters.

Then, there are more hyperparameters that define our batch size, sequence length, and number of epochs to train for. Here, I've set the sequence length to 100, which is a lot of characters, but it gives our model a great deal of context to learn from. I also want to note that the hidden dimension is basically the number of features that your model can detect; larger values basically allow a network to learn more text features. There's some more information below in this notebook about defining hyperparameters. In general, I'll try to start out with a pretty big model like this: multiple LSTM layers and a large hidden dimension. Then, I'll basically take a look at the loss as this model trains, and if it's decreasing, I'll keep going. But if it's not decreasing as I expect, then I'll probably change some hyperparameters.

Our text data is pretty large, and here, I've trained our entire model for 20 epochs on GPU. I can see the training and validation loss decreasing over time. Around epoch 15, I'm seeing the loss slow down a bit, but it actually looks like the validation and training loss are still decreasing even after epoch 20, so I could have stood to train for an even longer amount of time.
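Pulling the pieces of that train function together, here's a condensed sketch of the batch loop as described above. The get_batches generator is the one mentioned in this walkthrough; one_hot_encode is an assumed helper from the notebook, and the exact function signature is my guess:

```python
import torch
from torch import nn, optim

def train(net, data, epochs=10, batch_size=128, seq_length=100,
          lr=0.001, clip=5):
    opt = optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    train_on_gpu = torch.cuda.is_available()
    if train_on_gpu:
        net.cuda()

    for e in range(epochs):
        # initialize the hidden and cell states to zeros each epoch
        h = net.init_hidden(batch_size)
        for x, y in get_batches(data, batch_size, seq_length):
            # one_hot_encode is the notebook's helper (assumed name)
            x = one_hot_encode(x, len(net.chars))
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            if train_on_gpu:
                inputs, targets = inputs.cuda(), targets.cuda()

            # detach the hidden state tuple from its history, so we
            # don't backpropagate through the entire epoch
            h = tuple(each.data for each in h)

            net.zero_grad()
            output, h = net(inputs, h)
            # targets are flattened to match the stacked LSTM outputs
            loss = criterion(output,
                             targets.view(batch_size * seq_length).long())
            loss.backward()
            # clip gradients at the threshold to avoid exploding gradients
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
```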
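And here's a sketch of the instantiation step with the hyperparameters discussed above. The names chars (the 83 unique characters) and encoded (the encoded text) are assumed from the notebook, and the batch size shown is an illustrative value, not one quoted in this walkthrough:

```python
# hyperparameters from the discussion above
n_hidden = 512
n_layers = 2

net = CharRNN(chars, n_hidden, n_layers)  # chars: the 83 unique characters
print(net)

batch_size = 128    # assumed value; tune for your memory budget
seq_length = 100    # 100 characters of context per sequence
n_epochs = 20

train(net, encoded, epochs=n_epochs, batch_size=batch_size,
      seq_length=seq_length, lr=0.001)
```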
I encourage you to read this information about setting the hyperparameters of a model and really getting the best model. Then, after you've trained a model like I've just done, you can save it by name. Then, there's one last step, which is using that model to make predictions and generate some new text, which I'll go over next.
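For reference, saving the trained model by name might look something like the sketch below. The filename and the checkpoint keys here are illustrative assumptions, not the notebook's exact format:

```python
# save the model weights along with the metadata needed to rebuild it;
# the filename and dictionary keys are hypothetical
model_name = 'rnn_20_epoch.net'
checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}
with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)
```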