This is what we want our model to look like: it should take in some inputs and put those through an embedding layer, which produces embedded vectors that are sent to a final softmax output layer. Here's my model definition, and you can see that it's a pretty simple model. First, I'm defining my embedding layer, self.embed. This takes in the length of my word vocabulary, which means it will create an embedding weight matrix that has a row for each of the words in our vocabulary, and it will output vectors of size n_embed, our embedding dimension. Then, I have a fully-connected layer that takes in that embedding dimension as input, and its output size is also the length of our vocabulary. That's because this output is a series of word class scores that tells us the likely context word for a given input word. I've also defined a softmax activation layer here; you could have just done this in the forward function too, this is just one solution.

Then, in my forward function, I'm passing my input x into the embedding layer. This returns our embeddings, which move to our fully-connected layer, which returns a series of class scores. Finally, a softmax activation function is applied, and I'm left with my log probabilities for context words.

Below, in the training section, I'm actually going to instantiate this model. Here, I've defined an embedding dimension and set it to 300, but you're welcome to experiment with larger or smaller values. The embedding dimension can be thought of as the number of word features that we can detect, like the length, the type of word, and so on. So, the model takes in the entire length of our vocabulary and the embedding dimension, and I've moved it to a GPU for training. Here, you'll see that I'm using negative log-likelihood loss, and this is because a softmax in combination with negative log-likelihood basically equals cross-entropy loss.
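The model described above can be sketched in PyTorch roughly like this. This is a minimal sketch; the exact class and variable names in the notebook may differ, and `n_vocab` here stands for the length of the word vocabulary:

```python
import torch
from torch import nn

class SkipGram(nn.Module):
    """Minimal sketch of the skip-gram model described above."""
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        # one row of embedding weights per word in the vocabulary
        self.embed = nn.Embedding(n_vocab, n_embed)
        # maps an embedding back to scores over the whole vocabulary
        self.output = nn.Linear(n_embed, n_vocab)
        # log-softmax pairs with NLLLoss to give cross-entropy behavior
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.embed(x)                  # look up embeddings for input words
        scores = self.output(x)            # word class scores
        log_ps = self.log_softmax(scores)  # log probabilities for context words
        return log_ps
```

Instantiating it for training then looks something like `model = SkipGram(len(vocab_to_int), 300).to(device)` with `nn.NLLLoss()` as the criterion, where `vocab_to_int` is the word-to-index mapping assumed to be built earlier in the notebook.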
So, this is a great loss for looking at probabilities of context words. I'm using an Adam optimizer, which is just my go-to, and I'm passing in my model parameters and a learning rate. Then I have my training loop, and I've decided to train for five epochs. This training actually took a few hours, even on GPU, so I'd recommend that you train for a shorter amount of time or wait until I show you how to train more efficiently.

In my training loop, I'm getting batches of data by calling the generator function that we defined above, passing in my list of train words and a batch size. I'm getting my inputs and my target context words, converting them into LongTensor types, and moving them to a GPU if it's available. Then I'm performing backpropagation as usual: passing my inputs into my skip-gram model to get the log probabilities for the context words, applying my loss function to these context word probabilities and my targets, then performing backpropagation and updating the weights of my model, not forgetting to zero out any accumulated gradients before these two steps.

Then I'm printing out some validation examples using my cosine similarity function. Here, I'm passing in my model and a GPU device, and I'm getting back some validation examples and their similarities. I'm actually using topk sampling to get the top six most similar words to a given example. Then I'm iterating through my validation examples, printing out the first validation word and the five closest words next to it after a line character.

Here are some initial results; I printed a lot of data after training for five epochs. At first, these word associations look pretty random: we have "and", "returns", "liverpudlians", and so on. But as I train, I should see that these validation words are getting more and more similar. If I scroll down all the way to the end of my training, I can see that similar words are nicely grouped together.
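Put together, the training steps above look roughly like this. This is a self-contained sketch: a small `nn.Sequential` stack and random toy batches stand in for the real skip-gram model and the batch generator from the notebook, so the loop runs end to end. The vocabulary size, embedding dimension, and the example word index are all illustrative:

```python
import torch
from torch import nn, optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_vocab, n_embed = 20, 8   # toy sizes; the lesson uses the full vocab and 300

# stand-in for the skip-gram model: embedding -> fully-connected -> log-softmax
model = nn.Sequential(
    nn.Embedding(n_vocab, n_embed),
    nn.Linear(n_embed, n_vocab),
    nn.LogSoftmax(dim=1),
).to(device)

criterion = nn.NLLLoss()   # log-softmax + NLLLoss behaves like cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.003)

# toy (input word, target context word) batches standing in for the generator
batches = [(torch.randint(0, n_vocab, (16,)), torch.randint(0, n_vocab, (16,)))
           for _ in range(10)]

for inputs, targets in batches:
    inputs, targets = inputs.to(device), targets.to(device)
    log_ps = model(inputs)            # log probabilities for context words
    loss = criterion(log_ps, targets)
    optimizer.zero_grad()             # zero accumulated gradients first
    loss.backward()                   # backpropagation
    optimizer.step()                  # update the weights

# validation sketch: cosine similarity between one word and all embeddings
embed_vectors = model[0].weight                       # (n_vocab, n_embed)
magnitudes = embed_vectors.norm(dim=1, keepdim=True)  # per-word vector norms
valid_vec = embed_vectors[[3]]                        # an example word index
# dividing only by the embedding magnitudes leaves the ranking unchanged
similarities = (valid_vec @ embed_vectors.t()) / magnitudes.t()
_, closest_idxs = similarities.topk(6)                # word itself + 5 neighbors
```

In the notebook, the indices in `closest_idxs` would then be mapped back to words through the int-to-word dictionary before printing.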
You can see a bunch of number words grouped together. Here, I have a bunch of animals and mammals grouped in one line, some lines that are related to states and politics, and even lines that are related to a place and a language. So, it looks like my word2vec model is learning, and I can visualize these embeddings in another way, too.

Another really powerful method for visualization is called t-SNE, which stands for t-distributed stochastic neighbor embedding. It's a non-linear dimensionality reduction technique that aims to organize data in a way that clusters similar data close together and separates dissimilar data. In this case, it's an algorithm that I'm loading in from the sklearn library. I give it the number of embeddings that I want to visualize, and I get these embeddings from the weights of our embedding layer, which I'm calling by name from our model. Remember that our embedding layer was just named embed, so I can get the weights by saying model.embed.weight. Here, I'm applying t-SNE to 600 of our embeddings, and this is what the t-SNE clustering ends up looking like.

We can actually see that similar words are grouped together. Here we have east, west, north, and south. If we look to the right, we can see some musical terms: rock, music, album, band, and song. Lower down, we can see some religious terms, some colors over here, and some academic terms: school, university, and college. On the left side, I can see clusters of the months of the year, and it looks like a few integer values here. So, this clustering indicates that my word2vec model has worked: it has learned to generate embeddings that hold semantic meaning, and this also gives us a cool way to visualize the relationships between words in space.

So, one problem with this model was that it took quite a while to train, and next I'm going to address that challenge.
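The t-SNE step described above looks roughly like this. In this sketch, a randomly initialized embedding layer stands in for the trained `model.embed`, and the word labels from the notebook's int-to-word dictionary are left as a comment:

```python
import torch
from torch import nn
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

viz_words = 600                       # number of embeddings to visualize
embed = nn.Embedding(viz_words, 300)  # stand-in for model.embed after training
embeddings = embed.weight.detach().cpu().numpy()[:viz_words]

# project the 300-dimensional embeddings down to 2D
tsne = TSNE()
embed_tsne = tsne.fit_transform(embeddings)

fig, ax = plt.subplots(figsize=(16, 16))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx], color='steelblue')
    # with the notebook's dictionary, each point would be labeled with its word:
    # plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]))
```

With trained weights instead of this random stand-in, the 2D scatter plot is where the clusters of directions, months, and musical terms described above show up.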