Now that we’ve taken the time to preprocess and batch our data, it’s time to actually start building the network. Here, we can see the general structure of the network that we’re going to build. So, we have our inputs, which are going to be batches of our train word tokens, and as we saw when we loaded in a batch, a lot of these values will actually be repeated in this input vector. So, we’re going to be passing in a long list of integers, which go into this hidden layer, our embedding layer. The embedding layer is responsible for looking at these input integers and basically creating a lookup table. So, for each possible integer value, there will be a row in our embedding weight matrix, and the width of the matrix will be the embedding dimension that we define. That dimension will be the size of the embedding layer’s outputs. Then these embeddings are fed into a final, fully-connected softmax output layer. Remember that in the skip-gram model, we’re passing in some input words and we’re training this whole model to generate target context words. So, for one input value, the targets will be randomly selected context words from a window around the input word. Our output layer is going to output the probability that a randomly selected context word is going to be the word "the", or "of", or "nine", or any other word in our vocabulary. We’re going to be trying to predict our target context words using the outputs of the softmax layer, basically looking at the words with the highest probability of being context words. Then, when we train everything, what’s going to happen is that our hidden layer is going to form these vector representations of the input words. So, each row in the embedding lookup table will be a vector representation for a word. Row zero will be the embedding for the word "the", for example. These vectors contain some semantic meaning, and that’s what we’re really interested in. We only really care about these embeddings.
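To make the lookup-table idea concrete, here’s a minimal sketch of how PyTorch’s embedding layer behaves; the vocabulary size and embedding dimension here are made up purely for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: a 10-word vocabulary, 3-dimensional embeddings
embed = nn.Embedding(num_embeddings=10, embedding_dim=3)

# A batch of input word tokens (integers); note the repeated token 2
inputs = torch.LongTensor([0, 2, 2, 5])
vectors = embed(inputs)

# Each integer indexes one row of the embedding weight matrix
print(vectors.shape)                        # torch.Size([4, 3])
print(torch.equal(vectors[1], vectors[2]))  # True: same token, same embedding row
```

Repeated input tokens simply pull out the same row of the weight matrix, which is why the repeated values in our batches are not a problem.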
From these embeddings, we can do some interesting things: performing vector math to see which of our words are most similar, or using these embeddings as input to another model that works with the same text input data. So, when we’re done training, we can actually just get rid of this last softmax layer, because it’s just there to help us train this model and create correct embeddings in the first place. Okay. So, right before we define the model, I have a function that will help us see what kind of word relationships this model is learning. When I introduced the idea of word2vec, I mentioned that representing words as vectors gives us the ability to mathematically operate on these words in vector space. To see which words are similar, I’m going to calculate how similar vectors are using cosine similarity. Cosine similarity looks at two vectors, a and b, and the angle between them, theta. It says, “Okay. The similarity between these two vectors is just the cosine of the angle between them.” If you’re familiar with vector math, that can also be calculated as the normalized dot product of a and b. You can really just think of it like this. When theta is zero, cosine of theta is equal to one. This is the maximum value that cosine can take. When theta is 90 degrees, or rather, when these vectors are orthogonal to one another, then the cosine is going to be zero, and when the vectors point in opposite directions, the cosine is negative one. So, the similarity really ends up being a value between negative one and one that indicates how similar two vectors are in vector space. So, let’s look at this cosine similarity function. This function takes in an embedding layer, a validation size, and a validation window. In here, I’m getting the embeddings from the passed-in embedding layer. These are just the layer weights. Then, I’m doing some math and storing the magnitudes of these embedding vectors. That magnitude is just the square root of the sum of the embedding vector’s squared values. Then, I’m randomly selecting some common and uncommon validation word examples.
These are just integers in a range, in this case, from zero to 1,000 for common words, and from a higher range for uncommon words. Recall that lower indices indicate that a word appears more frequently. So, I’m generating half of our validation examples from a more common range, and half from a more uncommon range. These are collected in a NumPy array and then converted into a long tensor type. Then, I’m passing these validation examples into the embedding layer, and in return, I get their vector representations back. So, these validation words are encoded as our vectors, a, and we’re going to calculate the similarity between a and each word vector, b, in the embedding table. We mentioned that the similarity is a dot product of a and b over their magnitudes. This dot product is just a matrix multiplication between the validation vectors, a, and the transpose of the embedding vectors, b. Here, I’m dividing by the magnitude; this is not the exact equation, but it will give us valid values for similarities, just scaled by a constant. This function returns the validation examples and similarities, which gives us all we need to later print out the validation words and the words in our embedding table that are semantically similar to those words. It’s going to be a nice way to check that our embedding table is grouping together words with similar semantic meanings. So, this is a given function; you don’t have to change anything about it. Now, on to defining the model. So, we know our model accepts some inputs, then it has an embedding layer, and a final softmax output layer. You’ll have to define this using PyTorch’s embedding layer, which you can read about here. Here’s the documentation. So, the embedding layer is known as a sparse layer type. It takes in a number of input embeddings, which is going to be the number of rows in your embedding weight lookup matrix, and an embedding dimension. This is the size of each embedding vector.
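A sketch of what such a function could look like, assuming the parameter names valid_size and valid_window and the common/uncommon sampling ranges described above; note that we divide only by the magnitudes of the table vectors b, so each row of similarities is off by the constant factor 1/|a|, which doesn’t change the ranking of similar words:

```python
import random
import numpy as np
import torch
import torch.nn as nn

def cosine_similarity(embedding, valid_size=16, valid_window=1000, device='cpu'):
    """Return validation word indices and their (scaled) cosine similarities
    to every vector in the embedding table."""
    # All embedding vectors, b: these are just the layer weights
    embed_vectors = embedding.weight
    # Magnitudes: sqrt of the sum of the squared vector components, per row
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)

    # Half the examples from a common (low-index) range,
    # half from a more uncommon (higher-index) range
    valid_examples = np.array(random.sample(range(valid_window), valid_size // 2))
    valid_examples = np.append(
        valid_examples,
        random.sample(range(1000, 1000 + valid_window), valid_size // 2))
    valid_examples = torch.LongTensor(valid_examples).to(device)

    # Look up the validation vectors, a, then compute (a . b^T) / |b|
    valid_vectors = embedding(valid_examples)
    similarities = torch.mm(valid_vectors, embed_vectors.t()) / magnitudes
    return valid_examples, similarities
```

Calling this with a trained embedding layer, you can take the top few columns of each similarity row to print the nearest words for each validation example.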
The number of columns in your embedding lookup table. These two are the most important inputs when defining this layer. So, after the embedding layer, you’ll define a linear layer to go from our embedding size to our predicted context words. You’ll also have to apply a softmax function to the output, so that this model returns word probabilities. So, here’s the skeleton code for this model, and when we instantiate this model, we’re going to be passing in input values for n_vocab, the size of our vocabulary, and n_embed, our embedding dimension. So, you should be able to complete the __init__ and forward functions for this model. When you do that, you should be able to proceed with training using the provided training loop below. I’d really recommend training on GPU. Training this particular model takes quite a while, even on GPU, so I’d start training with maybe just one or two epochs for now. All right. So, I’ll leave this as an exercise, and next, I’ll go over one solution for defining a skip-gram model and training it.
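For reference, here is a minimal sketch of the shape a completed model could take; this is one possible solution, not necessarily the one covered next, and it uses log_softmax (which pairs with NLLLoss during training) rather than a plain softmax:

```python
import torch
from torch import nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        # Lookup table: n_vocab rows, n_embed columns
        self.embed = nn.Embedding(n_vocab, n_embed)
        # Linear layer mapping an embedding to scores over the vocabulary
        self.output = nn.Linear(n_embed, n_vocab)

    def forward(self, x):
        x = self.embed(x)          # (batch,) -> (batch, n_embed)
        x = self.output(x)         # (batch, n_embed) -> (batch, n_vocab)
        # Log-probabilities over the vocabulary; exponentiating a row
        # recovers the word probabilities the softmax layer describes
        return F.log_softmax(x, dim=1)

# Hypothetical instantiation: n_vocab from your vocabulary, n_embed chosen by you
model = SkipGram(n_vocab=100, n_embed=16)
```

Each row of the output exponentiates to a distribution that sums to one, which is exactly the context-word probability the skip-gram objective trains against.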