So, I ran all the cells in my notebook and here’s my solution and definition for the SkipGramNeg module. First, I’ve defined my two embedding layers, in_embed and out_embed, and they’ll both take in the size of our word vocabulary and produce embeddings of size n_embed. So, mapping from our vocab to our embedding dimension. Here, I’m doing an additional step, which is initializing the embedding look-up tables with uniform weights between negative one and one. I’m doing this for both of our layers, and I believe this helps our model reach a good solution faster. Then I’ve defined my three forward functions. Forward input passes our input words through our input embedding layer and returns input embedding vectors. I do the same thing in forward output, only passing that through our output embedding layer to get output vectors. Notice that there are no linear layers or softmax activation functions here. The last forward function is forward noise, which will return noisy target embeddings. So, this samples noise words from our noise distribution, and the number of samples it returns is batch size times n_samples. Then we get the embeddings by passing those noise words through our output embedding layer. In the same line, I’m reshaping these to be the size I want, which is batch size by n_samples by our embedding dimension, and I return those vectors. Okay, so this completes the SkipGramNeg module. Next, I’m defining a custom negative sampling loss. This was carefully defined above in our equations, and I haven’t gone into the details of defining a custom loss before, but suffice to say that it is really similar to defining a model class. Only in this case, the init function is left empty and we’re really just left with defining the forward function. The forward function typically takes in some inputs and targets, and you can define what it takes in as parameters here. It should return a single value that indicates the average loss over a batch of data.
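The module described in the narration above might be sketched like this. This is a minimal sketch, not the notebook’s exact code: names like `in_embed`, `out_embed`, `n_embed`, and the `noise_dist` default are assumptions based on what was said.

```python
import torch
from torch import nn

class SkipGramNeg(nn.Module):
    """Sketch of the SkipGram model with negative sampling described above."""

    def __init__(self, n_vocab, n_embed, noise_dist=None):
        super().__init__()
        self.n_vocab = n_vocab
        self.n_embed = n_embed
        self.noise_dist = noise_dist  # assumption: passed in at construction

        # two embedding layers, each mapping vocab size -> embedding dimension
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        self.out_embed = nn.Embedding(n_vocab, n_embed)

        # initialize both look-up tables with uniform weights in [-1, 1]
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)

    def forward_input(self, input_words):
        # input words -> input embedding vectors (no linear layer, no softmax)
        return self.in_embed(input_words)

    def forward_output(self, output_words):
        # target words -> output embedding vectors
        return self.out_embed(output_words)

    def forward_noise(self, batch_size, n_samples):
        # sample batch_size * n_samples noise words from the noise distribution
        # (uniform over the vocab if no distribution was provided)
        if self.noise_dist is None:
            noise_dist = torch.ones(self.n_vocab)
        else:
            noise_dist = self.noise_dist
        noise_words = torch.multinomial(noise_dist,
                                        batch_size * n_samples,
                                        replacement=True)
        # embed through the OUTPUT layer and reshape to
        # (batch_size, n_samples, n_embed)
        return self.out_embed(noise_words).view(batch_size,
                                                n_samples,
                                                self.n_embed)
```

Note that `forward_noise` embeds the sampled words with the output embedding layer, since the noise words play the role of (incorrect) targets.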
So, in this case, I want my loss to look at an input embedding vector, my correct output embedding, and my incorrect noisy vectors. So here, I am getting the batch size and embedding dimension from the shape of my input vector, then I’m reshaping my input vector into a shape that is batch first, and I’m doing something similar to my output vector here, only I’m swapping the last two dimensions, one and embed size, effectively making this the output vector transposed. This way, I’ll be able to calculate the dot product between these two vectors by performing batch matrix multiplication on them, and that’s just what I’m doing here. First, I’m calculating the loss term between my input vector and my correct target vector. I’m using batch matrix multiplication and then applying a sigmoid and a log function. Here, I’m squeezing the output so that no empty dimensions are left in the output. Next, I’m doing something similar, only between my input vector and my negated noise vectors. So, this is the second term in our loss function. I’m using batch matrix multiplication, applying a sigmoid and a log function, and then I’m summing the losses over the sample of noise vectors. Okay, finally, I’m adding these two losses up, negating them since I kept them positive during my calculations, and taking the mean of this total loss. This way, I’m returning the average negative sampling loss over a batch of data. Then I can move on to creating this model and training it. This training loop will look pretty similar to before, but with some key differences. First, I’m creating a unigram noise distribution that relates noise words to their frequency of occurrence, and this is a value I calculated earlier in this notebook. So, I’m defining our noise distribution as the unigram distribution raised to a power of three-fourths, as was specified in the paper.
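A sketch of the custom loss just described could look like this in PyTorch. The class name and the exact chaining of operations are assumptions; the reshapes, batch matrix multiplications, sigmoid, log, sum, and negated mean follow the narration.

```python
import torch
from torch import nn

class NegativeSamplingLoss(nn.Module):
    """Sketch of the negative sampling loss described above."""
    # the init function is left empty; only forward is defined

    def forward(self, input_vectors, output_vectors, noise_vectors):
        # batch size and embedding dimension from the input's shape
        batch_size, embed_size = input_vectors.shape

        # batch-first "column" vector: (batch, embed_size, 1)
        input_vectors = input_vectors.view(batch_size, embed_size, 1)
        # last two dims swapped, i.e. the transpose: (batch, 1, embed_size)
        output_vectors = output_vectors.view(batch_size, 1, embed_size)

        # first term: dot product with the correct target via bmm,
        # then sigmoid and log; squeeze away the empty dimensions
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log()
        out_loss = out_loss.squeeze()

        # second term: negated noise vectors against the input vector,
        # summed over the n_samples noise vectors
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        noise_loss = noise_loss.squeeze().sum(1)

        # negate (both terms were kept positive) and average over the batch
        return -(out_loss + noise_loss).mean()
```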
Then, I’m defining my model, passing in the length of our vocabulary, an embedding dimension, which I left as 300, and this noise distribution that I’ve just created, and I’m moving this all to GPU. Then I have another key difference: instead of using NLL loss, I’m using my custom negative sampling loss that I defined above. In my training loop, I’ll have to pass in three parameters to this loss function. So, I’m training for five epochs again, getting batches of input and target words. Then, using my three different forward functions, I’m getting my input embedding, my desired output embedding, and my noise embeddings. So, forward input takes in my inputs, forward output takes in my targets, and forward noise takes in two parameters: a batch size and a number of noise vectors to generate. Then to calculate my loss, I’m passing in my input, output, and noise embeddings here. Then, I just have the same code as before, performing backpropagation and optimization steps as usual, and I have my validation similarities that I’m going to print out along with the epoch and loss, a little more information. So, note that I chose to define my three different forward functions just so I get the vectors that I needed to calculate my negative sampling loss here. You can try training this yourself just to see how much faster this training is. I’m also printing data less frequently because it’s generated quicker. So here, after the first epoch, we see our usual sort of noisy relationships. But by the end of training, we see words grouped together that make sense. So, we have mathematics, algebra, calculus; we have ocean, islands, Pacific, Atlantic; and some smaller words that all seem to be grouped together as well. Once again, I visualize the word vectors using t-SNE. This time I’m visualizing fewer words, and I’m getting the embeddings from our input embedding layer only. Then I’m passing these embeddings into our t-SNE model, and this is the result I get.
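The training setup described above can be sketched as follows. So that the snippet runs on its own, it repeats compact versions of the module and loss, and it substitutes toy random data for the notebook’s word frequencies and batch generator; the learning rate, optimizer choice, and all names are assumptions.

```python
import torch
from torch import nn, optim

# compact stand-ins for the SkipGramNeg module and custom loss discussed above
class SkipGramNeg(nn.Module):
    def __init__(self, n_vocab, n_embed, noise_dist):
        super().__init__()
        self.n_embed = n_embed
        self.noise_dist = noise_dist
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        self.out_embed = nn.Embedding(n_vocab, n_embed)
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)

    def forward_input(self, words):
        return self.in_embed(words)

    def forward_output(self, words):
        return self.out_embed(words)

    def forward_noise(self, batch_size, n_samples):
        noise_words = torch.multinomial(self.noise_dist,
                                        batch_size * n_samples,
                                        replacement=True)
        return self.out_embed(noise_words).view(batch_size, n_samples,
                                                self.n_embed)

class NegativeSamplingLoss(nn.Module):
    def forward(self, input_vectors, output_vectors, noise_vectors):
        batch_size, embed_size = input_vectors.shape
        input_vectors = input_vectors.view(batch_size, embed_size, 1)
        output_vectors = output_vectors.view(batch_size, 1, embed_size)
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log()
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        return -(out_loss.squeeze() + noise_loss.squeeze().sum(1)).mean()

# toy stand-ins: in the notebook, freqs comes from the word counts
# computed earlier, and batches come from a batch generator
n_vocab, n_embed, n_samples = 50, 16, 5
freqs = torch.rand(n_vocab)
# unigram distribution raised to the 3/4 power, per the paper
noise_dist = freqs ** 0.75 / (freqs ** 0.75).sum()

model = SkipGramNeg(n_vocab, n_embed, noise_dist)  # .to(device) on GPU
criterion = NegativeSamplingLoss()                 # custom loss, not NLL
optimizer = optim.Adam(model.parameters(), lr=0.003)  # lr is an assumption

for step in range(20):  # stands in for the epoch / batch loops
    inputs = torch.randint(0, n_vocab, (32,))
    targets = torch.randint(0, n_vocab, (32,))

    # the three forward passes, producing the three loss inputs
    input_vectors = model.forward_input(inputs)
    output_vectors = model.forward_output(targets)
    noise_vectors = model.forward_noise(inputs.shape[0], n_samples)

    loss = criterion(input_vectors, output_vectors, noise_vectors)

    # backpropagation and optimization, same as before
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```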
I can see some individual integers grouped over here, some educational terms, and war and military terms over here. I see some governmental terms and other relationships, and it’s pretty interesting to poke around a visualization like this. The word2vec model always makes me think about how a learned vector space can be really interesting. Just think about how you might embed images and find relationships between colors and objects, or how you might transform words using vector arithmetic. Building and training this model was also quite involved, and if you feel comfortable with this model code, especially manipulating models to add your own forward functions and custom loss types, you’ve really learned a lot about the Pythonic nature of PyTorch and model customization, in addition to implementing a very effective word2vec model. So, great job on making it this far, and I hope you’re excited to learn even more.