Now, the last model took quite a while to train, and there are some ways that we can speed up this process. In this video, I'll talk about one such method, which is called negative sampling. So, this is a new notebook, but it contains basically the same info as our previous notebook, including this architecture diagram. This is our current architecture, where we have a softmax layer on the output, and since we're working with tens of thousands of words, the softmax layer is going to have tens of thousands of units. But with any one input, we're really only going to have one true context target. What that means is, when we train, we're going to be making very small changes to the weights between these two layers, even though we only have one true output that we care about. So, very few of the weights are actually going to be updated in a meaningful way.

Instead, what we can do is approximate the loss from the softmax layer, and we do this by only updating a small subset of all the weights at once. We'll update the weights for what we know to be the correct target output, but then we'll only update a small number of incorrect, or noise, targets, usually around 100 or so as opposed to 60,000. This process is called negative sampling.

To implement this, there are two main modifications we need to make to our model. First, since we're not taking the softmax output over all the words, we're really only concerned with one output word at a time. Similar to how we used an embedding layer to map an input word to a row of embedding weights, we can now use another embedding layer to map the output words to a row of hidden weights. So, we'll have two embedding layers, one for input words and one for output words.

Second, we have to use a modified loss function that only cares about the true target and a small subset of noisy, incorrect target context words, and that's this big loss function here. It's a little heavy on notation, so I'll go over it one part at a time. Let's take a look at the first term. We can see that this is a negative log operation, and this little loop, this lowercase sigma, is a sigmoid activation function. A sigmoid activation function squashes any input into a range from zero to one. So, let's look at the input inside the parentheses. u_wO transpose is the embedding vector for our output target word; this is the embedding vector that we know is the correct context target for a given input word, and this T here is the transpose symbol. Then we have v_wI, which is the embedding vector for our input word. In general, u will indicate an output embedding and v an input embedding. If you remember from doing cosine similarity, a transpose multiplication like this is equivalent to doing a dot product operation. So, this whole first term is saying that we take the log-sigmoid of the dot product of our correct output word vector with our input word vector, and this represents our correct target loss.

Next, we want to sample our outputs and get some noisy target words, and that's what the second part of this equation is all about. So, let's look at this piece by piece. This capital sigma means we're going to take a sum over words w_i. This P_n(w) indicates that these words are drawn from a noise distribution. The noise distribution is our vocabulary of words that are not in the context of our input word. In effect, we want to randomly sample words from our vocabulary to get these noisy, irrelevant target words.
So, P_n(w) is an arbitrary probability distribution, which means we get to decide how to weight the words that we're sampling. This could be a uniform distribution, where we sample all words with equal probability, or it could be according to the frequency that each word shows up in our text corpus, the unigram distribution U(w). In fact, the authors of the negative sampling paper found the best distribution to be the unigram distribution raised to the three-fourths power. Then we get to this last part, which looks very similar to our first term. This takes the log-sigmoid of the negated dot product between a noise vector u_wi and our input vector from before.

To give you an intuition for what this whole loss is doing, remember that the sigmoid function returns a probability between zero and one. So, the first term in this loss is going to push the probability that our network will predict the correct context word towards one. In the second term, since we're negating the sigmoid input, we're pushing the summed probabilities that our network will predict the incorrect, noisy words towards zero. Okay. So next, I'll present your task, which will be to define this negative sampling model in code.
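Before you dive into the exercise, here is a minimal sketch of what the two-embedding-layer model and the noise distribution could look like in PyTorch. The class and method names (SkipGramNeg, forward_input, forward_output, forward_noise), the tensor shapes, and the toy word counts are my own assumptions for illustration, not the notebook's required solution.

```python
import torch
from torch import nn


class SkipGramNeg(nn.Module):
    """Sketch of a skip-gram model with separate input and output embeddings.

    Names and shapes here are assumptions, not the notebook's exact solution.
    """
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        # one embedding table for input (center) words...
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        # ...and a second one for output (context and noise) words
        self.out_embed = nn.Embedding(n_vocab, n_embed)

    def forward_input(self, input_words):
        # map input word indices to rows of the input embedding weights
        return self.in_embed(input_words)            # (batch, n_embed)

    def forward_output(self, output_words):
        # map correct context word indices to rows of the output weights
        return self.out_embed(output_words)          # (batch, n_embed)

    def forward_noise(self, batch_size, n_samples, noise_dist):
        # draw noise word indices from the noise distribution P_n(w)
        noise_words = torch.multinomial(noise_dist,
                                        batch_size * n_samples,
                                        replacement=True)
        # look up their output embeddings: (batch, n_samples, n_embed)
        return self.out_embed(noise_words).view(batch_size, n_samples, -1)


# Noise distribution: unigram distribution raised to the 3/4 power, renormalized.
# `word_freqs` stands in for a 1-D tensor of real word counts (toy values here).
word_freqs = torch.tensor([10., 4., 2., 1., 1.])
noise_dist = word_freqs ** 0.75
noise_dist = noise_dist / noise_dist.sum()

model = SkipGramNeg(n_vocab=5, n_embed=8)
noise_vectors = model.forward_noise(batch_size=3, n_samples=2, noise_dist=noise_dist)
print(noise_vectors.shape)                           # torch.Size([3, 2, 8])
```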
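And here is a sketch of how the loss we just walked through could be written, under the same assumed shapes: input vectors are (batch, embed), correct output vectors are (batch, embed), and noise vectors are (batch, n_samples, embed). The batch matrix multiplications are just the dot products from the equation.

```python
import torch
from torch import nn


class NegativeSamplingLoss(nn.Module):
    """Sketch of the negative sampling loss described above (shapes assumed)."""
    def forward(self, input_vectors, output_vectors, noise_vectors):
        batch_size, embed_size = input_vectors.shape

        # reshape so batch matrix multiplication gives us dot products
        input_vectors = input_vectors.view(batch_size, embed_size, 1)    # column vectors
        output_vectors = output_vectors.view(batch_size, 1, embed_size)  # row vectors

        # first term: log-sigmoid of the dot product with the correct
        # context word -- pushes this probability toward one
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log().squeeze()

        # second term: log-sigmoid of the *negated* dot products with the
        # noise words, summed over the samples -- pushes these toward zero
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        noise_loss = noise_loss.squeeze().sum(1)

        # negate (it's a negative log likelihood) and average over the batch
        return -(out_loss + noise_loss).mean()


# quick shape check with random vectors
batch, embed, n_samples = 3, 8, 2
criterion = NegativeSamplingLoss()
loss = criterion(torch.randn(batch, embed),
                 torch.randn(batch, embed),
                 torch.randn(batch, n_samples, embed))
print(loss)   # a single scalar
```

In the actual exercise, the input, output, and noise vectors would come from the model's two embedding layers rather than from torch.randn; the random tensors here are only to show that the shapes line up.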