So in this notebook, I'll be leading you through a Word2Vec implementation in PyTorch. Now, you've just learned about the idea behind embeddings in general. For any dataset with lots of classes or input dimensions, like a large word vocabulary, we're basically skipping the one-hot encoding step, which would result in extremely long input vectors of mostly zeros. We're taking advantage of the fact that when a one-hot vector is multiplied by a weight matrix, we just get one row of the matrix back. For example, if we have a one-hot vector that has its fourth index on, and we multiply it by a weight matrix, we'll get the fourth row of weights back as a result. So, what we can do is just pass in integer indices instead of one-hot vectors, and then use an embedding weight matrix to look up the correct output. In this case, the word heart is encoded as the integer 958, and we can look up the embedding vector for this word in row 958 of an embedding weight matrix; this matrix is also often called a lookup table.

In text analysis, this is great, because we already know that we can convert a vocabulary of words into integer tokens, so each unique word will have a corresponding integer value. If we have a vocabulary of 10,000 words, we'll have a 10,000-row embedding weight matrix where we can look up the correct output values. These output values, which are just rows of this weight matrix, are a vector representation of the input word. These representations are called embeddings, and they have as many values as the weight matrix has columns. This width is called the embedding dimension, and it's usually some value in the hundreds.
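To make this lookup idea concrete, here is a minimal sketch in PyTorch. It assumes a vocabulary of 10,000 words and an embedding dimension of 300, and it reuses the token 958 for heart from the example above; the exact layer names and sizes in the notebook may differ.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10000, 300   # assumed sizes from the examples above

# The embedding layer is just a (vocab_size x embedding_dim) weight matrix
# that we index into with integer tokens instead of one-hot vectors.
embed = nn.Embedding(vocab_size, embedding_dim)

# "heart" tokenized as the integer 958: the lookup returns row 958 of the weights.
heart_vector = embed(torch.tensor([958]))          # shape: (1, 300)

# Same result as multiplying a one-hot vector by the weight matrix.
one_hot = torch.zeros(1, vocab_size)
one_hot[0, 958] = 1.0
same_vector = one_hot @ embed.weight               # also picks out row 958

print(torch.allclose(heart_vector, same_vector))   # True
```

The point is that the integer lookup skips the large, mostly-zero matrix multiplication entirely, which is what makes the embedding layer efficient.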
Now, Word2Vec is a special algorithm that basically says, "Any words that appear in the same context in a given text should have similar vector representations." Context, in this case, basically means the words that come before and after a word of interest. Here are a couple of examples. In a body of text, you'll find a variety of sentences, and these examples all involve drinking some beverage. In some of these cases, even if I removed the word of interest, you may be able to guess what goes there just based on the context words surrounding it. So here it says, "I often drink coffee in the mornings. When I'm thirsty, I drink water, and I drink tea before I go to sleep." The mention of "I drink" before these words makes these contexts similar. So, we'll expect the words coffee, water, and tea to have similar word embeddings. You can imagine that if we look at a large enough text, we may also see that coffee is more closely associated with morning time, and so on.

So, just by looking at a word of interest and some context words that surround it, Word2Vec can find similarities between words and relationships between them. In fact, for such similar words, Word2Vec should produce vectors that are very close in vector space, while different words will be some distance away from each other. In this way, we're actually able to do vector arithmetic, and that's how Word2Vec can find mappings between words in the past and present tense, for example. So, mapping the verb drink to drinking should be roughly the same transformation in vector space as mapping swam to swimming.

In practice, Word2Vec is implemented in one of two ways. The first option is to give our model the context, so several words surrounding a word of interest, and have it try to predict the missing word. So, context words in and a single word out; this is called the continuous bag of words, or CBOW, model. The second option is the reverse: to input our word of interest and have our model try to predict the context. So, one word in and a few context words out. This is the skip-gram model, and we'll be implementing Word2Vec in this way because it's been shown to work a bit better.

You'll notice that for either of these models, we'll also have to formalize the idea of context to be a window of a specified size, something like two words before and two words after a word of interest. Here, for an input word w at time t, we have context words from t minus two to t plus two, that is, two words in the past and two in the future. Notice that the context does not include the original word of interest.

So, now that you've been introduced to this notebook and the Word2Vec skip-gram model, next I'll show you the data that we'll be working with and give you your first exercise.
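Before that, here is a minimal sketch of how (input word, context word) training pairs for the skip-gram model might be built from a list of integer tokens, using a fixed window of two words on each side. The helper name get_context and the toy token values are made up for illustration, and the actual notebook may handle the window differently, for example by sampling a smaller window size at random for each word.

```python
def get_context(words, idx, window_size=2):
    """Return the tokens within window_size of words[idx], excluding words[idx] itself."""
    start = max(0, idx - window_size)
    stop = idx + window_size + 1
    return words[start:idx] + words[idx + 1:stop]

# Toy example: integer tokens standing in for words in a text.
tokens = [5234, 958, 741, 10, 27, 5233]

# Each (input word, context word) pair becomes one training example for skip-gram.
pairs = [(tokens[i], c) for i in range(len(tokens)) for c in get_context(tokens, i)]
print(pairs[:4])   # [(5234, 958), (5234, 741), (958, 5234), (958, 741)]
```

Given pairs like these, the model looks up the embedding of the input token and tries to predict each of its context tokens, which is the training signal that pushes words with similar contexts toward similar vectors.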