We’ve talked a bit about how neural networks are designed to learn from numerical data. In our case, word embedding is really all about improving the ability of networks to learn from text data. The idea is this: embeddings can greatly improve the ability of networks to learn from text data by representing that data as lower-dimensional vectors.

Let’s think about this in an example. Usually, when you’re dealing with text and you split things up into words, you tend to have tens of thousands of different words in a large data set. When you’re using these words as input to a network like an RNN, we’ve seen that we can one-hot encode them. What that means is that you have these giant vectors that are, say, 50,000 units long, where only one unit is set to one and all the others are set to zero. Then you pass this long vector as input to some hidden layer in the network. The output of this hidden layer is calculated by multiplying that input vector by a matrix of learned weights. This involves a huge number of multiplications, most of which are by zero because of the initial one-hot vector. So all these computing resources are spent on values that hold no information, which is really computationally inefficient.

To solve this problem, we can use embeddings, which basically provide a shortcut for doing this matrix multiplication. To learn word embeddings, we use a fully-connected linear layer like you’ve seen before. We’ll call this layer the embedding layer, and its weights are the embedding weights. These weights are values that are learned during training of this embedding model, and they make up a useful weight matrix. With this matrix, we can skip the big multiplication step from before by instead grabbing the values for the output of our hidden layer directly from a row in our weight matrix.
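To make that concrete, here’s a small sketch in NumPy. The sizes here are made up for illustration (a real vocabulary would have tens of thousands of words), but it shows that multiplying a one-hot vector by the weight matrix just recovers a single row of that matrix:

```python
import numpy as np

# Tiny illustrative sizes; a real vocabulary would be ~50,000 words.
vocab_size = 5
embedding_dim = 3

rng = np.random.default_rng(0)
weights = rng.standard_normal((vocab_size, embedding_dim))

# One-hot vector for the word at index 2.
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0

# Full matrix multiplication: almost every multiply is by zero.
hidden = one_hot @ weights

# The product is identical to simply reading row 2 of the weight matrix.
print(np.allclose(hidden, weights[2]))  # True
```

All of those multiplications by zero contribute nothing, which is exactly the waste the embedding lookup avoids.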
We can do this because multiplying a one-hot encoded vector by a weight matrix returns only the row of the matrix that corresponds to the index of the one, or the “on” input unit. So, instead of doing matrix multiplication, we can use the embedding weight matrix as a lookup table. Instead of representing words as one-hot vectors, we can encode each word as a unique integer.

As an example, say we have the word “heart” encoded as the integer 958. Then, to get the hidden layer values for “heart”, we just take the 958th row of the embedding weight matrix. This process is called an embedding lookup, and the number of hidden units is the embedding dimension.

So, the embedding lookup table is just a weight matrix, and the embedding layer is just a hidden layer. It’s important to know that the lookup table holds weights that are learned during training, just like any weight matrix. This is the basic idea behind how embedding works. In the next few sections, you’ll see how Word2Vec uses the embedding layer to find vector representations of words that contain semantic meaning.