Word embeddings are fast becoming the de facto choice for representing words, especially for use in deep neural networks. But why do these techniques work so well? Doesn't it seem almost magical that you can actually do arithmetic with words, like woman minus man plus king equals queen? The answer might lie in the distributional hypothesis, which states that words that occur in the same contexts tend to have similar meanings. For example, consider this sentence: Would you like to have a cup of blank? Okay. How about: I like my blank black. One more: I need my morning blank before I can do anything. What are you thinking? Tea? Coffee? What gave you the hint? Cup? Black? Morning? But it could be either of the two, right? And that's the point. In these contexts, tea and coffee are actually similar. Therefore, when a large collection of sentences is used to learn an embedding, words with common context words tend to get pulled closer and closer together. Of course, there could also be contexts in which tea and coffee are dissimilar. For example: blank grounds are great for composting. Or: I prefer loose-leaf blank. Here we are clearly talking about coffee grounds and loose-leaf tea.

How do we capture these similarities and differences in the same embedding? By adding another dimension. Let's see how. Words can be close along one dimension; here, tea and coffee are both beverages, but separated along some other dimension. Maybe this dimension captures all the variability among beverages. In a human language, there are many more dimensions along which word meanings can vary, and the more dimensions you can capture in your word vector, the more expressive that representation will be. But how many dimensions do you really need?

Consider a typical neural network architecture designed for an NLP task, say word prediction. It's common to use a word embedding layer that produces a vector with a few hundred dimensions, which is significantly smaller than using one-hot encodings directly, which are as large as the vocabulary itself, sometimes tens of thousands of words. Also, if you learn the embedding as part of the model training process, you can obtain a representation that captures the dimensions most relevant for your task, but this often adds complexity. So unless you're building a model for a very narrow application, like one that deals with medical terminology, you can use a pre-trained embedding as a lookup, for example Word2Vec or GloVe. Then you only need to train the layers specific to your task.

Compare this with the network architecture for a computer vision task, say image classification. The raw input here is also very high dimensional; for example, even a 128-by-128 image contains over 16,000 pixels. We typically use convolutional layers to exploit the spatial relationships in image data and reduce this dimensionality. Early stages of visual processing are often transferable across tasks, so it is common to use some pre-trained layers from an existing network, like AlexNet or VGG-16, and only learn the later layers. Come to think of it, using an embedding lookup for NLP is not unlike using pre-trained layers for computer vision. Both are great examples of transfer learning.
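To make the word-arithmetic idea concrete, here is a minimal sketch in plain NumPy. The vectors are tiny, hand-picked toy values used purely for illustration, not real learned embeddings; with actual Word2Vec or GloVe vectors, the same nearest-neighbor search over many more dimensions is what surfaces queen for king minus man plus woman.

```python
import numpy as np

# Toy, hand-picked 4-dimensional vectors purely for illustration; real
# embeddings have a few hundred dimensions and are learned from large
# corpora, not written by hand.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8, 0.1]),
    "tea":   np.array([0.2, 0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" should land closest to "queen".
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```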
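Below is one way the embedding lookup could slot into a word-prediction model, sketched in PyTorch. The random matrix stands in for a pre-trained Word2Vec or GloVe table, and the GRU-plus-linear head is an assumed task head; the point is that the embedding stays a frozen lookup while only the task-specific layers are trained.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # a one-hot vector would be this wide
EMBED_DIM = 300       # a typical "few hundred" dimensional embedding

# Stand-in for a pre-trained embedding matrix; in practice you would load
# Word2Vec or GloVe weights here instead of random numbers.
pretrained_weights = torch.randn(VOCAB_SIZE, EMBED_DIM)

class WordPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # freeze=True keeps the lookup fixed; only the task layers train.
        self.embedding = nn.Embedding.from_pretrained(pretrained_weights,
                                                      freeze=True)
        self.rnn = nn.GRU(EMBED_DIM, 128, batch_first=True)
        self.out = nn.Linear(128, VOCAB_SIZE)  # predict the next word

    def forward(self, word_ids):              # word_ids: (batch, seq_len)
        embedded = self.embedding(word_ids)   # (batch, seq_len, 300)
        _, hidden = self.rnn(embedded)        # hidden: (1, batch, 128)
        return self.out(hidden[-1])           # (batch, vocab_size)

model = WordPredictor()
logits = model(torch.randint(0, VOCAB_SIZE, (2, 5)))  # batch of 2 sequences
print(logits.shape)  # torch.Size([2, 10000])
```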
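And on the computer-vision side, here is a sketch of the same pattern using torchvision's VGG-16: freeze the early convolutional layers and retrain only the later, task-specific ones. The ten-class output size is an arbitrary example, and the weights argument assumes a recent torchvision (older versions use pretrained=True instead).

```python
import torch.nn as nn
from torchvision import models

# Load VGG-16 with ImageNet weights (downloaded on first use).
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

# Freeze the convolutional feature extractor: generic edge and texture
# detectors that transfer well across vision tasks.
for param in vgg.features.parameters():
    param.requires_grad = False

# Replace the final classifier layer so it predicts our own classes,
# then train only this unfrozen part.
NUM_CLASSES = 10  # e.g. a small custom image-classification problem
vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, NUM_CLASSES)
```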