Word2vec is just one type of word embedding. Recently, several other related approaches have been proposed that are really promising. GloVe, or Global Vectors for Word Representation, is one such approach. It tries to directly optimize the vector representation of each word using just co-occurrence statistics, unlike word2vec, which sets up an ancillary prediction task.
First, the probability that word j appears in the context of word i is computed, P(j | i), for all word pairs (i, j) in a given corpus. What do we mean by j appears in the context of i? Simply that word j is present in the vicinity of word i, either right next to it or a few words away. We count all such occurrences of i and j in our text collection, and then normalize the counts to get a probability. Then, a random vector is initialized for each word. Actually, two vectors: one for the word when it is acting as a context, and one for when it is acting as the target. So far, so good. Now, for any pair of words (i, j), we want the dot product of their word vectors, w_i · w_j, to be equal to their co-occurrence probability. Using this as our goal and a suitable loss function, we can iteratively optimize these word vectors. The result should be a set of vectors that capture the similarities and differences between individual words. If you look at it from another point of view, we are essentially factorizing the co-occurrence probability matrix into two smaller matrices. This is the basic idea behind GloVe.
All that sounds good, but why co-occurrence probabilities? Consider two context words, say ice and steam, and two target words, solid and water. You would come across solid more often in the context of ice than in the context of steam, right? But water could occur in either context with roughly equal probability. At least, that's what we would expect. Surprise: that's exactly what co-occurrence probabilities reflect.
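To make the counting step concrete, here is a minimal sketch in Python. The six-sentence corpus is invented purely for illustration, and the function names `cooccurrence_counts` and `conditional_probs` as well as the window size are assumptions of this sketch, not part of GloVe itself. It counts how often word j falls within a few positions of word i, then normalizes each row of counts to get P(j | i).

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=3):
    """Count how often word j appears within `window` positions of word i."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        tokens = sentence.split()
        for pos, word_i in enumerate(tokens):
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:  # a word is not its own neighbor
                    counts[word_i][tokens[ctx_pos]] += 1.0
    return counts

def conditional_probs(counts):
    """Normalize each row of counts so that P(j | i) sums to 1 over j."""
    probs = {}
    for word_i, row in counts.items():
        total = sum(row.values())
        probs[word_i] = {word_j: n / total for word_j, n in row.items()}
    return probs

# Invented toy corpus (an assumption for illustration, not real data).
toy_corpus = [
    "ice is solid water",
    "ice feels solid cold",
    "ice melts into water",
    "steam is hot water",
    "steam becomes liquid water",
    "steam never stays solid",
]

probs = conditional_probs(cooccurrence_counts(toy_corpus))

for target in ("solid", "water"):
    p_ice = probs["ice"].get(target, 0.0)
    p_steam = probs["steam"].get(target, 0.0)
    print(f"P({target} | ice) = {p_ice:.3f}   P({target} | steam) = {p_steam:.3f}")
```

Even in this tiny made-up corpus, solid is more likely around ice than around steam, while water is equally likely around both. A real corpus would be far larger, and the counts are often weighted by distance, which this sketch does not do.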
Given a large corpus, you'll find that the ratio of P(solid | ice) to P(solid | steam) is much greater than one, while the ratio of P(water | ice) to P(water | steam) is close to one. Thus, we see that co-occurrence probabilities already exhibit some of the properties we want to capture. In fact, one refinement over using raw probability values is to optimize for the ratio of probabilities. Now, there are a lot of subtleties here, not the least of which is the fact that the co-occurrence probability matrix is huge. At the same time, co-occurrence probability values are typically very low, so it makes sense to work with the log of these values. I encourage you to read the original paper that introduced GloVe to get a better understanding of this technique.
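As a rough illustration of both ideas, the probability ratio and working in log space, here is a small Python sketch. The count table is invented (it is not from a real corpus), and the names `prob_ratio`, `predict`, and `total_loss`, the vector dimension, and the learning rate are all assumptions of this sketch. It fits a context vector and a target vector per word, plus a bias each, so that their dot product approximates the log of the co-occurrence count, using plain stochastic gradient descent on a squared error. The original paper additionally weights each pair by a function of its count, which is left out here.

```python
import math
import random

random.seed(0)

# Invented toy co-occurrence counts X[i][j] (NOT from a real corpus).
# Rows are context words i, columns are target words j.
contexts = ["ice", "steam"]
targets = ["solid", "gas", "water"]
X = [[8.0, 1.0, 30.0],
     [1.0, 8.0, 30.0]]

def cond_prob(i, j):
    """P(j | i): the count for the pair divided by the row total for i."""
    return X[i][j] / sum(X[i])

def prob_ratio(j):
    """Ratio P(j | ice) / P(j | steam) for target column j."""
    return cond_prob(0, j) / cond_prob(1, j)

for j, word in enumerate(targets):
    print(f"P({word} | ice) / P({word} | steam) = {prob_ratio(j):.2f}")

# Two sets of vectors: one for context words, one for target words.
dim = 2
w = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in contexts]
c = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in targets]
b_w = [0.0] * len(contexts)  # bias per context word
b_c = [0.0] * len(targets)   # bias per target word

def predict(i, j):
    """Dot product of the two vectors plus both biases."""
    return sum(a * b for a, b in zip(w[i], c[j])) + b_w[i] + b_c[j]

def total_loss():
    """Sum of squared errors against the log counts."""
    return sum((predict(i, j) - math.log(X[i][j])) ** 2
               for i in range(len(contexts))
               for j in range(len(targets)))

initial_loss = total_loss()
lr = 0.05  # the factor 2 from the squared-error gradient is folded into lr
for epoch in range(3000):
    for i in range(len(contexts)):
        for j in range(len(targets)):
            err = predict(i, j) - math.log(X[i][j])
            for k in range(dim):
                w_ik, c_jk = w[i][k], c[j][k]  # keep old values for both updates
                w[i][k] -= lr * err * c_jk
                c[j][k] -= lr * err * w_ik
            b_w[i] -= lr * err
            b_c[j] -= lr * err
final_loss = total_loss()

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.6f}")
```

With these made-up counts, the ratio for solid is well above one, the ratio for gas is well below one, and the ratio for water is exactly one. Fitting log counts rather than raw probabilities keeps the targets in a manageable range, which is the practical reason for taking the log.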