1 – Implementing Word2Vec

Welcome back everyone. So this week you're going to be exploring embeddings and implementing the word2vec model to understand them. In this notebook you'll actually be using TensorFlow to implement the word2vec model. Here are some resources for you to read; you should check these out either beforehand or while you're working on this to get a better understanding of what's going on. There is a great conceptual overview of word2vec by Chris McCormick, and there are two papers from Mikolov and the others who worked on this architecture: the first is the original paper, and the second is a paper with a bunch of improvements, which you'll also be implementing in this notebook. Then there are two implementations, one of which is from the TensorFlow documentation.

When you're dealing with text and you split things up into words, you tend to have tens of thousands of different words in a large dataset. When you use these words as your input, you're typically going to one-hot encode them. That means you have these giant vectors that are something like 50,000 elements long, where only one element is set to one and all the others are set to zero. Then, when you do the matrix multiplication to get the values of the hidden layer, you're doing this massive multiplication against something like 50,000×300 weight parameters in a giant matrix. And the deal is, since only one of the inputs is set to one, you just get zeros out of most of these operations. It's completely computationally inefficient.

To solve this problem we can use what are called embeddings. Embeddings are basically just a shortcut for doing this matrix multiplication. Everything else is the same: you have a hidden layer, and you have a weight matrix from the inputs to your hidden layer. It's just that the way we actually get the values for the hidden layer is different. We skip the matrix multiplication entirely and just look up the values for the hidden layer from a table. We can do this because when you multiply a one-hot encoded vector by a matrix, you just get the row that corresponds to the element that was on. For instance, if the fourth element is 1 and all the rest are 0, then you only get the fourth row of the matrix, because all the other multiplications contribute zero. So since we know the fourth element is the one that's on, we can just grab the fourth row, and that is the result; we don't have to do the matrix multiplication at all.

So what we do is tokenize our words, meaning we convert them into integers. We still have this embedding weight matrix going from the input to the hidden layer, but now we call it a lookup table, because we literally just look up the row in this matrix that corresponds to our word, and that gives us the values of our hidden layer. The number of units in the hidden layer is called the embedding dimension, and you get a different vector, the size of your embedding dimension, for each of your words. In this way we completely skip the matrix multiplication.
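As a quick sanity check of that equivalence, here's a minimal sketch; the vocabulary size, embedding dimension, and word index are just made-up numbers for illustration:

```python
import numpy as np

vocab_size, embed_dim = 50000, 300          # made-up sizes matching the example above
W = np.random.randn(vocab_size, embed_dim)  # embedding weight matrix, i.e. the lookup table

word_idx = 3                                # "the fourth element is 1"
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1

hidden_via_matmul = one_hot @ W             # the full, wasteful matrix multiplication
hidden_via_lookup = W[word_idx]             # just grab that row instead

assert np.allclose(hidden_via_matmul, hidden_via_lookup)
```

In TensorFlow, that row lookup is what tf.nn.embedding_lookup (or an embedding layer) does for you, so the one-hot multiplication never actually happens.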
All we're doing is saying this word "heart" is now the integer 958, and since the matrix multiplication would just give us that row, we simply look up the 958th row and use it as our hidden layer values. So there's really nothing magical going on with embeddings. The embedding lookup table is just a weight matrix, the embedding layer is a hidden layer, and the lookup is just a shortcut for the matrix multiplication. And since this lookup table is just a weight matrix, it's trained like any other weight matrix you've used before, using backpropagation to learn better parameters that help you classify or predict whatever you're trying to do. Embeddings aren't only used for words either; you can basically use them anywhere you have a massive number of classes.

What we'll be doing in this notebook is looking at a particular type of model called word2vec that uses the embedding layer to get representations of words as vectors. Each of these vectors is going to represent a word, and it's actually going to capture the semantic meaning of that word. So word2vec is an algorithm that finds efficient representations by using these vectors from the embedding layer, and when it's properly trained, the vectors will contain semantic information. What we mean by that is that words that show up in similar contexts, like the colors black, white, and red, are going to have similar representations; the values of their vectors will be similar because they appear in similar contexts.

There are two architectures for implementing word2vec: CBOW, the Continuous Bag Of Words, and Skip-gram. Skip-gram is the one we'll be looking at because it has been shown to work better than CBOW. The idea here is that you have some target word at some step t in your dataset, and then you look at the words around that target: t-2, which is two words before it, t+1, which is the word right after it, and so on. This represents the context that the word shows up in. In CBOW, your input is the context around your word and you're trying to predict the word itself: you look at some window in your text, use it as the context, and try to predict the word in the middle of that window. In Skip-gram that's inverted: you pass in some word and try to predict the words that show up around it in the text, so you're trying to predict what context the word appears in. In this way you're going to be training the network to understand the context and the meaning of words. Again, words like black, white, and red show up in similar contexts, with similar words surrounding them, so passing in black, white, or red will predict the same kinds of context words, and those colors will end up with similar projections, that is, similar embedding vectors.
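To make the Skip-gram setup concrete, here's a minimal sketch of building (target, context) training pairs from a list of words. The get_context helper and the window size are my own illustration rather than the notebook's code, although randomly shrinking the window, so that nearer words are sampled more often, is the trick described in the Mikolov papers. In the notebook the words will already be integers, but the idea is the same:

```python
import random

def get_context(words, idx, window_size=5):
    """Grab the words in a randomly sized window around the target at position idx."""
    R = random.randint(1, window_size)   # random radius: nearer words get sampled more often
    start = max(0, idx - R)
    return words[start:idx] + words[idx + 1:idx + R + 1]

# each Skip-gram training example is (target word, one word from its context)
text = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
pairs = [(text[i], context_word)
         for i in range(len(text))
         for context_word in get_context(text, i, window_size=2)]
```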
OK, so let's get started building this. First we just import the packages we're using, and then here we're using the text8 dataset. This is a bunch of Wikipedia articles that have been cleaned up, and they're nice to use for this purpose. In general, for training word2vec you want a really big dataset of text, so a lot of the time that's Wikipedia articles, news articles, books, that sort of thing. This has been done for a whole bunch of different languages, so you can get these nice representations of words in many different languages and just use them as the embedding layer in whatever model you're building. But here I'm going to show you how to actually train up one of these embedding layers yourself, so that you can use it in your future models.

Here we're doing a little bit of preprocessing to clean up the punctuation, turning punctuation marks into tokens. For example, a period is changed to the word "period" with some brackets around it. The reason we want to do this is that you want to represent everything as words, and when you have something like "tokens," with a comma attached, you want "tokens" with a comma and "tokens" without a comma to be represented as the same word. So basically you split off the comma and replace it with the word "comma", and this will also help you with other problems such as generating new text. Here I'm also removing words that show up five or fewer times. These are really rare words that mostly just add noise to the data, and they harm the quality of your vector representations because they don't show up very often, which confuses the training. All of this is in the utils module that I wrote, so check that out.

Then here I'm creating two dictionaries that convert words to integers and integers back to words, and using them to convert all of our words, which are just text, into integers, so now we have int_words. Basically this is just a long list of all the words in our dataset, but they've been converted to integers so we can actually pass them into our network. Remember, when we do our lookups we pass in integers, not the words themselves. So I've basically tokenized all the words, and now when we build our network and do the embedding lookup, we pass in integers and do the lookup with those integers.
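To give a feel for what that preprocessing involves, here's an illustrative sketch; it is not the actual utils code, and the token names, file path, and handful of punctuation marks shown are placeholders:

```python
from collections import Counter

def preprocess(text):
    """Rough sketch: turn punctuation into word tokens, then drop rare words."""
    text = text.lower()
    text = text.replace('.', ' <PERIOD> ').replace(',', ' <COMMA> ')
    text = text.replace('?', ' <QUESTION_MARK> ').replace('"', ' <QUOTATION_MARK> ')
    words = text.split()
    counts = Counter(words)
    return [word for word in words if counts[word] > 5]  # keep words seen more than five times

words = preprocess(open('data/text8').read())            # placeholder path to the dataset
vocab_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_vocab = {i: word for word, i in vocab_to_int.items()}
int_words = [vocab_to_int[word] for word in words]       # the tokenized dataset
```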
OK, so now the first thing for you to do. There are a lot of words that show up in text that don't carry much meaning and don't provide a lot of context for the words around them: words like "the", "of", "and", and "for" show up a lot, but they don't really tell you much about the words they appear next to. What we can do is get rid of some of these really frequent words. This removes noise from our data, improves the training rate, and gives us better representations in the end. This process of discarding frequent words is called subsampling. The way we do this is to calculate each word's frequency and, from that, a probability of dropping the word: the probability of dropping word w is 1 - sqrt(t / f(w)), where f(w) is the word's frequency in the dataset and t is a threshold parameter that lets us set how frequent a word needs to be before we start dropping it with some probability. It's important to note that this is a probability: as you scan through the data, each occurrence of a word has some chance of being dropped or kept, and you want that to be uniformly random over the entire dataset, so that when we pass in batches, a word is just as likely to be dropped in the first batch as in the last batch.

So I'm going to leave it up to you here to implement subsampling for the words in int_words. This is more of a programming challenge than actual deep learning, but it's really important for you to get experience preparing your data; it's something you'll have to do often in deep learning, and in machine learning in general. The idea is to go through int_words, discard each word with the probability given above, and assign your new subsampled data to train_words. Go work on this; I have a solution notebook if you want to see how I did it, and I also have a solution video that you can check out after this. Cheers!
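For reference, once you've made your own attempt: here is one rough sketch of how that drop probability could be turned into code. The threshold value 1e-5 is just a commonly used choice, not necessarily what the solution notebook uses:

```python
import random
from collections import Counter

threshold = 1e-5                                   # the t parameter
counts = Counter(int_words)
total_count = len(int_words)

freqs = {word: count / total_count for word, count in counts.items()}     # f(w)
p_drop = {word: 1 - (threshold / freqs[word]) ** 0.5 for word in counts}  # 1 - sqrt(t / f(w))

# keep each occurrence of a word with probability 1 - p_drop[word]
train_words = [word for word in int_words if random.random() > p_drop[word]]
```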
