4 – 4 Data Subsampling V1

Okay, let’s get started with implementing the skip-gram word2vec model. The first thing you want to do is load in the necessary data. In this example, I’m using a large body of text that was scraped from Wikipedia articles by Matt Mahoney. If you’re working locally, you’ll need to click this link to download the data as a zip file, move it into the data directory, and unzip it. You should then be left with a file just called text8 in the data directory. I’ve already put that data in the data directory, and here, I’m loading that file in by name and printing out the first 100 characters. It looks like the first section of text is about anarchism and the working class. So I loaded that in correctly, and next I want to do some preprocessing. Essentially, I want to break this text up into a giant list of words so that I can build up a vocabulary.
I’m going to do that using a function in the provided utils.py file called preprocess. Let’s take a look at this code. So here’s our utils.py file and our preprocess function. This function takes in some text, and you can see that it does a few things. First, in all of these lines, it converts any punctuation into tokens. So a period is changed to a bracketed period token and so on. Next, we see that it stores the number of times each word appears in the text using a Counter. A Counter is a collection that behaves like a dictionary mapping words to their frequency of occurrence. Here, we’re creating a list of trimmed words, which cuts all words that show up five or fewer times in this dataset. This should greatly reduce issues due to noise in the data, and it should improve the quality of the vector representations. Then, finally, it returns those trimmed words. So back in our notebook, I’m going to say words = utils.preprocess(text), and I’ll print out the first 30 trimmed words. This may take a few moments to run since our text data is quite big.
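The preprocessing steps described above can be sketched roughly as follows. This is a simplified illustration, not the actual contents of utils.py: the name preprocess_sketch, the min_count parameter, and the particular punctuation tokens are assumptions made for the example.

```python
from collections import Counter

def preprocess_sketch(text, min_count=5):
    """Simplified stand-in for the preprocess function described above."""
    text = text.lower()
    # Replace punctuation with bracketed tokens (only a few marks shown here).
    for mark, token in ((".", " <PERIOD> "), (",", " <COMMA> "), ("?", " <QUESTION_MARK> ")):
        text = text.replace(mark, token)
    words = text.split()
    # Count how often each word appears.
    counts = Counter(words)
    # Keep only words that appear more than min_count times.
    return [w for w in words if counts[w] > min_count]

# Toy text: "the", "cat", "sat" and "." each appear 6 times; the rest appear once.
sample = "the cat sat. " * 6 + "a rare word"
trimmed = preprocess_sketch(sample)
print(trimmed[:4])  # ['the', 'cat', 'sat', '<PERIOD>']
```

On this toy text, the three words that appear only once are cut, and the period survives as its own token.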
Then you should see an output like this. It’s pretty much the same text that we saw above, only the words are split into a list. Here, I’m going to print out some statistics about this data. I’m printing out the length of the text, so a word count of our data, and I’ll print out the number of unique words. To get the number of unique words, I’m using the built-in Python data type set, which, if you recall from the last lesson, will get rid of any duplicate words. So, we have a set of only unique words in this text. You can see that we have over 16 million words in this text and over 60 thousand unique words, and these numbers will be useful to keep in mind as we continue processing. Next, I’m creating two dictionaries to convert words to integers and back again, integers to words. This is our usual tokenization step. This is again done with a function in the utils.py file, create_lookup_tables. So, let’s take a look at what this function is doing. This function takes in a list of words in a text, and it returns two dictionaries that map from our vocabulary to integer values and back. You may notice an interesting use of Counter here. First, this is creating a sorted vocabulary. So this is a list of words from most to least frequent according to the word counts returned by Counter. Then integers are assigned in descending frequency order. So the most frequent word, like ‘the’, is given the integer 0, the next most frequent is 1, and so on. So in our notebook, this function returns our two dictionaries. Once we have those, the words are converted to integers and stored in the list int_words. I’ll print out the first 30 tokenized words here just to check that they make sense. If we compare these values with our list of words above, we’ll be able to see that ‘the’ and ‘of’ are some of the most common words in our dictionary. We can see that ‘the’ is tokenized as the integer 0, and it looks like ‘of’ is the next most frequent word, tokenized as 1.
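As a rough sketch of the lookup-table step just described, the function below builds a frequency-sorted vocabulary and assigns integers in that order. The function name and the toy sentence are assumptions for illustration, and the real utils.py function may differ in its details.

```python
from collections import Counter

def create_lookup_tables_sketch(words):
    """Return (vocab_to_int, int_to_vocab), with 0 assigned to the most frequent word."""
    word_counts = Counter(words)
    # Sort the vocabulary from most to least frequent.
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

words = "the cat and the dog and the bird".split()
vocab_to_int, int_to_vocab = create_lookup_tables_sketch(words)

# Convert the text to integer tokens.
int_words = [vocab_to_int[w] for w in words]
print(int_words)  # [0, 2, 1, 0, 3, 1, 0, 4]
```

Here ‘the’ appears most often and gets token 0, ‘and’ is next with token 1, and the words that appear once get the remaining integers.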
We have over 60,000 words in our vocabulary, so all of these token values should be integers in that range. Now, our goal is to implement word2vec, which relies on looking at the context around a word of interest. We want to define our context very carefully, basically looking at a window of the most relevant words around a word of interest. There are some words that are almost never going to be relevant because they’re so common, words that show up anywhere and really often, such as ‘the’, ‘of’, and ‘for’. These don’t provide much context to other nearby words. So if we discard some of these common words, we can remove some noise from our data and, in return, get faster training and better vector representations. This process is called subsampling. This will be your first task. Subsampling works like this. For each word w_i in our training set, you want to discard it with a probability given by this equation: P(w_i) = 1 - sqrt(t / f(w_i)). That is, the probability of discarding a word is equal to 1 minus the square root of t over that word’s frequency, where t is a threshold value that we set. So say we’re thinking of discarding the word ‘the’, word index 0. Let’s say it occurs one million times in our 16-million-word dataset. These are approximations, but this is one million over 16 million here, the frequency of occurrence. The numerator is a threshold I’ve set, which is 1 times 10 to the negative fifth. So, if I just run these values through our equation, I’m going to get a probability of getting rid of this word 98.7 percent of the time. Even after discarding the majority of the ‘the’s in our text, we’ll still have over 12,000 of the original one million ‘the’s left. The idea with subsampling is really to just get rid of a lot of these frequently occurring words so that they’re not always affecting the context of other words, while simultaneously keeping enough examples to learn a word embedding for that word.
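The arithmetic in this example can be checked with a few lines of Python. The counts are the rounded figures from the walkthrough (one million occurrences of ‘the’ in a 16-million-word text), and the variable names are chosen only for this illustration.

```python
import math

threshold = 1e-5            # t, the threshold value
count_the = 1_000_000       # approximate occurrences of 'the'
total_words = 16_000_000    # approximate size of the dataset

# Frequency of the word, then the discard probability 1 - sqrt(t / f).
freq = count_the / total_words
p_drop = 1 - math.sqrt(threshold / freq)

# Expected number of occurrences that survive subsampling.
expected_kept = count_the * (1 - p_drop)

print(f"discard probability: {p_drop:.3f}")             # about 0.987
print(f"expected occurrences kept: {expected_kept:.0f}")  # a little over 12,000
```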
So, the subsampling equation says that the probability that we discard a word is going to be higher if that word’s frequency is higher. Here I’ve provided some code to start you out: a threshold and a dictionary of word counts. This uses the Counter collection, which takes in our list of encoded words and returns how many times each one appears in that list, and I can print out the first key-value pair in this dictionary. So here, I can see that the word token 5233 appears 303 times in our text. I want you to use this information to calculate the discard probability for each word in our vocabulary, then use that to create a new set of data, train_words, which will basically be our original list of int_words, only with some of our most frequent words discarded. This is more of a programming challenge than a deep learning task, but preparing data is an important skill to have, so try to solve this task, and next, I’ll show you my solution.
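If you want something to compare your attempt against, here is one possible sketch of the task on a toy list of tokens. It is not necessarily the solution shown next. The function name subsample, the seed parameter, and the toy data are assumptions for illustration, and the threshold is set much larger than 1e-5 only because the toy list is tiny.

```python
import random
from collections import Counter

def subsample(int_words, threshold=1e-5, seed=None):
    """Drop each token with probability 1 - sqrt(threshold / frequency)."""
    rng = random.Random(seed)
    word_counts = Counter(int_words)
    total_count = len(int_words)
    # Discard probability per token; clamp at 0 so rare words are always kept.
    p_drop = {
        word: max(0.0, 1 - (threshold / (count / total_count)) ** 0.5)
        for word, count in word_counts.items()
    }
    # Keep a token with probability 1 - p_drop[token].
    return [w for w in int_words if rng.random() < 1 - p_drop[w]]

# Toy data: token 0 is very common, token 1 less so, token 2 is rare.
int_words = [0] * 900 + [1] * 95 + [2] * 5
train_words = subsample(int_words, threshold=0.01, seed=0)

print(len(int_words), len(train_words))
print(Counter(train_words))
```

With these toy numbers, token 0 has a discard probability of roughly 0.89, while token 2 has a frequency below the threshold and is always kept.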
