5 – 5 Subsampling Solution V1

Here is my solution for creating a new list of train words. First, I calculated the frequency of occurrence for each word in our vocabulary. I stored the total length of our text in a variable, total_count. Then, I created a dictionary of frequencies. For each word token and count in the word counter dictionary that was given, I added an item to this dictionary, where the key was the word and the value was the count of that word over the total number of words in our text, that is, its frequency.

Then, I calculated the discard probability as p_drop. This is another dictionary that maps each word to its drop probability. Here, I'm just using the subsampling equation to get that, which is 1 minus the square root of our threshold over that word's frequency.

Finally, I created a new list of train words. For each word in our list of int_words, I keep that word with some probability. So, I generated a random value between zero and one, and I checked if that value was less than 1 minus the drop probability for that word. This is saying, okay, I want to keep this word with a probability of 1 minus p_drop. So, if I have a drop probability of 0.98, then the keep probability is 1 minus this p_drop, which will be 0.02. If I generate a value less than 0.02, which is unlikely, only then will I keep this word in my list of train words. There are other ways to solve this problem, but I like to frame this as a "which words do I keep" task.

Then I'm printing out the first 30 words of this train data. This should look similar to the first 30 tokens in our int_words list, only you'll notice that most of the zeros and ones are gone. These were our most common words from before, so this is looking as I expect, and I can move on to the next step, which will be defining a context window and batching our data.
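The steps described above can be sketched in Python roughly as follows. This is a minimal sketch, not the exact notebook code: the function name subsample, the rng parameter, and the default threshold of 1e-5 are assumptions added for illustration, while total_count, freqs, p_drop, int_words and train_words follow the names used in the walkthrough.

```python
import random
from collections import Counter


def subsample(int_words, threshold=1e-5, rng=None):
    """Return a new list of words, dropping frequent words at random.

    `threshold` and `rng` are illustrative parameters (assumptions);
    the transcript only refers to "our threshold".
    """
    rng = rng or random.Random()

    # Count how often each word (here: an integer token) occurs.
    word_counts = Counter(int_words)
    total_count = len(int_words)

    # Frequency of each word: its count over the total number of words.
    freqs = {word: count / total_count for word, count in word_counts.items()}

    # Subsampling equation: p_drop = 1 - sqrt(threshold / frequency).
    # For words rarer than the threshold this is negative, so they are always kept.
    p_drop = {word: 1 - (threshold / freqs[word]) ** 0.5 for word in word_counts}

    # Keep each occurrence with probability 1 - p_drop.
    return [word for word in int_words if rng.random() < 1 - p_drop[word]]


if __name__ == "__main__":
    # Toy data: token 0 is very common, tokens 2..101 each appear once.
    int_words = [0] * 9000 + [1] * 900 + list(range(2, 102))
    train_words = subsample(int_words, threshold=1e-2, rng=random.Random(0))
    print(train_words[:30])
```

Framing the filter as "keep if a random value is below 1 - p_drop" matches the description above; drawing a value and dropping the word when it falls below p_drop would be an equivalent alternative.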
