2 – Subsampling Solution

Hi everyone, welcome back. Here are my solutions for the subsampling part. The plan is to calculate the drop probability for each word, then scan through the entire int_words dataset and drop the words that are too frequent.

The first thing I did was calculate the frequencies of all the words in the dataset. To do that, I used a Counter and passed in int_words, which counts up all the words in our dataset. Then we can use these counts to get the frequencies, that is, how often each word shows up in the total dataset. We just take each word's count and divide it by the total count, which is the total number of words in our dataset — the number of individual tokens, not the number of unique words. Here I'm using a dictionary comprehension to create a dictionary where we can pass in a word and get out its frequency. We get the words and counts from word_counts.items(), which iterates over all the keys and values in word_counts: the keys are the words and the values are the counts of those words. Dividing each count by the total count gives us the frequency of each word in our dataset.

Next, I find the probability to drop each word. We scan through all the words in our frequencies dictionary and compute one minus the square root of t (the threshold we define) divided by that word's frequency, so 1 − √(t / f(w)). Here I'm using another dictionary comprehension to create a dictionary where the keys are the words and the values are the drop probabilities.
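The two steps above might look something like the sketch below. The variable names (`int_words`, `word_counts`, `freqs`, `p_drop`, `threshold`) follow the lesson's description, but the toy data here is my own stand-in for the real tokenized dataset:

```python
from collections import Counter
import math

# Hypothetical stand-in for int_words: words already encoded as integers.
int_words = [0, 0, 0, 0, 1, 1, 2, 0, 1, 2]
threshold = 1e-5  # the subsampling threshold t

# Count how many times each word appears.
word_counts = Counter(int_words)
total_count = len(int_words)  # total tokens, not unique words

# Frequency of each word: its count divided by the total token count.
freqs = {word: count / total_count for word, count in word_counts.items()}

# Probability of dropping each word: 1 - sqrt(t / f(w)).
p_drop = {word: 1 - math.sqrt(threshold / freqs[word]) for word in word_counts}
```

Note that with this toy data the frequencies are huge compared to a real corpus, so the drop probabilities come out close to 1; on a real dataset most words are much rarer and their drop probability is negative or small, meaning they are always kept.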
So now that we have the frequencies and the probability of dropping each word, we can go through all the words in int_words and drop the ones that are too frequent. What I'm doing here is scanning through each word in int_words, looking up its probability to drop, and comparing that with a randomly generated number. What random.random does is sample from a uniform distribution between 0 and 1. The way to think about this: if the probability to drop a word is, say, 70%, so 0.7, then we generate a random number with random.random, and there's a 70% chance it will be less than 0.7 and a 30% chance it will be greater than 0.7. If the random number comes out greater than the drop probability — say we get 0.8, and 0.8 > 0.7 — then we keep the word. But that only happens 30% of the time, because we only get a number larger than the drop probability 30% of the time, so we end up keeping the word 30% of the time. It might be clearer if I write it like this: since p_drop is the probability of dropping a word, 1 − p_drop is the probability of keeping it. So if the probability to keep a word is, say, 60%, then 60% of the time the random number will land below that keep probability, and we keep the word. This is just a probabilistic way to use the drop probability to decide whether to keep or drop each word. And we want it to be probabilistic because we want the subsampling to be uniform across the entire dataset.
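That keep-or-drop pass can be written as a single list comprehension. This is a sketch with a made-up p_drop dictionary standing in for the one computed from the real frequencies; in the lesson, p_drop comes from the frequency calculation over int_words:

```python
import random

random.seed(0)  # fixed seed so this sketch is reproducible

# Hypothetical drop probabilities and data for illustration.
p_drop = {0: 0.9, 1: 0.5, 2: 0.0}
int_words = [0, 0, 0, 1, 1, 2, 0, 1, 2, 0]

# Keep a word when a uniform draw in [0, 1) falls below its keep
# probability, 1 - p_drop[word]. Frequent words (high p_drop) are
# mostly discarded; words with p_drop of 0 are always kept.
train_words = [word for word in int_words
               if random.random() < (1 - p_drop[word])]
```

Because random.random() always returns a value strictly less than 1, a word with a drop probability of 0 is kept every time, which is what we want for rare words.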
So in any batch that you feed in, you want the probability that a word like 'the' was dropped to be the same whether it's the first batch, some batch in the middle, or the batch at the end. Okay, see you in the next video.
