7 – 7 Batching Data Solution V1

Here’s how I’m defining the context targets around a given word index. First, according to the excerpt from the paper, I’m going to define a range R. R is going to be a random integer in the range one to C, the window size. randint takes in a range that is not inclusive of the last number, so that’s why I have a plus one here. Then I define the start and stop indices of my context window. The start will be a range of words in the past: the index of my word of interest minus my range R. This will only happen as long as that doesn’t get us to a negative index. If this operation does give us a negative value, then I just set my start index to the start of my list of words, index zero. Then my stop index is where my feature words end: my word of interest’s index plus our range R. Finally, I do not want my returned target context to include the word at the passed-in index. So, I’m defining my target words as the words behind my index of interest, from start to idx, plus the words in front, idx plus one to stop plus one. Then I’m returning these words as a list (see the get_target sketch below).

Then when I go to test this out on a test set of word tokens, and I can run this a couple of times, I see that I get a variable number of words around my passed-in index of five. I can see the target does not include my index of interest. These line up just because I’ve created some input data that’s the integers zero through nine. If you run the cell multiple times, you will see a different target based on a different randomly generated R. So, this looks good.

Right below this function, I’ve defined a generator function. This function will use the get_target function that we’ve just defined. get_batches takes in a list of word tokens, a batch_size, and a window_size. It makes sure that we can make complete batches of data. In this for loop, I’m iterating over our words one batch length at a time. I get a batch of words, then for each word in a batch I’m calling get_target. This should return a batch of target words in a window around the given batch word. I’m calling extend here so that each batch x and y will be one row of values. Here, I’m making x the same length as y by repeating each input word once per target word. Finally, it returns this list of input words x and target context words y using yield, which makes this a generator function (see the get_batches sketch below).

Then in the cell below, we can test this batching out to see what it looks like when applied to some fake data. So, I’m getting an x and y batch of data by calling next on our generator function. Here, I’ve passed in some int_text, a batch_size of four, and a window_size of five. When I run this cell, the output might look a little weird because everything’s been extended into one row. But I can see that I’ve made my desired four batches because I have four different input x values: zero, one, two, and three. If we take a look at the first input, zero, we see it’s repeated three times, so the target must have also been length three. The corresponding context is one, two, three: all the targets in the future window that surround the input index zero, which is what I expect. For the other input-output batches, I can see that I’m generating targets that surround the input values one, two, and three. So, we have our batch inputs and our target context. Now, we can get to defining and training a word2vec model on this batched data, which I’ll go over next.
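Since the code itself isn’t shown in this transcript, here is a minimal sketch of the get_target function as described above, assuming NumPy’s randint is used to draw the random range R (the names words, idx, and window_size follow the narration, not a verified source file):

```python
import numpy as np

def get_target(words, idx, window_size=5):
    """Return a list of context words in a random-sized window around idx."""
    # R is a random integer from 1 to window_size; randint's upper bound
    # is exclusive, hence the + 1
    R = np.random.randint(1, window_size + 1)
    # start of the window, clamped so we never index before the list start
    start = idx - R if (idx - R) > 0 else 0
    # end of the window: R words into the future
    stop = idx + R
    # words behind idx plus words in front of idx, skipping idx itself
    target_words = words[start:idx] + words[idx + 1:stop + 1]
    return list(target_words)
```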
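The quick check described in the narration, on integer tokens zero through nine with a passed-in index of five, might look like this; running it repeatedly shows windows of different sizes:

```python
int_text = [i for i in range(10)]
idx = 5

target = get_target(int_text, idx=idx, window_size=5)
print('Input:  ', int_text)
print('Target: ', target)  # e.g. [3, 4, 6, 7] for R = 2; never includes 5
```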
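And here is a sketch of the get_batches generator reconstructed from the narration: it trims the word list to complete batches, steps through it one batch length at a time, and extends x and y so each yielded pair is one flat row. This is an illustration of the described logic, not the original notebook code:

```python
def get_batches(words, batch_size, window_size=5):
    """Generate (inputs, targets) batches for skip-gram training."""
    n_batches = len(words) // batch_size
    # keep only enough words to make complete batches
    words = words[:n_batches * batch_size]

    # iterate over the words one batch length at a time
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx + batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            # targets in a window around the given batch word
            batch_y = get_target(batch, ii, window_size)
            # extend so each batch is one flat row of values
            y.extend(batch_y)
            # repeat the input word so x stays the same length as y
            x.extend([batch_x] * len(batch_y))
        yield x, y
```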
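The batching test from the cell below the function could then be reproduced roughly like this; the content and length of int_text here are hypothetical stand-ins for the fake data shown on screen:

```python
int_text = [i for i in range(20)]
x, y = next(get_batches(int_text, batch_size=4, window_size=5))

print('x\n', x)  # e.g. [0, 0, 0, 1, 1, 1, 2, ...] — each input repeated per target
print('y\n', y)  # the matching context words, flattened into one row
```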
