6 – 6 Defining Context Targets V1

Now that our data is in good shape, we need to get it into the proper form to pass into our network. With the skip-gram architecture, for each word in the text we want to define a surrounding context: all the words in a window of size C around that word. When I talk about a window, I mean a window in time, like two words in the past and two words in the future from our given input word. More generally than two words in the past and future, I'm going to say we want to define a window of size C.

Here, I have some text from the Mikolov paper on Word2Vec: "Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples. If we choose C = 5, for each training word we'll randomly select a number R in the range 1 to C, and then use R words from the history and R words from the future of the current word as correct labels."

So, this is saying that we don't want to choose too big a window, because too big a window will give us irrelevant context. In other words, good context words are usually the ones closest to the current word rather than farther away, and we want to include some randomness in how we define our context. If we define a context window of size C = 5, then we'll generate a value R that's a random integer between 1 and 5. Say we get R = 2 as an example; then we'll define the context around a given word to be the two words that appear right before and after our word of interest.

I have an example here. Say we're interested in the word at the second index in this list, 741. If we randomly generate R = 2, we'll be interested in the two tokens before and after this word. I want you to write a function that will return context words in a list like this. This will be the function get_target, which takes in a list of word IDs, an index of interest, and a context window size.

The effect of getting words within a random range R, instead of a consistent larger range C, is that you're more likely to get words that are right next to your current word, and less likely to get words that are farther away. So, what you're really doing is training more often on context words that are closer to your word of interest and likely more relevant.

So here, I've left this function for you to fill out. Now, there are some special cases. If the index that's passed in is zero, or your range cannot go as far back into the past as you want, then you can start your context at the start of the passed-in list of words. You can test out your implementation in the cell below. Next, we'll use this function to actually batch the data, so it's important that this is implemented correctly. As usual, if you're stuck or want to see my solution, check out my solution video next.
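To make the windowing and the edge cases concrete, here is a minimal sketch of how get_target might be implemented from the description above. The function name and inputs come from the exercise; the exact edge-case behavior (how the window is truncated at the start and end of the list) is my assumption, not the notebook's official solution.

```python
import random

def get_target(words, idx, window_size=5):
    """Return a list of context words around words[idx].

    Samples a window size R uniformly from [1, window_size], then
    grabs up to R words before and R words after the current word.
    """
    # Random window size: small windows come up as often as large ones,
    # so nearby words end up as targets more frequently overall.
    R = random.randint(1, window_size)

    # Clamp the start so we never index before the beginning of the list
    # (this handles idx == 0 and any idx closer to the start than R).
    start = max(0, idx - R)
    stop = idx + R  # slicing past the end of the list is safe in Python

    # Everything in the window except the current word itself.
    return words[start:idx] + words[idx + 1:stop + 1]
```

As a quick sanity check, something like get_target(list(range(10)), idx=5, window_size=5) should return between one and five words on each side of the value 5, and calling it with idx=0 should return only words from the future, since there is no history to draw from.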
