2 – Data Preprocessing

So now that we have all the words, we need to encode all of our reviews as integers. We're going to pass in the reviews where every word is an integer, and that's going to go into our embedding layer. So the first step, which I'm going to leave to you, is to create a dictionary that maps the vocabulary words to integers. One thing to know is that later we're going to be padding our input vectors with zeros. That means that when we convert our words to integers, they have to start at one. They can't start at zero, because then a real word would look the same as padding and the network's going to get confused. Once you have this mapping, go through that list of reviews and convert all the words into integers. So now we have all our reviews, but instead of words we have integers.

OK. Next, we need to change our labels from positive and negative into ones and zeros. If you open up the labels, each review is labeled with a string, either positive or negative. But we need those to be ones and zeros so that we can actually use them in our network and calculate the cost. So you can do that here, also.

If you built labels correctly, then what you should see when you run this cell is that there's actually one review that has no length. I'm not really sure why, I didn't look into it very much, but it's a problem because it's going to end up breaking things later. So we need to get rid of it. Secondly, there are some really long reviews in here. The longest one is 2,514 words long. That is much too long to put into our RNN, because it's going to take a really long time to train that many steps in our current network. So what we're going to do instead is truncate every review down to 200 steps, or 200 words. For reviews shorter than 200 words, we can just pad to the left with zeros.
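The encoding steps described so far (word-to-integer dictionary, integer reviews, numeric labels, review lengths) can be sketched roughly as follows. This is a minimal sketch, not the notebook's solution: the toy `reviews` and `labels_text` lists and the names `vocab_to_int`, `reviews_ints`, and `review_lens` are assumptions made for illustration, as is the choice of positive as 1 and negative as 0.

```python
from collections import Counter

# Toy stand-ins for the notebook's data (hypothetical values).
reviews = ["best movie ever", "worst movie ever made", "best acting ever"]
labels_text = ["positive", "negative", "positive"]

# Count every word across all reviews, then order the vocabulary
# from most to least frequent.
counts = Counter(word for review in reviews for word in review.split())
vocab = sorted(counts, key=counts.get, reverse=True)

# Start numbering at 1 so that 0 stays free for padding.
vocab_to_int = {word: i for i, word in enumerate(vocab, start=1)}

# Replace each word in each review with its integer.
reviews_ints = [[vocab_to_int[word] for word in review.split()]
                for review in reviews]

# Assumed convention: positive -> 1, negative -> 0.
labels = [1 if label == "positive" else 0 for label in labels_text]

# Tally of review lengths: a key of 0 would reveal an empty review,
# and max(review_lens) gives the longest review.
review_lens = Counter(len(review) for review in reviews_ints)
```

Ordering the vocabulary by frequency is a design choice of this sketch, not a requirement; any mapping works as long as no word is assigned 0.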
So for instance, you can kind of see down here, if the review is like "best movie ever", then in numbers that would be 117, 18, and 128. Then this row would look like zero, zero, zero, zero, for 197 zeros, and then the last three numbers are the integers corresponding to the words. Like that.

But the first thing I want you to do is remove the review that has zero length. So you can do that here. Next, create this array, features, that contains the data we'll pass to the network. There's going to be one row for each review, and 200 steps, or 200 words, in each row. Like I was saying before, if a review is shorter than 200 words, pad it to the left with zeros. And if it's more than 200 words, just use the first 200 words as the feature vector.

If you built features correctly, it should look something like this. For the first review, we just get a bunch of zeros, and then here is our review. And this is what it looks like for a review that had more than 200 words: this 200-element vector is completely filled out with no padding.
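Dropping the empty review and building the features rows might look something like the sketch below. It uses plain Python lists so it runs without extra libraries; in the notebook, features would presumably be a NumPy array. The helper name `pad_features`, the toy data, and the step of dropping the matching label together with the empty review are assumptions of this sketch.

```python
seq_len = 200  # number of steps (words) kept per review


def pad_features(reviews_ints, seq_len):
    """Left-pad short reviews with zeros; keep the first seq_len words of long ones."""
    features = []
    for review in reviews_ints:
        truncated = review[:seq_len]                 # first 200 words at most
        padding = [0] * (seq_len - len(truncated))   # zeros go on the left
        features.append(padding + truncated)
    return features


# Toy data: "best movie ever", an empty review, and a 250-word review.
reviews_ints = [[117, 18, 128], [], list(range(1, 251))]
labels = [1, 0, 1]

# Remove zero-length reviews, dropping their labels too so the two
# lists stay aligned (an assumption; the transcript only mentions the review).
kept = [(review, label) for review, label in zip(reviews_ints, labels)
        if len(review) > 0]
reviews_ints = [review for review, _ in kept]
labels = [label for _, label in kept]

features = pad_features(reviews_ints, seq_len)

print(features[0][-5:])   # last five entries of the short review's row
print(len(features[1]))   # every row has seq_len entries
```

With this toy data, the first row is 197 zeros followed by 117, 18, 128, and the second row is the first 200 integers of the long review with no padding.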
