5 – Mini Project 2 Solution

Right, so in this project we're going to create our input and output data. For input data, we're going to count all the words that occur in a review, and then we're going to put them into a fixed-length vector, where each place in the vector is for one of the words in our vocabulary. So the first thing we do is count our vocabulary. Looks like we have just over 74,000 words.

Now we're going to create our empty vector. It's generally good practice to pre-allocate this vector as something that's empty and then edit it as you go, because one of the expensive things you can do in computer science is allocate new memory. So we don't want to have to create this vector from scratch every time we use it. We're going to create an empty one, and then we're going to create a function that modifies this vector with the proper counts.

So, the first thing we need to do is decide which place in this vector goes to each word, and create a variable that allows us to look that up. Now, it doesn't really matter which place we put a word in. It's like, "horrible" can be down here, or it could be up there, as long as we stick with whatever we choose, right? So I'm going to create a dictionary that lets us look up, for every word in our vocabulary, the place it has in that vocabulary.

And then we're going to create our function. So the layer here is a global variable. We're going to clear out the old counts. Then we're going to iterate through each word in our review and increment the position in the vector that belongs to that word, so that there's a count for each one. Then we tried it out on the first review, review 0, right? And it looks like it worked. Actually one of the words, presumably the empty one from when I tokenized it, happened 18 times. How about that? And get_target_for_label seems to work: label(0) was positive, and label(1) I think was negative.
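
As a rough sketch of the steps just described (this is not the course notebook's code), the whole thing might look something like the block below. The two toy reviews, the names reviews, labels, vocab, word2index, layer_0 and update_input_layer, and the choice of 1 for positive and 0 for negative are all assumptions made for illustration. A plain Python list stands in for whatever array type the notebook actually uses.

```python
# Toy stand-ins for the real reviews and labels (hypothetical data).
reviews = ["this movie was great great fun", "this movie was horrible"]
labels = ["POSITIVE", "NEGATIVE"]

# Build the vocabulary: every distinct token across all reviews.
vocab = sorted({word for review in reviews for word in review.split(" ")})
vocab_size = len(vocab)

# Fix one position per word. The order is arbitrary, but it must not change.
word2index = {word: i for i, word in enumerate(vocab)}

# Pre-allocate the input vector once and reuse it for every review.
layer_0 = [0] * vocab_size


def update_input_layer(review):
    """Overwrite layer_0 in place with the word counts of one review."""
    # Clear out the old counts without allocating a new list.
    for i in range(vocab_size):
        layer_0[i] = 0
    # Increment the slot that belongs to each word in the review.
    for word in review.split(" "):
        layer_0[word2index[word]] += 1


def get_target_for_label(label):
    """Assumed encoding: 1 for a positive review, 0 for a negative one."""
    return 1 if label == "POSITIVE" else 0


update_input_layer(reviews[0])
print(layer_0, get_target_for_label(labels[0]))
```

Splitting on a single space, as this sketch does, yields an empty-string token wherever a review contains two spaces in a row, which would be one way to end up with an "empty" word that gets counted many times.
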
So yeah, it looks like it's working great. So this will work great for us. This is our input and output dataset, and I hope that yours created variables that look a lot like this. The nice thing about what we're doing here, and I guess the thing to take away, is mostly this efficiency piece, right? So when you're creating these vectors, try not to allocate completely new vectors for your data.

The second thing is that we're also not pre-generating the entire dataset, right? Because that would be a matrix that is 74,000 by, how many training examples? 25,000. So 74,000 vocab_size x 25,000, that would be, man [LAUGH], around 2 billion integers. Which is just, that's a lot of stuff to store on your machine when, in reality, we can populate each vector pretty easily. Most of the entries are zeros, and they're pretty quick to generate when we need them. So this is just generally good practice for creating your dataset without filling up the RAM on your laptop.

So, that's our input and output dataset. Those are the things to watch out for: don't allocate too much memory at once, and don't create new variables all the time. These are the forms that we're going to use in our neural net, right? So in the next section we're going to be talking about how we're going to put this together into our neural network. See you there.
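
To put rough numbers on that "around 2 billion integers" estimate, here is a quick back-of-the-envelope calculation. The 74,000 and 25,000 figures come from the walkthrough above; the 8 bytes per integer is an assumption (64-bit integers), so the memory figures are only indicative.

```python
vocab_size = 74_000    # approximate vocabulary size quoted above
num_reviews = 25_000   # number of training examples quoted above
bytes_per_int = 8      # assumption: each count is stored as a 64-bit integer

# Pre-generating everything: one row of vocab_size counts per review.
dense_entries = vocab_size * num_reviews
dense_gib = dense_entries * bytes_per_int / 2**30

# Reusing a single pre-allocated vector instead.
single_vector_kib = vocab_size * bytes_per_int / 2**10

print(f"full matrix: {dense_entries:,} integers, about {dense_gib:.1f} GiB")
print(f"one reused vector: about {single_vector_kib:.0f} KiB")
```

Under that assumption the full matrix would need well over ten gibibytes, while the single reused vector stays under a mebibyte.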
