6 – Sentiment RNN 2

Welcome back. So now I'm going to go through my solutions to the exercises I had you do before. You might have come here because you were having difficulties implementing that stuff, or maybe you just want to see how I did it. You probably ended up doing it differently than me, which is totally fine. This is programming, so there are thousands of different ways to do these things. This is just the way that I did it.

The first thing I did was encode the words with integers. What I like to do is assign the integers in order of frequency, so the highest frequency word is going to be 1, the next one is going to be 2, and so on. There's no strong reason to do that here, but it helped a lot when I was working on the Word2Vec model, so I just like doing it that way. If you didn't do that, that's fine. Here I used a Counter to count up the frequency of all the words, and then I sorted it to get my vocab in order of word frequency.

Next I'm building my vocab-to-integer dictionary using a dictionary comprehension. I'm using enumerate on the vocabulary, starting at 1, to get the integer and the word, and I'm setting the word as the key and the integer as the value. Like I said before, we want to start at 1 here, not 0, because we're padding our input vectors with 0s later. If some word were encoded as 0, the network couldn't tell that word apart from the padding. And since the most frequent word gets the lowest integer, that word would probably be "the," which I think is usually the most frequent word, so all your padding would look like the word "the."

Here is where I converted my reviews from words into ints. Again, I'm using a comprehension, this time a list comprehension. I take each review from reviews and split it into words, then for each of those words I convert it to an integer, building a new list from those integers. I append each of those lists to reviews_ints, so I end up with a list of lists, where each inner list is a review that has been converted from words into integers.

Here is where I encoded the labels. This is where we change our labels from positive and negative to 1 and 0, respectively. This was pretty straightforward: just another list comprehension. I use a lot of list comprehensions because I really like them. I grab each label in labels and do a test: make it 1 if the label is positive, else make it 0. That loops through the whole thing and builds a new list, which I then change into an array. I typically like to keep my data as NumPy arrays rather than lists because it makes things easier to work with. Basically, any time you're working with numbers, it's going to be better to use a NumPy array than a normal Python list.

Finally, I'm filtering out the review with zero length. I was kind of lazy and didn't want to look through the entire list to find which one has zero length, so I just looped through the whole thing with, again, a list comprehension. I love comprehensions. The test checks that the length of each review is greater than 0; if it is, the review gets added to the final list, and otherwise it doesn't. That way you loop through the whole thing once and make sure everything has a length greater than 0. A sketch of all these preprocessing steps is below.
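Here's a minimal sketch of those preprocessing steps, assuming `reviews` is a list of review strings and `labels_raw` is a matching list of "positive"/"negative" strings (those two names are mine, not necessarily the notebook's):

```python
from collections import Counter

import numpy as np

# Count word frequencies across all reviews, then sort the vocab
# from most frequent to least frequent.
all_words = ' '.join(reviews).split()
counts = Counter(all_words)
vocab = sorted(counts, key=counts.get, reverse=True)

# Map each word to an integer, starting at 1 so 0 stays free for padding.
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}

# Convert each review from a string of words to a list of integers.
reviews_ints = [[vocab_to_int[word] for word in review.split()]
                for review in reviews]

# Encode the labels: 1 for positive, 0 for negative, as a NumPy array.
labels = np.array([1 if label == 'positive' else 0 for label in labels_raw])

# Keep only reviews with non-zero length (the matching labels have to
# be dropped too, which the walkthrough doesn't mention explicitly).
keep = [i for i, review in enumerate(reviews_ints) if len(review) > 0]
reviews_ints = [reviews_ints[i] for i in keep]
labels = labels[keep]
```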
So here I built my features array. This is actually pretty difficult, I think. It's definitely not trivial, and there are a bunch of ways to do it, so if you had problems with this, it's totally fine. Hey, you're learning, that's good. If it's difficult for you, that means you're learning, which is exactly what we want you to do.

The way I did this is I first initialized an array of all 0s. Like I was saying, there's one row for each review, and each row is going to be 200 sequence steps long, so our sequence length is 200. That means our features array of 0s is going to be the number of reviews by the sequence length, so however many reviews we have by 200, and we want these to be integers. Once I have the 0s, I fill them in. This is a pretty easy way to do it, because instead of having to left-pad things myself, I already have all my zeros; I just fill in the elements I need to.

The way I'm doing that is I'm looping through my reviews_ints with enumerate, using the index to pick out which row of features I'm on. I grab row i from features, and then I index it with the negative length of the review. So if the length of your review is less than 200, say it's 100, I'm only assigning into the last 100 elements of that row. That takes care of the situation where you have fewer than 200 words in your review. Then I'm converting each review into an array and taking just the first 200 words. So this one line takes care of both problems we have. One problem is, if we have a review that's more than 200 words, how do we get the first 200 words? That's what the slicing does. And if we have a review that's less than 200 words, how do we put it at the end of the row? That's what the negative indexing does. It's a really simple way to do this, and I'm sure you probably found a different way, which is, of course, totally fine; this is how I did it. I've been using NumPy for a really long time, so I know how to do a lot of things like this just from experience. And that's the thing with programming in general: as you do this more and more, you learn all these tricks that make things a lot faster and a lot more intuitive for you.

Now, this is my solution for the training, validation, and test sets, and it's pretty much my general approach every time I make these sets. I define my split fraction, then I find the index where I'm going to split my data. You need this as an integer because an index you pass into an array has to be an integer. For my training x, I take my features up to the split index, and for my validation x, it's from the split index to the end. Then I do the same exact thing for my training y and validation y, just using labels instead of features. And I do basically the same thing for my test sets, except I'm taking half: I look at the length of the validation set and take half of it, and that's my new split index. I take my validation data and keep the first half for validation and the second half for testing. A sketch of the features array and the splits is below.
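Continuing from the variables above, here's a sketch of the features array and the splits; the split fraction of 0.8 is an assumption, since the transcript doesn't give the exact value:

```python
seq_len = 200

# One row per review, seq_len columns, initialized to all zeros so the
# zeros double as left padding.
features = np.zeros((len(reviews_ints), seq_len), dtype=int)

# Fill each row from the right: short reviews keep their leading zeros,
# and long reviews are truncated to their first seq_len words.
for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]

# Split into training and validation sets at an integer index.
split_frac = 0.8  # assumed value
split_idx = int(len(features) * split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

# Split what's left in half: first half validation, second half test.
test_idx = int(len(val_x) * 0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]
```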
All right. So here is how I created my placeholders for the inputs, the labels, and the keep probability. This is pretty typical. I just used tf.placeholder; the inputs are going to be integers, and I left the batch size and the sequence length variable, which is usually fine. For the labels, even though we're really only passing in one label at a time, either 1 or 0, there are some functions later in the network that require the labels to be two dimensional, so I made the placeholder two dimensional here. This is one of those things where at first I did it as one dimensional, because in my head it only needed to be 1D, and then later, as I was building the network, I found a spot where a function wanted it to be two dimensional, so I had to come back up here and recreate the labels placeholder with two dimensions. There's a lot of stuff like that in TensorFlow: you just run through it, and at some point you find that a thing you defined earlier isn't going to work because of shapes or something. It's the kind of thing you find out by building TensorFlow graphs.

So this is how I created the embedding layer. As with the Word2Vec network, you create your embedding matrix as a variable; here I'm using a random uniform distribution. The size is the number of words you have by your embedding size, and here I used 300. So the output from the embedding layer is going to be 300 units long, a 300-length vector, for every input coming in, and in this case we actually have multiple inputs, up to 200 words, coming in. To actually do the lookup, we have tf.nn.embedding_lookup: we just pass in our embedding matrix and our inputs, and we get our embedded vectors.

And this is how I created our LSTM cells. I create a basic LSTM cell and set the size, then I apply dropout to it by passing in the LSTM cell and setting the keep probability, pass that to our MultiRNNCell, and then get the initial state. Then this is the forward pass. Here I'm just using dynamic_rnn: we pass in the cell we created, the output from the embedding layer, and the initial state, and that gives us our outputs from the LSTM cells and our final state. A sketch of these graph pieces is below.
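Here's a sketch of those graph pieces, written against the TensorFlow 1.x API this lesson uses. The embedding size of 300 comes from the lesson; the LSTM size, number of layers, and batch size are assumptions on my part:

```python
import tensorflow as tf  # TensorFlow 1.x

n_words = len(vocab_to_int) + 1  # +1 because our integers start at 1
embed_size = 300   # from the lesson
lstm_size = 256    # assumed hyperparameters; yours may differ
lstm_layers = 1
batch_size = 500

graph = tf.Graph()
with graph.as_default():
    # Placeholders: batch size and sequence length left as None so they
    # stay variable; the labels are 2-D for the loss function later on.
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Embedding layer: an (n_words x embed_size) matrix initialized from
    # a random uniform distribution, plus the lookup itself.
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)

    # Basic LSTM cell with dropout, wrapped in a MultiRNNCell.
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    initial_state = cell.zero_state(batch_size, tf.float32)

    # Forward pass: dynamic_rnn returns per-step outputs and final state.
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)
```

Note that zero_state does want a concrete batch size, which is why batch_size shows up there even though the placeholders leave it variable.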
OK. So this is what it looks like when it's all built and trained. Pretty quickly, after only about 25 batches, we have almost 71% accuracy on our validation set. Then, with a bit more training, we eventually get up to, I think, 83%; we hit 84% here, and it ends at 83%. So it does this pretty well, and it trained really quickly. Well, really quickly on a GPU. And if we look at the test accuracy, we get about 83%. I imagine if I added more layers or more units to the LSTM cells, I could get better accuracy, or it might just be that this is the extent of what we can do with our data set.

I hope you enjoyed this lesson, and I hope you enjoyed implementing this sentiment analysis, or sentiment prediction, recurrent neural network. I'll see you in the next lesson. Cheers!