2 – Framing the Problem

Let’s start by curating a dataset. Neural networks by themselves can’t really do anything. All a neural network does is search for direct or indirect correlation between two datasets. So for a neural network to learn anything, we have to present it with two meaningful datasets. The first must represent what we know. The second must represent what we want to know: what we want the neural net to be able to tell us. As the network trains, it searches for correlation between the two, so that eventually it can take one and learn to predict the other.

Let me show you what I mean with our example dataset. Here we load a set of IMDB movie reviews into a list. These are movie reviews that people uploaded to the site IMDB. The labels come with the reviews, since people rated the movies with one to five stars. In this case we’ve bucketed them into positive reviews, higher than three stars, and negative reviews, lower. We have 25,000 reviews. Here’s an example of one of those reviews, which is a negative review, I believe. Yeah. And this one is a positive review, and it comes with a positive label.

So this is our dataset. Actually it’s two datasets, right? The reviews are what we know, and what we will know in the future. The labels are what we want to know about the reviews. We’re going to train a neural network to take the reviews as input and accurately predict the labels, so that when we see more human-generated text in the future, in theory, our neural net will be able to classify it.

The first thing we want to do when we encounter a dataset like this is develop a predictive theory. A predictive theory is really about asking: if I were the neural net, and I were trying to find correlation in the dataset, where would I look?
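To make the two-dataset idea concrete, here is a minimal sketch. The four short texts are toy stand-ins for the real 25,000 reviews, and the names `reviews`, `labels`, and `show_pair` are assumptions made for illustration, not the actual loading code.

```python
# Toy stand-ins for the real data; texts and names are illustrative assumptions.
reviews = [
    "this movie was terrible",
    "this movie was great",
    "terrible plot and bad acting",
    "an excellent and fascinating film",
]
labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

# The two lists are parallel: labels[i] is what we want to know about reviews[i].
assert len(reviews) == len(labels)


def show_pair(i):
    """Return one (label, review) pair as a printable line."""
    return f"{labels[i]}\t:\t{reviews[i]}"


for i in range(len(reviews)):
    print(show_pair(i))
```

The point is only the shape of the problem: one list holds what we know, the other holds what we want to predict, and they line up index by index.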
The best thing I like to do when developing a predictive theory is just take a look at the dataset and try to figure out whether I can solve the problem myself. Then I look inward and ask: what am I using under the hood to understand whether this had a positive or negative sentiment? So let’s just read a few. “This movie was terrible, but it has some good effects.” Positive review: “adrian pasdar is excellent in this film, he makes a fascinating woman.” Negative review: “comment, this movie is impossible, is terrible, very improbable, bad interpretation.” Positive: “excellent episode movie ala pulp fiction, days suicides, it doesn’t get more,” and then it continues on.

So already I’m starting to get a feel for it. It seems pretty obvious to me that these are really polarized examples. But what I’m looking for is: what in here creates a correlation between my reviews dataset and my labels dataset?

Well, what is in here? In its native format, when I load it in, it’s a list of characters, in this case I guess 26-plus different characters. Is there correlation in its current state? I don’t really think the letter M or the letter T has much predictive power. We have M in negative examples and we have M in positive examples. It doesn’t really help us, so I don’t think characters would be a good source. The native state it’s in right now is probably not very good.

Now let’s consider the opposite end of the spectrum, where we take the entire review as the unit. Well, it is very predictive. Every time we saw this review, it was a negative example. Unfortunately, we only saw it once. And I think we can expect most reviews we see in the future to be relatively original. We’ll see some people say “this movie was terrible” or “this movie was great,” really straightforward things like that. But most reviews have nuance. They have a particular choice of words and sequence that’s just not going to be duplicated very often. So training a neural net on the entire review might not work that well in the real world, because we just don’t see the same review very often. Great correlation, but poor generalization.

What about something in between characters and the full review? I noticed that in NEGATIVE examples we see words like terrible, and improbable, and terrible, and terrible, and trash: individual words that might have some correlation with these POSITIVE and NEGATIVE labels, in contrast to excellent, or fascinating, or excellent quality. So maybe it’s actually the counts of the different kinds of words occurring in these reviews. I think that’s a better theory. Certainly better than characters, and certainly better than the reviews as a whole.

But before we run off creating a neural net, I find it’s best to do a quick validation. This is something we think is true, given the theory we have. But before we go and build everything, we should see whether what we have is naively predictive. What I typically do here is just count. I formulate a count-based heuristic to see: does this phenomenon seem to happen more for this label than for that label? Right, is it a good [INAUDIBLE].

So the first project I would like you to tackle, and then I will show you how I tackle it, is to think about how you would take this dataset and validate our theory that words are predictive of labels. Go ahead and take a few minutes, take a crack at it, and see if you can come up with a way of showing that it either is or is not predictive.
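If you want a starting point for that exercise, here is one possible count-based heuristic, as a rough sketch and not necessarily the approach shown later. It uses four toy reviews in place of the real data, and the names `positive_counts`, `negative_counts`, and `positive_share` are assumptions made up for this example.

```python
from collections import Counter

# Toy data for illustration only; the real dataset has 25,000 labeled reviews.
reviews = [
    "this movie was terrible",
    "this movie was excellent",
    "terrible plot and terrible acting",
    "an excellent and fascinating film",
]
labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

# Count how often each word appears under each label.
positive_counts = Counter()
negative_counts = Counter()
for review, label in zip(reviews, labels):
    words = review.split()
    if label == "POSITIVE":
        positive_counts.update(words)
    else:
        negative_counts.update(words)


def positive_share(word):
    """Fraction of a word's occurrences that fall in POSITIVE reviews.

    Returns None for words that never appear in the data.
    """
    pos = positive_counts[word]
    neg = negative_counts[word]
    total = pos + neg
    return pos / total if total else None


for word in ["terrible", "excellent", "movie"]:
    print(word, positive_share(word))
```

A share near 0 or 1 suggests the word leans toward one label, while a share near 0.5 (like "movie" here) suggests it carries little signal. On real data you would probably also want to ignore very rare words, since a word seen once always looks perfectly predictive.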
