4 – Transforming Text into Numbers

Now that we have validated our theory, that individual words in a review, rather than the review as a whole, are predictive of that review's positive or negative label, it's time to transform our datasets into numbers in a way that respects this belief, so that our neural network can search for correlation in exactly this way. What we want is to present the words as input to the neural network in such a way that it can look for correlation and make the correct positive or negative prediction at the output. The most natural way to start is simply to count each word and feed those counts in as inputs to the neural network. That's simple, well defined, and it should correlate with what we want to predict.

As for predicting "positive" or "negative": obviously a neural network can't predict the word "positive". Well, some more advanced ones can, but that's not what we're trying to do here. Instead, we're going to represent positiveness and negativeness as a number, where positive is the number 1 and negative is 0.

The reason we're doing this with one neuron, giving it two sides the network has to decide between, is that we know positive and negative are mutually exclusive. We're never going to train our network to say that a review is both positive and negative, and by modeling it this way we make these two labels mutually exclusive. This reduces the number of ways the neural network can make a mistake, which reduces the amount it has to learn and actually helps it learn this particular pattern. Some other datasets, by comparison, have five different output labels at different granularities: a review can get one star, three stars, or five stars. It turns out this can actually hurt the neural net and make prediction more difficult, because it has to decide which star rating is most likely.
That's because it allows the network to make double positive predictions, say a three and a four, where the four is incorrect and the three is correct. Since those labels share a lot of signal, this creates ambiguity in the network. In our case, because we have only two labels, we can force the network to choose between them, which reduces the number of ways it can make a mistake. One of the themes throughout this tutorial is going to be making the prediction as easy as possible, framing the problem in such a way that it's as easy as possible for the neural net to make this prediction.

So what do we need, and what is Project 2 going to be about? Project 2 is about setting up two functions that take our input and output data and transform them into the appropriate representations: the counts on the input, and the 1/0 binary representation on the output. The first function I want you to build takes a review, extracts the words from the review, counts them, and puts those counts into a vector. That vector has to be constant length; it needs to be the length of the vocabulary. Then I want you to create another function that simply maps "positive" or "negative" to a 1 or a 0. Go ahead and create those functions, and then I'll show you how I created them and we can compare notes. See you then.
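The two functions described above can be sketched roughly as follows. This is a minimal illustration, not the course's reference solution: the function names and the tiny hard-coded vocabulary are my own placeholders, whereas in the actual project the vocabulary would be built from every unique word in the training reviews.

```python
# Hypothetical toy vocabulary; in Project 2 this would be built from
# every unique word across the training reviews.
vocab = ["this", "movie", "was", "great", "terrible", "fun"]
word2index = {word: i for i, word in enumerate(vocab)}

def review_to_counts(review):
    """Turn a review into a constant-length vector of word counts.

    The vector's length equals the vocabulary size, so every review
    maps to the same input shape regardless of how long it is.
    """
    layer_0 = [0] * len(vocab)
    for word in review.split():
        if word in word2index:          # skip out-of-vocabulary words
            layer_0[word2index[word]] += 1
    return layer_0

def label_to_target(label):
    """Map 'positive'/'negative' to the mutually exclusive targets 1/0."""
    return 1 if label == "positive" else 0
```

For example, `review_to_counts("this movie was great great")` returns `[1, 1, 1, 2, 0, 0]` (the word "great" appears twice), and `label_to_target("negative")` returns `0`, so the single output neuron only ever has to choose between two mutually exclusive values.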
