# 4 – Mini Project 1 Solution

All right. Presumably you took a stab at validating our theory that words are predictive of labels. Now I’m going to show you how I would attack this problem, and then we can compare notes. I really like this learning style, because I think it’s most beneficial for people to attack a problem in their head first and realize what their go-to set of tools is, and then see how another person might do it. I find that a really fulfilling and educational way to know where you’re at in terms of how you would tackle a problem in the real world, when you’re not in the classroom, and then compare that to how someone else might do it.

So let’s tackle this problem. I’ve got a couple of go-to tools I always like to use. First, `from collections import Counter`, because we’re going to be counting words. I find that the `Counter` object is just so fast and so much easier than using dictionaries, and I’ll show you how to use it. And then `numpy` for whenever we do numerical calculations.

The first thing we’re going to do is count the words that show up in positive reviews and the words that show up in negative reviews. So `for i in range(len(reviews))`. Gotta create our counters. So `positive_counts = Counter()`. This is how `Counter` objects work: you create an empty counter and it acts a little bit like a dictionary. And we’ll just do `total_counts` so it’s cached. You could use the other ones to create it, okay.

So they act like a dictionary, but you don’t have to create the keys up front. You can just start incrementing them as if every key you put in already existed. You’ll see what I mean in a second. So `for i in range(len(reviews))`.
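Before the loop body is filled in, the `Counter` behavior just described can be sketched as follows. This is a minimal illustration, and the word list is a made-up example rather than data from the reviews.

```python
from collections import Counter

# Hypothetical toy words, not taken from the review dataset.
words = ["great", "movie", "great", "plot"]

counts = Counter()          # starts empty, like a dict
for word in words:
    counts[word] += 1       # no need to create the key first

print(counts["great"])      # count for a word that was seen
print(counts["missing"])    # unseen keys read as zero instead of raising KeyError

# The same increment on a plain dict fails on the first unseen key.
plain = {}
try:
    plain["great"] += 1
except KeyError:
    print("a plain dict needs the key created first")
```

With a plain dictionary you would have to check for the key or use `dict.get` on every increment, which is the bookkeeping the counter removes.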
So for every review and label, if `labels[i]` is a `POSITIVE` label, then we’re going to count all the words in that review and add them to our positive counts. So `for word in reviews[i].split()`: `positive_counts[word] += 1` and `total_counts[word] += 1`. Then `else:` and we’ll do the same thing here, except we add to `negative_counts` and `total_counts`. Okay, let’s check that out. We run that. It takes a second to run because we have 25,000 reviews.

Next, we’re going to take a look at it. The counter gives you a nice, convenient function, so I can say `positive_counts.most_common()`, and there you go. Whenever you count words at all, the most frequent ones are what you get here. This doesn’t really tell me whether these words are indicative of positive reviews. It just tells me whether they’re frequent words or not.

So what we need is something called normalization. We’re not really interested in the most frequent positive word. We’re interested in the word that is most frequently positive versus negative, right? Because if I look at the negative counts, [LAUGH] it’s the same words. So we want to come up with some sort of ratio that compares these two lists, as opposed to looking at each list by itself.

To speed things up a little bit, I’m going to show you how I would calculate this ratio, which I also put into a counter. If we look at positive and negative ratios, words with a positive ratio look kind of like this. So we’re starting to see a little bit of signal. These are mostly names, so what I’m going to guess is that these are movie reviews, right? People have some favorite actors, and they like to talk about them positively. So I guess it’s probably good if your last name is Caruso or Gino or something like that, right? I’m going to guess these are not very well favored actors.
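The counting loop described above can be sketched like this. The three reviews and their labels are hypothetical stand-ins for the course’s `reviews` and `labels` lists, which hold 25,000 entries; only the variable names come from the walkthrough.

```python
from collections import Counter

# Hypothetical stand-ins for the real `reviews` and `labels` lists.
reviews = [
    "the movie was wonderful",
    "the plot was pointless",
    "the acting was superb and the ending was wonderful",
]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

for i in range(len(reviews)):
    if labels[i] == "POSITIVE":
        for word in reviews[i].split():
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split():
            negative_counts[word] += 1
            total_counts[word] += 1

# Raw frequency: filler words such as "the" lead both lists.
print(positive_counts.most_common(3))
print(negative_counts.most_common(3))
```

Even on this tiny sample, "the" is the most frequent word for both labels, which is why raw counts alone say little about sentiment.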
But I’m also guessing that my theory about word correlation is right. So maybe this isn’t true, but actors’ names happen. If this actor’s name was only mentioned once, or I guess at least ten times, then if it was just in one positive review, it might show up here. When we’re looking for correlation, we want things that happen very frequently and have an affinity somewhere. For somebody who’s mentioned just once, 100% of the mentions will be positive, but that’s not really indicative of being a positive feature. So let’s up this to 50 and check it out. I see a bunch of names. Ooh, “excellently.” Interesting, okay. So we see “delightfully,” okay. Well, let’s up this a little bit more. As you can see, I’m investigating the data. I’m taking a look, looking for patterns, refining how I’m looking, and just trying to get a feel for what the data is like.

Wow, so now I’m really seeing stuff. I see a few names, and I see “flawless,” “superbly,” “perfection,” “astaire,” “captures,” “wonderful.” Okay, so now I’m really seeing words that I would expect to be positive words being positively indicative of these labels. Let’s see how it looks if I look for negative. “Pointless,” “atrocious,” “drivel,” “laughable,” “awful.” Okay, great.

So at this point, I’m feeling pretty good about the theory. It’s clear that the words I would expect to be predictive seem to be predictive, or at least correlated with the labels I think they should be correlated with. In the next section, we’re going to talk about how we can leverage this predictive theory to create input and output data so our network can refine this correlative power into a classifier. So stay tuned for the next section, and I’ll see you there.
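The effect of raising the minimum count can be sketched as below. The transcript does not spell out the ratio formula, so this assumes a simple positive-to-negative ratio with `+ 1` in the denominator to avoid dividing by zero. The function name `positive_to_negative_ratios`, the `min_count` parameter, and all the counts are hypothetical.

```python
from collections import Counter

# Hypothetical counts standing in for the ones built from the reviews.
positive_counts = Counter({"the": 900, "wonderful": 180, "awful": 20, "caruso": 5})
negative_counts = Counter({"the": 950, "wonderful": 40, "awful": 200})
total_counts = positive_counts + negative_counts


def positive_to_negative_ratios(min_count):
    """Positive/negative ratio for words seen at least `min_count` times."""
    ratios = Counter()
    for word, count in total_counts.items():
        if count >= min_count:
            # Assumed formula: the + 1 only guards against division by zero.
            ratios[word] = positive_counts[word] / (negative_counts[word] + 1)
    return ratios


loose = positive_to_negative_ratios(min_count=1)
strict = positive_to_negative_ratios(min_count=100)

# With a low threshold, a rarely mentioned name can top the list.
print(loose.most_common(2))

# With a higher threshold, only frequent words remain.
print(strict.most_common())

# Reversing the ranking surfaces the most negative words first.
print(list(reversed(strict.most_common())))
```

Here the rare name leads the ranking at the low threshold and disappears at the high one, leaving the sentiment words at the two ends and the filler word near a ratio of one.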