12 – Mini Project 6 Solution

All right, welcome back. We’re in Project 6, where we’re going to reduce noise by strategically reducing the vocabulary. We’ve taken the metrics we used earlier as a heuristic to see whether an idea was a good one, and now we’re going to use them to carve out some of the noise so that the neural net can better see the signal. In this case most of the action happens in the pre-processing step. Before, we updated the training; here we’re going to reduce the vocabulary, this vocab object, which gets turned into the word2index object and used to create our indices. We want to reduce it in specific ways according to different cutoffs and thresholds, so I created a polarity_cutoff and a min_count cutoff.
What the min_count cutoff says is that, for a word to be included in review_vocab, the vocab we create, it has to exceed a minimum count. If it doesn’t get into the vocab, it never gets trained on, because of the pre-processing step down here: if the word is not in the vocab, it’s not in word2index.keys(), so it doesn’t get added to the indices list, it doesn’t get trained on, and it gets ignored. That’s the first thing.
The second thing is the polarity cutoff. This says that the pos_neg_ratio has to be greater than or equal to polarity_cutoff, or less than or equal to negative polarity_cutoff, because, if you remember, the raw ratio centers around one, so its log centers around zero. So we’re saying a word has to be either below the negative cutoff or above the positive cutoff. That excludes the really tall, and yet irrelevant, section in the middle. That’s the idea. One other thing I did was require a minimum count before the pos_neg_ratio really takes effect.
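
The two cutoffs described above can be sketched roughly as follows. This is a minimal illustration, not the notebook’s actual code: the function names, the `ratio_min_count` parameter, the `"POSITIVE"`/`"NEGATIVE"` label strings, the +1 smoothing, and the toy data are all assumptions made for the example. It also assumes that a word too rare to have a ratio is kept as long as it passes `min_count`, which is one reading of “a minimum count for the ratio to take effect”.

```python
import math
from collections import Counter


def build_review_vocab(reviews, labels, min_count, polarity_cutoff, ratio_min_count):
    """Return the set of words that survive both cutoffs (hypothetical helper)."""
    positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
    for review, label in zip(reviews, labels):
        for word in review.split(" "):
            total_counts[word] += 1
            if label == "POSITIVE":
                positive_counts[word] += 1
            else:
                negative_counts[word] += 1

    # Log ratio, centered on zero. Only words seen at least ratio_min_count
    # times get a ratio at all (assumed threshold; +1 smoothing is also assumed).
    pos_neg_ratios = {}
    for word, count in total_counts.items():
        if count >= ratio_min_count:
            ratio = (positive_counts[word] + 1) / (negative_counts[word] + 1)
            pos_neg_ratios[word] = math.log(ratio)

    review_vocab = set()
    for word, count in total_counts.items():
        if count <= min_count:
            continue  # the word must exceed the minimum count
        ratio = pos_neg_ratios.get(word)
        # Words without a ratio are kept; words with one must be polar enough.
        if ratio is None or abs(ratio) >= polarity_cutoff:
            review_vocab.add(word)
    return review_vocab


def review_to_indices(review, word2index):
    """Words that are not in word2index are simply skipped, so never trained on."""
    indices = set()
    for word in review.split(" "):
        if word in word2index:
            indices.add(word2index[word])
    return sorted(indices)


# Toy data, invented for this example.
reviews = ["great great movie plot", "awful awful movie", "great movie", "awful movie plot"]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

vocab = build_review_vocab(reviews, labels, min_count=1, polarity_cutoff=0.5, ratio_min_count=3)
word2index = {word: i for i, word in enumerate(sorted(vocab))}
indices = review_to_indices("great movie plot unseen", word2index)
```

In this toy run, "movie" appears equally often in both classes, so its log ratio is zero and the polarity cutoff removes it, while "plot" is too rare to get a ratio and passes on count alone.
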
The reason I required a minimum count for the ratio is that occasionally you have really infrequent terms that look like they have very little correlative power or a lot of correlative power, but they only happen once or twice. The metric is a simple rule, a simple cutoff, and I can’t have it treat a word as perfectly positively correlated when it only happens once. In this case I found this was a good number. So those are the two thresholds.
There’s not really a right way or a wrong way to do this. Again, we’re trying to cut out the noise, and that’s different for every data set. When you’re in a new data setting, you’re trying to frame the problem so that the neural net can succeed on that data set. This is just how you go about framing the problem when you’re in a new domain. These are not hard and fast rules; they’re techniques for making sure your neural net has the best chance to capture new and interesting patterns.
So all of the rest is the same. None of it has changed at all: testing is not different, running is not different, other than that the real vocabulary is smaller. Let’s give it a shot. I’m going to set min_count to 20 and polarity_cutoff to 0.05, which is a really small polarity cutoff; it doesn’t cut out too much of the vocabulary. We can get more aggressive about it later. We’ll see how this goes in a second.
What it’s doing right now is calculating all the statistics at startup, so it has to do that first, and then it’s going to start training. When you call the constructor, it calls pre_process_data, and pre_process_data runs all these stats from before to help identify the signal we’re going to train on. All right, very good. So this is actually a little bit of a speed lift from before. Not a lot, just a little bit, because we’ve reduced our vocabulary size. And then we look at our testing data: 85.9%.
Not a huge lift, but a small and significant one. You can pick your battles, and I didn’t spend too long on this. You keep playing with min_count and polarity_cutoff and find which combination seems to work best for you.
Now the other thing I wanted to show here is what happens if we take this and just crank it up. We’ll keep min_count the same and set polarity_cutoff to 0.8. That really carves out the middle of this guy; we’re just going to take a big chunk. 0.8 is like right here to right here, and we’re saying: you know what, this stuff has a little correlation, but I think it’s ambiguous, so just carve it all out. And we say, okay, we’re going to train on that. Then we can really crank up the speed at which it trains, with a relatively small effect on our testing accuracy.
Sometimes you’ll find that this can be beneficial for two reasons. For some problems you want to be able to train on so much more data that you need to run a lot faster. Let’s see how much faster it goes. It’s training, and my gosh, look at that. 7,000 reviews per second. It trained the whole thing in a couple of seconds. And let’s see the damage on the test set: 82.2%, so we lost a bit over three points. Not bad. We got almost a 7x increase in speed for roughly a three-point decrease in quality.
Now, you’re going to find that in neural nets there’s usually a trade-off between speed and quality. The only time I know of when there consistently isn’t one is when you can reduce noise, because reducing noise tends to reduce the amount of data you have to process, which increases both speed and accuracy. That’s why I like it so much. But this can actually be really beneficial. One, for production systems, when you just need something to fly, to get through a ton of data, to really solve a real-world problem that exists out there.
The most sterile, laboratory-style, highest possible score isn’t always the best answer. Sometimes the best answer is: hey, we want to fly through a million medical reports a day so that we can save the most lives; give me something that’s faster. And this is something you can do to try to get there. Once again, we got almost 10x, and if we cranked it up even further, we could probably get more. The cool thing about this is that the more you carve out, the more you’re saying: I need to go faster, and I’m trying to do the minimum amount of damage I can in order to go faster. So I’m getting rid of the terms that are most ambiguous for my prediction, and it’s saving me lots of time.
So why is it saving me time? Well, we do fewer sums. When we have fewer terms, we do fewer sums in the weight matrix and fewer backpropagation updates in the weight matrix, because fewer words in each review were allowed to pass through as candidate words we could train on. That’s all we did, and fortunately it increases the speed very significantly.
The other reason people sometimes increase the speed like this, or in a variety of other ways, is when you have so much training data that you could never train over all of it. It turns out that just being able to run faster over more data, even if you choose a more naive algorithm or you’re limiting a lot of things, makes the accuracy really, really high. The most famous example of this, though maybe people don’t realize this is where its effect comes from, is Word2Vec. Word2Vec is a close approximation of other language models that people had been training forever. But the problem was that those other language models took something like a month to train on a billion tokens, and even then they weren’t making as aggressive weight updates. What Word2Vec showed was that if you stripped out everything and skipped some steps, yeah, it’s an approximation, but it’s so much faster.
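
As an aside on the “fewer sums” point above, here is a rough sketch of why a shorter list of word indices means less work in both the forward pass and the weight update. The names `hidden_layer`, `update_rows`, `weights_0_1`, and `layer_1_delta`, the tiny matrix, and the plain gradient-descent update are assumptions for illustration only, not the project’s actual code.

```python
def hidden_layer(indices, weights_0_1):
    """Sum the weight rows of the words present in the review.

    Returns the hidden layer and the number of additions performed.
    """
    hidden_size = len(weights_0_1[0])
    layer_1 = [0.0] * hidden_size
    additions = 0
    for index in indices:
        for j in range(hidden_size):
            layer_1[j] += weights_0_1[index][j]
            additions += 1
    return layer_1, additions


def update_rows(indices, weights_0_1, layer_1_delta, learning_rate):
    """Update only the rows of words that appeared; returns the number of updates."""
    updates = 0
    for index in indices:
        for j, delta in enumerate(layer_1_delta):
            weights_0_1[index][j] -= learning_rate * delta
            updates += 1
    return updates


# Toy weights: 5 vocabulary words, 2 hidden units (values invented).
weights = [[float(i), 1.0] for i in range(5)]

full_review = [0, 1, 2, 3, 4]   # every word passes through
filtered_review = [1, 4]        # only the words that survived the cutoffs

full_layer, full_adds = hidden_layer(full_review, weights)
filtered_layer, filtered_adds = hidden_layer(filtered_review, weights)
filtered_updates = update_rows(filtered_review, weights, [0.5, 0.5], 0.1)
```

Both costs scale with the number of indices per review, so dropping ambiguous words cuts the work in the forward sum and in the update by the same proportion.
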
With Word2Vec you can train on, what was it, eight billion tokens in something like six hours on a 32-core machine? It’s so much faster that you can train on so much more data. You can gather so much more information that, even though the backhoe is dropping dirt while it’s driving back to the dumpster, it can do it so much faster, over so many more iterations, that it just doesn’t matter.
So there are a lot of cases where we might make it way faster, then train on seven times more data, and that will actually get us to a higher score. I can almost guarantee you that if we had seven times more data and trained this with something seven times faster than this, it would get a way higher score, because once again you’re covering more ground. We lost a marginal amount of accuracy here. So if training over more data was the problem, this can often be a really, really good solution.
So, I hope you’ve enjoyed this. In the next section we’re going to talk a little more about what’s actually happening under the hood in the weights: how they’re adjusting to each other and what’s happening with the terms. We’ll also see some cool visualizations that help us get a more intuitive feeling for what’s going on in the neural net. See you there.
