12 – Further Noise Reduction

So in the last section we significantly increased the speed at which our neural network trains, seeing something in the realm of 1,500 reviews per second on our test data. I mean, just absolutely screaming. Now, in this section we're going to go back and continue to try to reduce the amount of noise that our network has to wade through and increase the amount of signal. The more you can do to help the neural network get past the really obvious stuff so it can focus on the really difficult stuff, the better it will be able to train. And that's really what we want to continue to iterate on. So once again it's about framing the problem so the neural net can be as successful as possible. As we've already seen, a few relatively small changes can have a drastic impact on how fast the network trains and how well it's able to identify the underlying pattern that you care about. And we're just going to continue to iterate. So in this section we're going to go back and ask: okay, what is our neural network actually doing? That's the question we're always asking. What is really happening under the hood? Right now, we're adding together the weight vectors for the words in a review and making a prediction from that sum. So the hidden layer ends up being a sum over all of the words that exist in the review. Now, it's funny. Earlier, when we were doing a small validation of our idea, what did we do? We created a ratio that identified which words were really important, which words had the highest correlation: flawless, superbly, perfection on one side, or unwatchable, pointless, atrocious, redeeming, laughable on the other. And it got me wondering: what if we used these to weight things in the network's favor? Like, hey, neural net, look here. It's okay if you look at other places too, I suppose. But start with this, at the very minimum.
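The ratio being described can be sketched like this: count how often each word appears under positive versus negative labels, then take the log of the positive-to-negative ratio. The two tiny reviews below are made-up placeholder data, not the course's dataset, and the add-one smoothing is one reasonable choice for keeping the logarithm finite.

```python
from collections import Counter
import math

# Hypothetical data: parallel lists of review strings and their labels.
reviews = ["this movie was flawless and superbly acted",
           "pointless atrocious mess with no redeeming value"]
labels = ["POSITIVE", "NEGATIVE"]

positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
for review, label in zip(reviews, labels):
    for word in review.split(' '):
        total_counts[word] += 1
        if label == "POSITIVE":
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1

# Log of the positive-to-negative ratio: > 0 leans positive, < 0 leans negative.
# Add-one smoothing keeps the log finite for words seen on only one side.
pos_neg_ratios = {
    word: math.log((positive_counts[word] + 1.0) / (negative_counts[word] + 1.0))
    for word in total_counts
}
```

Words near zero are the ambiguous ones; words far from zero in either direction are the "punchers" the lecture is pointing at.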
Or if we just actually said: hey, this is where the gold is. Start digging. Now, not every one of these words is useful. Seagal, I guess, isn't too bad; Seagal is actually one of the worst-correlated terms, but don't tell him. And here, like Gandhi: that's probably just a movie about Gandhi, so it doesn't really tell you much about the sentiment. But flawless, superbly, perfection, these are the ones we want the neural net to find. If you think of this as digging for gold, this is a really gold-rich section. There's still some rock and some iron and stuff in here, but this stuff right here, this is what we want our neural net to be naturally finding. So how can we actually use this statistic to help? Would that even make sense to do? Well, what if we did a cutoff? What if we actually limited the vocabulary? To investigate that, we want to see what the distribution of this ratio looks like. For that, visualization libraries are great. Check this out. This is our ratio: zero is totally neutral, and this is the frequency distribution, normalized. There's a ton of words that are kind of ambiguous, not really positive and not really negative, just in the middle. And then out here is actually a relatively small number that are the real punchers, the ones that really, really matter. And to me, this is great, because if the ones in the middle don't really matter, if these are the "a"s, the periods, the commas of the world, and they happen all the time, then I can get rid of them. And that's going to save me tons of computational time, because they're really frequent, but it shouldn't affect quality negatively. In fact, if anything it should affect quality positively, because they don't have that much predictive power. So we have a ton of words in our corpus that don't have predictive power. Let's get rid of them. That's just going to make our neural net that much stronger.
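The cutoff idea can be sketched in a couple of lines: keep only the words whose log-ratio sits far enough from neutral (zero). The ratio values and the cutoff below are illustrative placeholders, not numbers from the course.

```python
# Hypothetical log-ratios for a few words; the cutoff value is illustrative.
pos_neg_ratios = {"flawless": 2.1, "the": 0.02, "movie": -0.05, "atrocious": -2.4}
polarity_cutoff = 0.5

# Keep only words whose log-ratio is far enough from neutral in either direction.
review_vocab = {word for word, ratio in pos_neg_ratios.items()
                if abs(ratio) >= polarity_cutoff}
```

Here only "flawless" and "atrocious" survive; the ambiguous middle of the distribution is dropped before the network ever sees it.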
Now, before we quite jump there, let's look at one more distribution. This is the relative frequency of different terms, what's normally called a Zipfian distribution. Our corpus is so extreme that a few terms dominate. I mean, these few words are just way more frequent than all the rest. And to me this is interesting, because in natural language processing a common trend is to eliminate both the stuff that's most frequent and the stuff that's most infrequent, the stuff that almost never happens. Why? Stuff that's really frequent, like "the", "a", and "and", happens so much that it doesn't really give you much signal. And if something is really infrequent, if it only happens once, that's not a pattern. That's just it happening once. How can you call that correlation if you only see it one time? So people tend to trim from both sides when they're trying to build a classifier that does something really interesting. What we're really doing is looking at these broad visualizations and asking: what is signal and what is noise? Can we use these different metrics to cut out noise and keep signal? So in here, I'm looking at this and I see a big chunk of noise right here: extremely frequent and not very useful. I'm going to cut this out. And I'm looking here, and I'm going, hmm, yeah, there's a lot of stuff in here too. Now, I actually think the ratio is a better filter for getting out the really frequent stuff. But this stuff over here, you can't even see it because it's so infrequent, so I think we're going to try to carve that out too. So, Project 6. I'm really excited about Project 6. This one's great. Project 6 is going to be about making learning faster by reducing noise using these statistics, these ideas. There's no general neural net rule that says that this is how you do it.
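Trimming both tails of a Zipf-shaped frequency distribution can be sketched as a simple count filter. The toy counts and both thresholds below are made-up illustrations (the course itself ends up using the polarity ratio for the frequent side, and a min count for the rare side).

```python
from collections import Counter

# Toy counts with a Zipf-like shape: a couple of filler words dominate.
total_counts = Counter({"the": 9000, "a": 7000, "movie": 400,
                        "flawless": 40, "xylophone": 1})

min_count = 10     # below this, a word is too rare to call a pattern
max_count = 5000   # above this, a word is so frequent it carries little signal

trimmed_vocab = {word for word, count in total_counts.items()
                 if min_count <= count <= max_count}
```

"the" and "a" fall off the frequent end, "xylophone" off the rare end, and the middle of the distribution is what's left for the network.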
What we're doing is framing the problem to make the correlation as obvious as possible to the neural net, so that it has the best chance to ignore noise and to find signal. That's what framing the problem is all about. So in the next project, we're going to install these metrics into our neural net. I want you to go ahead and give it a shot. See if you can modify the SentimentNetwork class so that, in the constructor or the train method, you put in a parameter that says how much of this to carve out. You put in a parameter that says, hey, get rid of all the words that are too frequent or too infrequent. And you'll have a min count: each word has to show up at least ten times, or at least five times, to be included in my vocabulary and to be included in my neural net. And see how that goes. So take a crack at that. In a minute we'll pull up one that works and talk about it. I'll see you there.
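One way the exercise could start is sketched below: only the vocabulary-building step of the class, with `min_count` and `polarity_cutoff` parameters. This is a sketch under my own assumptions, not the course's reference solution; the layers and the train/run methods from the earlier sections are omitted and would stay as they were.

```python
from collections import Counter
import math

class SentimentNetwork:
    """Sketch of the vocabulary-building step only; weights, train(), and
    run() from the earlier sections are unchanged and omitted here."""

    def __init__(self, reviews, labels, min_count=10, polarity_cutoff=0.1):
        positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
        for review, label in zip(reviews, labels):
            for word in review.split(' '):
                total_counts[word] += 1
                if label == "POSITIVE":
                    positive_counts[word] += 1
                else:
                    negative_counts[word] += 1

        self.review_vocab = []
        for word, count in total_counts.items():
            if count < min_count:
                continue  # seen too rarely to call it a pattern
            ratio = math.log((positive_counts[word] + 1.0) /
                             (negative_counts[word] + 1.0))
            if abs(ratio) >= polarity_cutoff:
                self.review_vocab.append(word)

        self.word2index = {w: i for i, w in enumerate(self.review_vocab)}
```

Both knobs shrink the input layer before any training happens, which is exactly where the speed and signal-to-noise gains come from.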
