17 – ROC Curve

Now we’ll learn another technique for evaluating a model, called the receiver operating characteristic curve, or ROC curve for short. It works as follows. Consider this data, which is now one-dimensional, so all the red and blue points lie on one line, and we want to find the correct split. We can have a split around here, or maybe here or here; all of them are good splits. So we’ll call this a good split. Now we can look at this data, which as you can see is perfectly separable over here. So we’ll call that a perfect split. Finally, we have this data over here, which is pretty much random, and there’s not much to split. It seems that anywhere we put the boundary, we’ll have about half blue, half red points on each side. So we’ll call that a bad split or a random split.

Now what we want is to come up with a metric, some number that is high for the perfect split, medium for the good split, and low for the random split. In fact, we want something that gives the perfect split a score of 1.0, the good split something around 0.8, and the random split something around 0.5. That’s where the ROC curve will help us.

So let’s see how to construct these numbers. Let’s take our good data and cut it over here. Now we’ll calculate two ratios. The first one is the true positive rate, which means: out of all the positively labeled points, how many did we classify correctly? That is the number of true positives divided by the total number of positively labeled points. So let’s see how much this is. There are seven positively labeled points and six of them have been correctly labeled positive, so this ratio is six out of seven, or 0.857. Now let’s look at the false positive rate, which means: out of all the negatively labeled points, how many did the model incorrectly think were positive? Out of the seven negatively labeled points, the model thought two of them were positive. So the false positive rate is two out of seven, or 0.286. We’ll just remember these two numbers.
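To make the two ratios concrete, here is a minimal Python sketch. The `labels` list is a made-up arrangement of seven positive and seven negative points, chosen only so that one split reproduces the counts above; it is not the data from the slide. The sketch also assumes the model predicts positive for every point to the right of the split, and the helper name `rates` is hypothetical.

```python
# Hypothetical labels read left to right along the line
# (1 = positive, 0 = negative); not the actual slide data.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def rates(labels, split):
    """Return (true positive rate, false positive rate) for one split.

    Assumption: every point at index >= split is predicted positive.
    """
    predicted_positive = labels[split:]
    true_positives = sum(predicted_positive)
    false_positives = len(predicted_positive) - true_positives
    total_positives = sum(labels)
    total_negatives = len(labels) - total_positives
    return true_positives / total_positives, false_positives / total_negatives

tpr, fpr = rates(labels, split=6)
print(round(tpr, 3), round(fpr, 3))  # 0.857 0.286
```

With the split placed after the sixth point, six of the seven positives and two of the seven negatives fall on the positive side, which gives the same 6/7 and 2/7 as in the walkthrough.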
Now what we’ll do is move this boundary around and calculate the same pair of numbers. So let’s split over here. What is the true positive rate? Well, the model thinks everything is positive, so in particular all the positives are true positives. The true positive rate is 7 divided by 7, which is one. For the false positive rate, since the model thinks everything is positive, all the negatives are false positives. So the false positive rate is again 7 divided by 7, which is one. So again, we’ll remember these two values, one and one.

Now let’s go to the other extreme. Let’s put the bar over here and see what the true positive rate is. Well, the model thinks nothing is positive, so in particular there are no true positives and the ratio is 0 divided by 7, which is zero. For the false positive rate, again, the model thinks nothing is positive, so there are no false positives and the ratio is zero over seven, which again is zero. We’ll remember these two numbers. We can see that no matter how the data looks, the two extremes will always be one, one and zero, zero.

Now we can do this for every possible split and record those numbers. Here we have a few of them that we’ve calculated. Now the magic happens. We just plot these pairs in the plane, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis, and we get a curve. Then we calculate the area under the curve, and here we get around 0.8. This isn’t exact, but it’s around there. You can calculate it on your own and see how much you get.

So now, let’s do the same thing for the perfect split. Here are all the ratios. Notice that if the boundary is on the red side, then the true positive rate is one, since every positive point has been predicted positive. Similarly, if the boundary is on the blue side, then every negative point has been predicted negative, and so the false positive rate is zero. In particular, at the perfect split point we have the pair zero, one: a false positive rate of zero and a true positive rate of one.
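The sweep over every possible split can be sketched in a few lines of Python. As before, the `labels` list is a made-up example of "good" data with seven positives and seven negatives, not the data from the slide, and the helper names `roc_points` and `area_under_curve` are hypothetical. The area is computed with the trapezoid rule over the recorded points.

```python
# Hypothetical "good" data, left to right (1 = positive, 0 = negative).
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def roc_points(labels):
    """Return (false positive rate, true positive rate) for every split.

    Assumption: points at index >= split are predicted positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Start with the bar at the far right (nothing predicted positive)
    # and move it left until everything is predicted positive.
    for split in range(len(labels), -1, -1):
        tp = sum(labels[split:])
        fp = len(labels) - split - tp
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points(labels)
auc = area_under_curve(points)
print(points[0], points[-1])  # (0.0, 0.0) (1.0, 1.0)
print(round(auc, 3))          # 0.816
```

For this particular made-up data the area comes out to 40/49, roughly 0.82, which is in line with the "around 0.8" quoted for the good split. The first and last points are the two extremes, zero, zero and one, one.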
Thus, when we plot these numbers, the curve looks like a square, and the square has area one, which means the area under the ROC curve for the perfect split is one. Finally, we do this for the random split. Here you can try it on your own, but basically, since every split leaves around half blue, half red points on each side, the two numbers in each pair will be close to each other, and the curve will be very close to being just a diagonal between zero, zero and one, one. So if the model is random, then the area under the ROC curve is around 0.5.

So to summarize, we have three possible scenarios: some random data which is hard to split, some pretty good data which we can split well while making some errors, and some perfectly divided data which we can split with no errors. Each one is associated with a curve. The areas under the curve are close to 0.5 for the random model, around 0.8 for the good model, and one for the perfect model. So in summary, the closer your area under the ROC curve is to one, the better your model is.

Now, here is a question: can the area under the curve be less than 0.5? In fact, yes. It can go all the way down to zero. How would a model look if the area under the curve is zero? Well, it will look backwards. It’ll have the blue points in the red area and the red points in the blue area, so flipping the data may help.
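The scenarios in the summary can be checked with a short, self-contained Python sketch. All three label lists are made up for illustration (seven positives and seven negatives each), and the helper name `auc` is hypothetical. The same assumption applies as in the walkthrough: everything to the right of the split is predicted positive.

```python
def auc(labels):
    """Area under the ROC curve for points ordered left to right.

    Assumption: points at index >= split are predicted positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for split in range(len(labels), -1, -1):
        tp = sum(labels[split:])
        fp = len(labels) - split - tp
        points.append((fp / negatives, tp / positives))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule
    return area

# Hypothetical data sets (1 = positive, 0 = negative).
perfect = [0] * 7 + [1] * 7                              # cleanly separated
mixed = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0]      # no real pattern
backwards = [1] * 7 + [0] * 7                            # positives on the wrong side

print(round(auc(perfect), 3))          # 1.0
print(round(auc(mixed), 3))            # 0.49
print(round(auc(backwards), 3))        # 0.0
print(round(auc(backwards[::-1]), 3))  # 1.0 after flipping the data
```

The last line illustrates the closing remark: a model with an area of zero is perfectly wrong, so reversing the data (or, equivalently, swapping which side counts as positive) turns it into a perfect one.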
