# 6 – Regularization

Well the first observation is that both equations give us the same line, the line with equation X1+X2=0. And the reason for this is that solution two is really just a scalar multiple of solution one. So let’s see. Recall that the prediction is a sigmoid of the linear function. So in the first case, for the 0.11, it would be sigmoid of 1+1, which is sigmoid of 2, which is 0.88. This is not bad since the point is blue, so it has a label of one. For the point (-1, -1), the prediction is sigmoid of -1+-1, which is sigmoid of -2, which is 0.12. It’s also not best since a point label has a label of zero since it’s red. Now let’s see what happens with the second model. The point (1, 1) has a prediction sigmoid of 10 times 1 plus 10 times 1 which is sigmoid of 20. This is a 0.9999999979, which is really close to 1, so it’s a great prediction. And the point (-1, -1) has prediction sigmoid of 10 times negative one plus 10 times negative one, which is sigmoid of minus 20, and that is 0.0000000021. That’s a really, really close to zero so it’s a great prediction. So the answer to the quiz is the second model, the second model is super accurate. This means it’s better, right? Well after the last section you may be a bit reluctant since this hint’s a bit towards overfitting. And your hunch is correct. The problem is overfitting but in a subtle way. Here’s what’s happening and here’s why the first model is better even if it gives a larger error. When we apply sigmoid to small values such as X1+X2, we get the function on the left which has a nice slope to the gradient descent. When we multiply the linear function by 10 and take sigmoid of 10X1+10X2, our predictions are much better since they’re closer to zero and one. But the function becomes much steeper and it’s much harder to do great descent here. Since the derivatives are mostly close to zero and then very large when we get to the middle of the curve. Therefore, in order to do gradient descent properly, we want a model like the one in the left more than a model like the one in the right. In a conceptual way, the model in the right is too certain and it gives little room for applying gradient descent. Also as we can imagine, the points that are classified incorrectly in the model in the right, will generate large errors and it will be hard to tune the model to correct them. These can be summarized in the quote by the famous philosopher and mathematician BertrAIND Russell. The whole problem with artificial intelligence, is that bad models are so certain of themselves, and good models are so full of doubts. Now the question is, how do we prevent this type of overfitting from happening? This seems to not be easy since the bad model gives smaller errors. Well, all we have to do is we have to tweak the error function a bit. Basically we want to punish high coefficients. So what we do is we take the old error function and add a term which is big when the weights are big. There are two ways to do this. One way is to add the sums of absolute values of the weights times a constant lambda. The other one is to add the sum of the squares of the weights times that same constant. As you can see, these two are large if the weights are large. The lambda parameter will tell us how much we want to penalize the coefficients. If lambda is large, we penalized them a lot. And if lambda is small then we don’t penalize them much. And finally, if we decide to go for the absolute values, we’re doing L1 regularization, and if we decide to go for the squares, then we’re doing L2 regularization. Both are very popular, and depending on our goals or application, we’ll be applying one or the other. Here are some general guidelines for deciding between L1 and L2 regularization. When we apply L1, we tend to end up with sparse vectors. That means, small weights will tend to go to zero. So if we want to reduce the number of weights and end up with a small set, we can use L1. This is also good for feature selections and sometimes we have a problem with hundreds of features, and L1 regularization will help us select which ones are important, and it will turn the rest into zeroes. L2 on the other hand, tends not to favor sparse vectors since it tries to maintain all the weights homogeneously small. This one normally gives better results for training models so it’s the one we’ll use the most. Now let’s think a bit. Why would L1 regularization produce vectors with sparse weights, and L2 regularization will produce vectors with small homogeneous weights? Well, here’s an idea of why. If we take the vector (1, 0), the sums of the absolute values of the weights are one, and the sums of the squares of the weights are also one. But if we take the vector (0.5, 0.5), the sums of the absolute values of the weights is still one, but the sums of the squares is 0.25+0.25, which is 0.5. Thus, L2 regularization will prefer the vector point (0.5, 0.5) over the vector (1, 0), since this one produces a smaller sum of squares. And in turn, a smaller function.