So, let’s start from where we left off, which is, we have a complicated network architecture which would be more complicated than we need but we need to live with it. So, let’s look at the process of training. We start with random weights in her first epoch and we get a model like this one, which makes lots of mistakes. Now as we train, let’s say for 20 epochs we get a pretty good model. But then, let’s say we keep going for a 100 epochs, we’ll get something that fits the data much better, but we can see that this is starting to over-fit. If we go for even more, say 600 epochs, then the model heavily over-fits. We can see that the blue region is pretty much a bunch of circles around the blue points. This fits the training data really well, but it will generalize horribly. Imagine a new blue point in the blue area. This point will most likely be classified as red unless it’s super close to a blue point. So, let’s try to evaluate these models by adding a testing set such as these points. Let’s make a plot of the error in the training set and the testing set with respect to each epoch. For the first epoch, since the model is completely random, then it badly misclassifies both the training and the testing sets. So, both the training error and the testing error are large. We can plot them over here. For the 20 epoch, we have a much better model which fit the training data pretty well, and it also does well in the testing set. So, both errors are relatively small and we’ll plot them over here. For the 100 epoch, we see that we’re starting to over-fit. The model fits the data very well but it starts making mistakes in the testing data. We realize that the training error keeps decreasing, but the testing error starts increasing, so, we plot them over here. Now, for the 600 epoch, we’re badly over-fitting. We can see that the training error is very tiny because the data fits the training set really well but the model makes tons of mistakes in the testing data. So, the testing error is large. We plot them over here. Now, we draw the curves that connect the training and testing errors. So, in this plot, it is quite clear when we stop under-fitting and start over-fitting, the training curve is always decreasing since as we train the model, we keep fitting the training data better and better. The testing error is large when we’re under-fitting because the model is not exact. Then it decreases as the model generalizes well until it gets to a minimum point – the Goldilocks spot. And finally, once we pass that spot, the model starts over-fitting again since it stops generalizing and just starts memorizing the training data. This plot is called the model complexity graph. In the Y-axis, we have a measure of the error and in the X-axis we have a measure of the complexity of the model. In this case, it’s the number of epochs. And as you can see, in the left we have high testing and training error, so we’re under-fitting. In the right, we have a high testing error and low training error, so we’re over-fitting. And somewhere in the middle, we have our happy Goldilocks point. So, this determines the number of epochs we’ll be using. So, in summary, what we do is, we degrade in descent until the testing error stops decreasing and starts to increase. At that moment, we stop. This algorithm is called Early Stopping and is widely used to train neural networks.