12 – Batch vs Stochastic Gradient Descent

First, let’s look at what the gradient descent algorithm is doing. Recall that we’re up at the top of Mount Everest and we need to go down. In order to go down, we take a bunch of steps following the negative of the gradient of the height, which is the error function. Each step is called an epoch, so when we refer to the number of steps, we refer to the number of epochs.

Now, let’s see what happens in each epoch. In each epoch, we take our input, namely all of our data, and run it through the entire neural network. Then we find our predictions and calculate the error, namely, how far they are from their actual labels. And finally, we back-propagate this error in order to update the weights in the neural network. This will give us a better boundary for predicting our data.

Now, this is done for all the data. If we have many, many data points, which is normally the case, then these are huge matrix computations that use tons and tons of memory, and all that just for a single step. If we had to do many steps, you can imagine how this would take a long time and lots of computing power. Is there anything we can do to expedite this? Well, here’s a question: do we need to plug in all our data every time we take a step? If the data is well distributed, it’s almost like a small subset of it would give us a pretty good idea of what the gradient would be. Maybe it’s not the best estimate for the gradient, but it’s quick, and since we’re iterating, it may be a good idea.

This is where stochastic gradient descent comes into play. The idea behind stochastic gradient descent is simply that we take small subsets of the data, run them through the neural network, calculate the gradient of the error function based on those points, and then move one step in that direction. Now, we still want to use all our data, so what we do is the following: we split the data into several batches. In this example, we have 24 points. We’ll split them into four batches of six points. Now we take the points in the first batch and run them through the neural network, calculate the error and its gradient, and back-propagate to update the weights. This will give us new weights, which will define a better boundary region, as you can see on the left. Now, we take the points in the second batch and do the same thing. This will again give us better weights and a better boundary region. Now, we do the same thing for the third batch. And finally, we do it for the fourth batch, and we’re done.

Notice that with the same data, we took four steps, whereas when we did normal gradient descent, we took only one step with all the data. Of course, the four steps we took were less accurate, but in practice, it’s much better to take a bunch of slightly inaccurate steps than to take one good one. Later in this nanodegree, you’ll have the chance to apply stochastic gradient descent and really see the benefits of it.
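
Since the walkthrough splits 24 points into four batches of six and updates the weights after each one, here is a minimal NumPy sketch (not the course’s own code) that contrasts one full-batch gradient descent step with a pass of four mini-batch updates on a toy single-unit model; the data, model, and learning rate are all illustrative assumptions.

```python
# A minimal sketch, assuming a toy logistic-regression unit in NumPy.
# It compares one full-batch gradient descent step with four mini-batch
# ("stochastic") steps over the same 24 points, as described above.
import numpy as np

rng = np.random.default_rng(0)

# 24 toy points in 2-D with labels from a simple rule (illustrative data).
X = rng.normal(size=(24, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def forward(X, w, b):
    """Sigmoid output of a single linear unit (a stand-in for the network)."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def gradients(X, y, w, b):
    """Gradient of the mean cross-entropy error with respect to w and b."""
    preds = forward(X, w, b)
    error = preds - y                      # per-point error signal
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()
    return grad_w, grad_b

learning_rate = 0.5

# --- Batch gradient descent: one step using all 24 points at once ---------
w, b = np.zeros(2), 0.0
grad_w, grad_b = gradients(X, y, w, b)
w -= learning_rate * grad_w
b -= learning_rate * grad_b
print("after 1 full-batch step: ", w, b)

# --- Mini-batch gradient descent: four steps of six points each -----------
w, b = np.zeros(2), 0.0
indices = rng.permutation(len(y))          # shuffle so each batch is representative
for batch in np.array_split(indices, 4):   # four batches of six points
    grad_w, grad_b = gradients(X[batch], y[batch], w, b)
    w -= learning_rate * grad_w            # one (noisier) update per batch
    b -= learning_rate * grad_b
print("after 4 mini-batch steps:", w, b)
```

With shuffled, reasonably well-distributed data, the four noisier mini-batch updates typically move the weights further than the single full-batch step of the same cost per pass, which is exactly the trade-off described above.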