Here’s another problem that can occur. Let’s take a look at the sigmoid function. The curve gets pretty flat on the sides. So, if we calculate the derivative at a point way off to the right or way off to the left, this derivative is almost zero. This is not good, because the derivative is what tells us in what direction to move. This gets even worse in multilayer perceptrons. Check this out. Recall that the derivative of the error function with respect to a weight was the product of all the derivatives calculated at the nodes in the corresponding path to the output. All these derivatives are derivatives of a sigmoid function, so they’re small, and the product of a bunch of small numbers is tiny. This makes training difficult, because basically gradient descent gives us very, very tiny changes to make on the weights, which means we take very tiny steps and we’ll never be able to descend Mount Everest. So how do we fix it? Well, there are some ways.
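To make the shrinking product concrete, here’s a rough sketch in plain Python. It is only an illustration, not the course’s code: the helper names `sigmoid_derivative` and `path_product` are made up for this example, and it leaves out the weights that would also appear in the real chain-rule product, so it only shows the contribution of the sigmoid derivatives themselves.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)). Its maximum is 0.25, at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def path_product(pre_activations):
    """Multiply the sigmoid derivatives along one path to the output.

    Hypothetical helper: the weights on the path are ignored (assumed to be 1)
    so that only the effect of the sigmoid derivatives is visible.
    """
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_derivative(z)
    return product


# The derivative flattens out as we move away from zero.
for x in (0.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  derivative = {sigmoid_derivative(x):.6f}")

# Even in the best case (every node sitting at x = 0), the product shrinks fast with depth.
for depth in (1, 5, 10):
    print(f"depth = {depth:2d}  product = {path_product([0.0] * depth):.2e}")
```

Since each factor is at most 0.25, a path through ten sigmoid nodes contributes a factor of at most 0.25 to the tenth power, which is less than one in a million, and that’s before any node lands on the flat part of the curve.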