The best way to fix this is to change the activation function. Here’s another one, the hyperbolic tangent, which is given by the formula underneath: e to the x minus e to the minus x, divided by e to the x plus e to the minus x. This one is similar to the sigmoid, but since its range is between minus one and one, the derivatives are larger. This small difference actually led to great advances in neural networks, believe it or not. Another very popular activation function is the Rectified Linear Unit, or ReLU. This is a very simple function. It only says, if you’re positive, I’ll return the same value, and if you’re negative, I’ll return zero. Another way of seeing it is as the maximum between x and zero. This function is used a lot instead of the sigmoid, and it can improve the training significantly without sacrificing much accuracy, since the derivative is one if the number is positive. It’s fascinating that this function, which barely breaks linearity, can lead to such complex non-linear solutions. So now, with better activation functions, when we multiply derivatives to obtain the derivative with respect to any weight, the products will be made of slightly larger numbers, which will make the derivative less small and will allow us to do gradient descent. We’ll represent the ReLU unit by the drawing of its function. Here’s an example of a multi-layer perceptron with a bunch of ReLU activation units. Note that the last unit is a sigmoid, since our final output still needs to be a probability between zero and one. However, if we let the final unit be a ReLU, we can actually end up with regression models that predict a value. This will be of use in the recurrent neural network section of the Nanodegree.
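To make the comparison concrete, here’s a minimal sketch of the three activation functions and their derivatives in plain Python. The function names, including the `chained_gradient` helper, are made up for this illustration and are not from the lecture. The helper multiplies the same derivative several times, which is only a rough stand-in for what happens across layers in backpropagation, and the ReLU derivative at exactly zero is set to zero here by assumption.

```python
import math


def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # (e^x - e^-x) / (e^x + e^-x), with range (-1, 1).
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    # The maximum between x and zero.
    return max(x, 0.0)


def sigmoid_grad(x):
    # Derivative of the sigmoid; it is never larger than 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    # Derivative of tanh; it reaches 1 at x = 0.
    t = tanh(x)
    return 1.0 - t * t


def relu_grad(x):
    # Derivative of ReLU: one for positive inputs, zero otherwise
    # (the value at exactly x = 0 is a convention chosen for this sketch).
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, x, depth):
    # Hypothetical helper: multiplies `depth` copies of the derivative at x,
    # mimicking how derivatives get multiplied together across layers.
    product = 1.0
    for _ in range(depth):
        product *= grad_fn(x)
    return product
```

For example, `chained_gradient(sigmoid_grad, 0.0, 5)` is 0.25 multiplied by itself five times, which is about 0.001, while `chained_gradient(tanh_grad, 0.0, 5)` stays at one, and for any positive input the ReLU product stays at one as well.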