9 – Random Restart

One way to solve this is to use random restarts, and this is just very simple. We start from a few different random places and do gradient descend from all of them. This increases the probability that we’ll get to the global minimum, or at least a pretty good local minimum.

8 – Local Minima

So let’s recall a gradient descent does. What it does is it looks at the direction where you descend the most and then it takes a step in that direction. But in Mt. Everest, everything was nice and pretty since that was going to help us go down the mountain. But now, what if we … Read more

7 – Dropout

Here’s another way to prevent overfitting. So, let’s say this is you, and one day you decide to practice sports. So, on Monday you play tennis, on Tuesday you lift weights, on Wednesday you play American football, on Thursday you play baseball, on Friday you play basketball, and on Saturday you play ping pong. Now, … Read more

6 – Regularization

Well the first observation is that both equations give us the same line, the line with equation X1+X2=0. And the reason for this is that solution two is really just a scalar multiple of solution one. So let’s see. Recall that the prediction is a sigmoid of the linear function. So in the first case, … Read more

5 – DL 53 Q Regularization

Now let me show you a subtle way of overfitting a model. Let’s look at the simplest data set in the world, two points, the point one one which is blue and the point minus one minus one which is red. Now we want to separate them with a line. I’ll give you two equations … Read more

4 – Model Complexity Graph

So, let’s start from where we left off, which is, we have a complicated network architecture which would be more complicated than we need but we need to live with it. So, let’s look at the process of training. We start with random weights in her first epoch and we get a model like this … Read more

3 – Underfitting And Overfitting

So, let’s talk about life. In life, there are two mistakes one can make. One is to try to kill Godzilla using a flyswatter. The other one is to try to kill a fly using a bazooka. What’s the problem with trying to kill Godzilla with a flyswatter? That we’re oversimplifying the problem. We’re trying … Read more

2 – Testing

So let’s look at the following data form by blue and red points, and the following two classification models which separates the blue points from the red points. The question is which of these two models is better? Well, it seems like the one on the left is simpler since it’s a line and the … Read more

15 – Error Functions Around the World

So, in this nano degree, we covered a few error functions, but there are a bunch of other error functions around the world that made the shortlist, but we didn’t have time to study them. So, here they are. These are the ones you met: there is Mount Everest and Mount Kilimanjerror. The ones you … Read more

14 – Momentum

So, here’s another way to solve a local minimum problem. The idea is to walk a bit fast with momentum and determination in a way that if you get stuck in a local minimum, you can, sort of, power through and get over the hump to look for a lower minimum. So let’s look at … Read more

13 – Learning Rate

The question of what learning rate to use is pretty much a research question itself but here’s a general rule. If your learning rate is too big then you’re taking huge steps which could be fast at the beginning but you may miss the minimum and keep going which will make your model pretty chaotic. … Read more

12 – Batch vs Stochastic Gradient Descent

First, let’s look at what the gradient descent algorithm is doing. So, recall that we’re up here in the top of Mount Everest and we need to go down. In order to go down, we take a bunch of steps following the negative of the gradient of the height, which is the error function. Each … Read more

11 – Other Activation Functions

The best way to fix this is to change the activation function. Here’s another one, the Hyperbolic Tangent, is given by this formula underneath, e to the x minus e to the minus x divided by e to the x plus e to the minus x. This one is similar to sigmoid, but since our … Read more

10 – Vanishing Gradient

Here’s another problem that can occur. Let’s take a look at the sigmoid function. The curve gets pretty flat on the sides. So, if we calculate the derivative at a point way at the right or way at the left, this derivative is almost zero. This is not good cause a derivative is what tells … Read more

1 – Training Optimization

So by now we’ve learned how to build a deep neural network and how to train it to fit our data. Sometimes however, we go out there and train on ourselves and find out that nothing works as planned. Why? Because there are many things that can fail. Our architecture can be poorly chosen, our … Read more

9 – Random Restart

One way to solve this is to use random restarts, and this is just very simple. We start from a few different random places and do gradient descend from all of them. This increases the probability that we’ll get to the global minimum, or at least a pretty good local minimum.

8 – Local Minima

So let’s recall a gradient descent does. What it does is it looks at the direction where you descend the most and then it takes a step in that direction. But in Mt. Everest, everything was nice and pretty since that was going to help us go down the mountain. But now, what if we … Read more

7 – Dropout

Here’s another way to prevent overfitting. So, let’s say this is you, and one day you decide to practice sports. So, on Monday you play tennis, on Tuesday you lift weights, on Wednesday you play American football, on Thursday you play baseball, on Friday you play basketball, and on Saturday you play ping pong. Now, … Read more

6 – Regularization

Well the first observation is that both equations give us the same line, the line with equation X1+X2=0. And the reason for this is that solution two is really just a scalar multiple of solution one. So let’s see. Recall that the prediction is a sigmoid of the linear function. So in the first case, … Read more

5 – DL 53 Q Regularization

Now let me show you a subtle way of overfitting a model. Let’s look at the simplest data set in the world, two points, the point one one which is blue and the point minus one minus one which is red. Now we want to separate them with a line. I’ll give you two equations … Read more