## 9 – Random Restart

One way to solve this is to use random restarts, which is a very simple idea. We start from a few different random places and do gradient descent from all of them. This increases the probability that we’ll get to the global minimum, or at least a pretty good local minimum.
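Here is a minimal sketch of random restarts. The error function, the learning rate, the number of restarts, and all names below are assumptions made up for illustration; the lesson itself does not specify them.

```python
import random


def error(x):
    # Hypothetical 1-D error function with two valleys (not from the lesson).
    return x**4 - 3 * x**2 + x


def error_gradient(x):
    # Derivative of the hypothetical error function above.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=500):
    # Plain gradient descent: repeatedly step against the gradient.
    x = start
    for _ in range(steps):
        x -= learning_rate * error_gradient(x)
    return x


def random_restart(num_restarts=10, low=-2.0, high=2.0, seed=0):
    # Run gradient descent from several random starting points
    # and keep the finishing point with the lowest error.
    rng = random.Random(seed)
    starts = [rng.uniform(low, high) for _ in range(num_restarts)]
    finishes = [gradient_descent(start) for start in starts]
    return min(finishes, key=error)


best_x = random_restart()
print(best_x, error(best_x))
```

A single run that starts at `x = 1.0` settles in the shallower valley on the right, while the best of several restarts ends up in the deeper valley on the left.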

## 8 – Local Minima

So let’s recall what gradient descent does. It looks at the direction where you descend the most and then it takes a step in that direction. On Mt. Everest, everything was nice and pretty since that was going to help us go down the mountain. But now, what if we … Read more

## 7 – Dropout

Here’s another way to prevent overfitting. So, let’s say this is you, and one day you decide to practice sports. So, on Monday you play tennis, on Tuesday you lift weights, on Wednesday you play American football, on Thursday you play baseball, on Friday you play basketball, and on Saturday you play ping pong. Now, … Read more

## 6 – Regularization

Well the first observation is that both equations give us the same line, the line with equation X1+X2=0. And the reason for this is that solution two is really just a scalar multiple of solution one. So let’s see. Recall that the prediction is a sigmoid of the linear function. So in the first case, … Read more
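To see what a scalar multiple does to the predictions, here is a small sketch. The excerpt does not show the two equations, so the weights below (`x1 + x2` and `10*x1 + 10*x2`) are assumptions chosen only so that one solution is a multiple of the other; the two points are the blue point (1, 1) and the red point (-1, -1) from the quiz.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w1, w2, x1, x2):
    # Prediction is the sigmoid of the linear function w1*x1 + w2*x2.
    return sigmoid(w1 * x1 + w2 * x2)


small = (1.0, 1.0)    # assumed weights: x1 + x2
large = (10.0, 10.0)  # assumed scalar multiple: 10*x1 + 10*x2

for name, (w1, w2) in (("small", small), ("large", large)):
    blue = predict(w1, w2, 1, 1)    # prediction for the point (1, 1)
    red = predict(w1, w2, -1, -1)   # prediction for the point (-1, -1)
    print(name, blue, red)
```

Both sets of weights give the same boundary line, but the larger weights push the predictions much closer to 1 and 0.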

## 5 – DL 53 Q Regularization

Now let me show you a subtle way of overfitting a model. Let’s look at the simplest data set in the world, two points: the point (1, 1), which is blue, and the point (-1, -1), which is red. Now we want to separate them with a line. I’ll give you two equations … Read more

## 4 – Model Complexity Graph

So, let’s start from where we left off, which is, we have a complicated network architecture which is more complicated than we need, but we need to live with it. So, let’s look at the process of training. We start with random weights in the first epoch and we get a model like this … Read more

## 3 – Underfitting And Overfitting

So, let’s talk about life. In life, there are two mistakes one can make. One is to try to kill Godzilla using a flyswatter. The other one is to try to kill a fly using a bazooka. What’s the problem with trying to kill Godzilla with a flyswatter? That we’re oversimplifying the problem. We’re trying … Read more

## 2 – Testing

So let’s look at the following data formed by blue and red points, and the following two classification models which separate the blue points from the red points. The question is which of these two models is better? Well, it seems like the one on the left is simpler since it’s a line and the … Read more

## 15 – Error Functions Around the World

So, in this Nanodegree, we covered a few error functions, but there are a bunch of other error functions around the world that made the shortlist, but we didn’t have time to study them. So, here they are. These are the ones you met: there is Mount Everest and Mount Kilimanjerror. The ones you … Read more

## 14 – Momentum

So, here’s another way to solve the local minimum problem. The idea is to walk a bit fast with momentum and determination in a way that if you get stuck in a local minimum, you can, sort of, power through and get over the hump to look for a lower minimum. So let’s look at … Read more
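One common way to write momentum is with a running velocity, sketched below. This is an assumed formulation, not necessarily the exact one used in the lesson, and `beta`, `learning_rate`, and the bowl-shaped error `x**2` are illustrative choices.

```python
def momentum_descent(gradient, start, learning_rate=0.1, beta=0.9, steps=500):
    # velocity remembers a decaying sum of the previous steps,
    # so the walker keeps moving even where the gradient is small.
    x = start
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity - learning_rate * gradient(x)
        x += velocity
    return x


def bowl_gradient(x):
    # Gradient of the simple error function x**2 (assumed for the demo).
    return 2 * x


final_x = momentum_descent(bowl_gradient, start=5.0)
print(final_x)
```

Whether momentum actually carries you over a given hump depends on `beta`, the learning rate, and the shape of the error function, so this only shows the update rule itself.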

## 13 – Learning Rate

The question of what learning rate to use is pretty much a research question itself but here’s a general rule. If your learning rate is too big then you’re taking huge steps which could be fast at the beginning but you may miss the minimum and keep going which will make your model pretty chaotic. … Read more
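As a rough illustration of the rule above, here is gradient descent on the error function `x**2` with two learning rates. The function and both rates are assumptions picked to make the effect easy to see.

```python
def descend(learning_rate, start=1.0, steps=50):
    # Gradient descent on the error x**2, whose gradient is 2*x.
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


too_big = descend(1.1)  # every step overshoots the minimum at 0
small = descend(0.1)    # every step moves a bit closer to 0

print(too_big, small)
```

With the large rate each step jumps past the minimum and lands farther away than before, so the value blows up; with the small rate it settles near zero.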

## 12 – Batch vs Stochastic Gradient Descent

First, let’s look at what the gradient descent algorithm is doing. So, recall that we’re up here at the top of Mount Everest and we need to go down. In order to go down, we take a bunch of steps following the negative of the gradient of the height, which is the error function. Each … Read more

## 11 – Other Activation Functions

The best way to fix this is to change the activation function. Here’s another one, the hyperbolic tangent, which is given by this formula underneath: e to the x minus e to the minus x, divided by e to the x plus e to the minus x. This one is similar to sigmoid, but since our … Read more
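The formula read out above can be written directly in code. The function name `tanh_from_formula` is just an illustrative choice, and the result is compared against Python’s built-in `math.tanh`.

```python
import math


def tanh_from_formula(x):
    # (e^x - e^-x) / (e^x + e^-x), as described in the text.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


for x in (-2.0, 0.0, 2.0):
    print(x, tanh_from_formula(x), math.tanh(x))
```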

## 10 – Vanishing Gradient

Here’s another problem that can occur. Let’s take a look at the sigmoid function. The curve gets pretty flat on the sides. So, if we calculate the derivative at a point way at the right or way at the left, this derivative is almost zero. This is not good because the derivative is what tells … Read more
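A quick sketch makes the flat sides concrete. It uses the standard identity that the derivative of the sigmoid is `sigmoid(x) * (1 - sigmoid(x))`; the sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    # Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x).
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid_derivative(x))
```

The derivative peaks at 0.25 in the middle and is tiny far out on either side.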

## 1 – Training Optimization

So by now we’ve learned how to build a deep neural network and how to train it to fit our data. Sometimes, however, we go out there and train it ourselves and find out that nothing works as planned. Why? Because there are many things that can fail. Our architecture can be poorly chosen, our … Read more

## 9 – XOR Perceptron

Now, I’m going to leave you with a question. Here is the XOR perceptron, which is very similar to the other two except this one returns a true if exactly one of them is true and the other one is false. So it returns this table. Now, the question is can we turn this into … Read more
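For reference, the XOR table described above can be printed with a few lines of code. This only spells out the truth table; `xor_gate` is an illustrative name and the code says nothing about how to build XOR from perceptrons.

```python
def xor_gate(a, b):
    # True when exactly one of the two inputs is true.
    return a != b


for a in (False, True):
    for b in (False, True):
        print(a, b, xor_gate(a, b))
```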

## 8 – AND And OR Perceptrons

So here’s something very interesting about perceptrons and it’s that some logical operators can be represented as perceptrons. Here, for example, we have the AND operator and how does that work? The AND operator takes two inputs and it returns an output. The inputs can be true or false but the output is only true … Read more
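Here is one possible way to write AND and OR as perceptrons. The weights and biases below are assumptions (many other values work), and the rule that a score of zero or more means true is also an assumed convention.

```python
def perceptron(weights, bias, inputs):
    # Weighted sum of the inputs plus a bias, followed by a step function.
    score = sum(w * x for w, x in zip(weights, inputs)) + bias
    return score >= 0


def and_gate(a, b):
    # Assumed weights (1, 1) and bias -1.5: only 1 + 1 clears the threshold.
    return perceptron((1.0, 1.0), -1.5, (int(a), int(b)))


def or_gate(a, b):
    # Assumed weights (1, 1) and bias -0.5: a single true input is enough.
    return perceptron((1.0, 1.0), -0.5, (int(a), int(b)))


for a in (False, True):
    for b in (False, True):
        print(a, b, and_gate(a, b), or_gate(a, b))
```

The only difference between the two gates here is the bias, which shifts the boundary line.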

## 7 – Why Neural Networks

So you may be wondering why are these objects called neural networks. Well, the reason why they’re called neural networks is because perceptrons kind of look like neurons in the brain. On the left we have a perceptron with four inputs. The numbers are one, zero, four, and minus two. And what the perceptron does, … Read more

## 6 – DL 06 Perceptron Definition Fix V2

So let’s recap. We have our data which is all these students. The blue ones have been accepted and the red ones have been rejected. And we have our model which consists of the equation two times test plus grades minus 18, which gives rise to this boundary, which is the set of points where the score is … Read more
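The model in this recap is small enough to write out. The score equation comes from the text; the rule that a non-negative score means accepted is an assumption, and the two example students are made up.

```python
def score(test, grades):
    # The equation from the recap: 2 * test + grades - 18.
    return 2 * test + grades - 18


def is_accepted(test, grades):
    # Assumption: a score of zero or more is predicted as accepted.
    return score(test, grades) >= 0


print(score(7, 6), is_accepted(7, 6))  # hypothetical student
print(score(4, 5), is_accepted(4, 5))  # hypothetical student
```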

## 5 – 09 Higher Dimensions

Now, you may be wondering what happens if we have more data columns, so not just test and grades, but maybe something else like the ranking of the student in the class. How do we fit three columns of data? Well the only difference is that now, we won’t be working in two dimensions, we’ll be … Read more
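In code, adding a column just means one more weight and one more feature. The sketch below is illustrative: the third weight (for class rank) and the example student are made-up values, and `linear_score` is a hypothetical helper name.

```python
def linear_score(weights, bias, features):
    # Works for any number of columns: one weight per feature, plus a bias.
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Two columns: test and grades.
two_columns = linear_score((2.0, 1.0), -18.0, (7.0, 6.0))

# Three columns: test, grades, and class rank (weight -0.5 is made up).
three_columns = linear_score((2.0, 1.0, -0.5), -18.0, (7.0, 6.0, 4.0))

print(two_columns, three_columns)
```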