# 4-6-1-9. Network Architectures in PyTorch

Hello, everyone, and welcome back.

So, in this video and in this notebook,

I’ll be showing you how to actually train neural networks in PyTorch.

So, previously, we saw how to define neural networks in PyTorch using the nn module,

but now we’re going to see how we actually take

one of these networks that we defined and train it.

So, what I mean by training is that we’re going to use

our neural networks as a universal function approximator.

What that means is that,

for basically any function,

we have some desired input, for example,

an image of the number four,

and then we have some desired output of this function.

In this case a probability distribution that

is telling us the probabilities of the various digits.

So, in this case,

if we pass in an image of a four,

we want to get out a probability distribution where there’s

a lot of probability on the digit four.

So, the cool thing about neural networks is that if

you use non-linear activations and then

you have a dataset of these images that are labeled with the correct classes,

then basically you pass in an image and the correct output,

the correct label or class,

and eventually your neural network will learn to approximate this function that is

converting these images into this probability distribution, and that’s our goal here.

So, basically we want to see how in PyTorch,

we can build a neural network and then we’re going to give it the inputs and outputs,

and then adjust the weights of that network so that it approximates this function.

So, the first thing that we need for that is what is called a loss function.

So, it’s sometimes also called the cost,

and what this is, is a measure of our prediction error.

So, we pass in the image of a four and then

our network predicts something else; that’s an error.

So, we want to measure how far away our network’s prediction is from the correct label,

and we do that using a loss function.

So, in this case, it’s the mean squared error.

So, a lot of times you’ll use this in regression problems,

but you’d use other loss functions in classification problems like this one here.

So, the loss depends on

the output of our network or the predictions our network is making.

The output of our network depends on the weights.

So, the network parameters.

So, we can actually adjust our weights such that this loss is minimized,

and once the loss is minimized,

then we know that our network is making as good predictions as it can.

So, this is the whole goal to adjust our network parameters to minimize our loss,

and we do this by using a process called gradient descent.

So, the gradient is the slope of the loss function with respect to our parameters.

The gradient always points in the direction of fastest change.

So, for example if you have a mountain,

the gradient is going to always point up the mountain.

So, you can imagine our loss function being like

this mountain where we have a high loss up here and we have a low loss down here.

So, we know that we want to get to the minimum of our loss when we minimize our loss,

and so, we want to go downwards.

So, basically, the gradient points upwards and so,

we just go the opposite direction.

So, we go in the direction of the negative gradient,

and then if we keep following this down,

then eventually we get to the bottom of this mountain, the lowest loss.

With multilayered neural networks,

we use an algorithm called backpropagation to do this.

Backpropagation is really just an application of the chain rule from calculus.

So, if you think about it when we pass in some data,

some input into our network,

it goes through this forward pass through the network to calculate our loss.

So, we pass in some data,

some feature input x and then it goes through

this linear transformation which depends on our weights and biases,

and then through some activation function like a sigmoid,

through another linear transformation with some more weights and biases,

and then that goes in,

from that we can calculate our loss.

So, if we make a small change in our weights here, W1,

it’s going to propagate through the network and

end up resulting in a small change in our loss.

So, you can think of this as a chain of changes.

So, if we change here, this is going to change.

Then that’s going to propagate through here,

it’s going to propagate through here, it’s going to propagate through here.

So, with backpropagation, we actually use these same changes,

but we go in the opposite direction.

So, for each of these operations like the loss,

the linear transformations, and the sigmoid activation function,

there’s always going to be some derivative,

some gradient between the outputs and inputs, and so,

what we do is we take each of the gradients for

these operations and we pass them backwards through the network.

At each step, we multiply the incoming gradient with the gradient of the operation itself.

So, for example, just starting at the end with the loss.

So, we pass back this gradient of the loss, dl/dL2.

So, this is the gradient of the loss with respect to the second linear transformation,

and then we pass that backwards again and we multiply it by the gradient of this L2.

So, this is the gradient of the linear transformation with respect to

the outputs of our activation function,

and that gives us the gradient for this operation.

If you multiply this gradient by the gradient coming from the loss,

then we get the total gradient for both of these parts,

and this gradient can be passed back to this sigmoid function.

So, that’s the general process for backpropagation: we take our gradient,

we pass it backwards to the previous operation,

multiply it by the gradient there,

and then pass that total gradient backwards.

So, we just keep doing that through each of the operations in our network,

and eventually we’ll get back to our weights.

What this does is it allows us to calculate

the gradient of the loss with respect to these weights.

Like I was saying before,

the gradient points in the direction of fastest change in our loss,

so, to maximize it.

So, if we want to minimize our loss,

we can subtract the gradient off from our weights,

and so, what this will do is it’ll give us a new set of weights

that will in general result in a smaller loss.

So, the way the backpropagation algorithm works is that it will

make a forward pass through the network, calculate the loss,

and then once we have the loss, we can go

backwards through our network and calculate the gradient,

and get the gradient for our weights.

Then we’ll update the weights.

Do another forward pass,

calculate the loss, do another backward pass, update the weights,

and so on and so on and so on,

until we get a sufficiently minimized loss.

So, once we have the gradient, like I was saying before,

we can subtract it off from our weights,

but we also use this term Alpha which is called the learning rate.

This is basically just a way to scale our gradients so that we’re not

taking too large steps in this iterative process.

So, what can happen if your update steps are too large is that

you can bounce around in this trough around

the minimum and never actually settle in the minimum of the loss.

So, let’s see how we can actually calculate losses in PyTorch.

Again using the nn module,

PyTorch provides us a lot of different losses including the cross-entropy loss.

So, this loss is what we’ll typically use when we’re doing classification problems.

In PyTorch, the convention is to assign our loss to a variable called criterion.

So, if we wanted to use cross-entropy,

we just say criterion equals nn.CrossEntropyLoss and create an instance of that class.

So, one thing to note is that,

if you look at the documentation for cross-entropy loss,

you’ll see that it actually wants the scores,

the logits of our network, as the input to the cross-entropy loss.

So, you’ll be using this with an output such as softmax,

which gives us this nice probability distribution.

But for computational reasons,

it’s generally better to use the logits, which are

the input to the softmax function as the input to this loss.

So, the input is expected to be the scores

for each class and not the probabilities themselves.

So, first I’m going to import the necessary modules,

and here I’m downloading our data and creating it,

like you’ve seen before, as a trainloader,

and so, we can get our data out of here.

So, here I’m defining a model.

So, I’m using nn.Sequential, and if you haven’t seen this,

check out the end of the previous notebook.

So, the end of part two

will show you how to use nn.Sequential.

It’s just a somewhat more concise way to define simple feed-forward networks, and so,

you’ll notice here that I’m actually only returning the logits,

the scores of our output function and not the softmax output itself.

Then here we can define our loss.

So, criterion is equal to nn.CrossEntropyLoss.

We get our data with images and labels,

flatten it, pass it through our model to get the logits,

and then we can get the actual loss by passing in our logits and the true labels,

and so, again we get the labels from our trainloader.

So, if we do this, we see we have calculated the loss.

So, in my experience, it’s more convenient to build your model

using a log-softmax output instead of just normal softmax.

So, with a log-softmax output to get the actual probabilities,

you just pass it through torch.exp. So, the exponential.

With a log-softmax output,

you’ll want to use the negative log-likelihood loss or nn.NLLLoss.

So, what I want you to do here is build

a model that returns the log-softmax as the output,

and calculate the loss using the negative log-likelihood loss.

When you’re using log-softmax,

make sure you pay attention to the dim keyword argument.

You want to make sure you set it right so that the output is what you want.

So, go and try this and feel free to check out my solution.

It’s in the notebook and also in the next video,

if you’re having problems. Cheers.

# Training Neural Networks

The network we built in the previous part isn’t so smart; it doesn’t know anything about our handwritten digits. Neural networks with non-linear activations work like universal function approximators. There is some function that maps your input to the output, for example, images of handwritten digits to class probabilities. The power of neural networks is that we can train them to approximate this function, and basically any function, given enough data and compute time.

At first the network is naive; it doesn’t know the function mapping the inputs to the outputs. We train the network by showing it examples of real data, then adjusting the network parameters such that it approximates this function.

To find these parameters, we need to know how poorly the network is predicting the real outputs. For this we calculate a loss function (also called the cost), a measure of our prediction error. For example, the mean squared loss is often used in regression and binary classification problems

$$
l = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

where $n$ is the number of training examples, $y_i$ are the true labels, and $\hat{y}_{i}$ are the predicted labels. The factor of $\frac{1}{2}$ is a common convention that cancels when the loss is differentiated.

By minimizing this loss with respect to the network parameters, we can find configurations where the loss is at a minimum and the network is able to predict the correct labels with high accuracy. We find this minimum using a process called gradient descent. The gradient is the slope of the loss function and points in the direction of fastest change. To get to the minimum in the least amount of time, we then want to follow the gradient (downwards). You can think of this like descending a mountain by following the steepest slope to the base.
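To make the descent idea concrete, here is a minimal sketch of gradient descent on a one-parameter "loss". The function $(w - 3)^2$, the starting point, and the step size are all assumptions chosen for illustration; they are not part of the MNIST example.

```python
# Minimal gradient descent on a toy loss l(w) = (w - 3)^2.
# The loss, the starting point and the step size are illustrative assumptions.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Analytic derivative dl/dw = 2 (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0        # arbitrary starting point
alpha = 0.1    # step size

for step in range(100):
    # The gradient points uphill, so we step in the opposite direction
    w = w - alpha * grad(w)

print(w, loss(w))
```

Each step shrinks the distance to the minimum at $w = 3$ by a constant factor, so the loop settles very close to it.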

## Backpropagation

For single layer networks, gradient descent is straightforward to implement. However, it’s more complicated for deeper, multilayer neural networks like the one we’ve built. Complicated enough that it took about 30 years before researchers figured out how to train multilayer networks.

Training multilayer networks is done through backpropagation which is really just an application of the chain rule from calculus. It’s easiest to understand if we convert a two layer network into a graph representation.

In the forward pass through the network, our data and operations go from bottom to top in this graph. We pass the input $x$ through a linear transformation $L_1$ with weights $W_1$ and biases $b_1$. The output then goes through the sigmoid operation $S$ and another linear transformation $L_2$. Finally we calculate the loss $l$. We use the loss as a measure of how bad the network’s predictions are. The goal then is to adjust the weights and biases to minimize the loss.

To train the weights with gradient descent, we propagate the gradient of the loss backwards through the network. Each operation has some gradient between the inputs and outputs. As we send the gradients backwards, we multiply the incoming gradient with the gradient for the operation. Mathematically, this is really just calculating the gradient of the loss with respect to the weights using the chain rule.

$$
\frac{\partial l}{\partial W_1} = \frac{\partial L_1}{\partial W_1} \frac{\partial S}{\partial L_1} \frac{\partial L_2}{\partial S} \frac{\partial l}{\partial L_2}
$$

Note: I’m glossing over a few details here that require some knowledge of vector calculus, but they aren’t necessary to understand what’s going on.

We update our weights using this gradient with some learning rate $\alpha$.

$$
W_1^\prime = W_1 - \alpha \frac{\partial l}{\partial W_1}
$$

The learning rate $\alpha$ is set such that the weight update steps are small enough that the iterative method settles in a minimum.
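As a rough illustration of the backward pass described above, the sketch below works through the two-layer graph by hand and compares the result against a finite-difference estimate. The layer sizes, the squared-error loss, and the variable names are assumptions made for this example only.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64  # double precision keeps the numerical check tight

# Hypothetical sizes: 1 example, 3 input features, 2 hidden units, 1 output
x = torch.randn(1, 3, dtype=dtype)
y = torch.tensor([[1.0]], dtype=dtype)
W1, b1 = torch.randn(3, 2, dtype=dtype), torch.zeros(2, dtype=dtype)
W2, b2 = torch.randn(2, 1, dtype=dtype), torch.zeros(1, dtype=dtype)

def forward(W1):
    L1 = x @ W1 + b1                      # first linear transformation
    S = torch.sigmoid(L1)                 # sigmoid activation
    L2 = S @ W2 + b2                      # second linear transformation
    loss = 0.5 * ((L2 - y) ** 2).sum()    # squared-error loss (an assumption)
    return L1, S, L2, loss

L1, S, L2, loss = forward(W1)

# Backward pass: multiply the incoming gradient by each operation's gradient
dl_dL2 = L2 - y                  # dl/dL2
dl_dS = dl_dL2 @ W2.T            # times dL2/dS
dl_dL1 = dl_dS * S * (1 - S)     # times dS/dL1 (derivative of the sigmoid)
dl_dW1 = x.T @ dl_dL1            # times dL1/dW1

# Check the result against central finite differences
eps = 1e-6
numeric = torch.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        plus, minus = W1.clone(), W1.clone()
        plus[i, j] += eps
        minus[i, j] -= eps
        numeric[i, j] = (forward(plus)[3] - forward(minus)[3]) / (2 * eps)

print(torch.allclose(dl_dW1, numeric, atol=1e-6))

# One gradient descent step with learning rate alpha
alpha = 0.1
W1_new = W1 - alpha * dl_dW1
```

In practice you would not write the backward pass by hand; the point here is only that each line of the backward pass is one factor of the chain rule.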

## Losses in PyTorch

Let’s start by seeing how we calculate the loss with PyTorch. Through the nn module, PyTorch provides losses such as the cross-entropy loss (nn.CrossEntropyLoss). You’ll usually see the loss assigned to criterion. As noted in the last part, with a classification problem such as MNIST, we’re using the softmax function to predict class probabilities. With a softmax output, you want to use cross-entropy as the loss. To actually calculate the loss, you first define the criterion then pass in the output of your network and the correct labels.

Something really important to note here. Looking at the documentation for nn.CrossEntropyLoss:

> This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
>
> The input is expected to contain scores for each class.

This means we need to pass in the raw output of our network into the loss, not the output of the softmax function. This raw output is usually called the logits or scores. We use the logits because softmax gives you probabilities which will often be very close to zero or one, and floating-point numbers can’t accurately represent values that close to zero or one. It’s usually best to avoid doing calculations with probabilities; typically we use log-probabilities.
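A small check of the statement quoted from the documentation: applying nn.CrossEntropyLoss to raw scores should give the same value as nn.LogSoftmax followed by nn.NLLLoss. The batch of random scores and the labels below are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical batch: 4 examples, 10 classes
logits = torch.randn(4, 10)
labels = torch.tensor([3, 0, 9, 4])

# Option 1: cross-entropy applied directly to the raw scores
loss_ce = nn.CrossEntropyLoss()(logits, labels)

# Option 2: log-softmax followed by the negative log likelihood loss
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, labels)

print(loss_ce.item(), loss_nll.item())
```

The two printed values should agree up to floating-point rounding, which is why the network can return logits when the criterion is nn.CrossEntropyLoss.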

# The MNIST datasets are hosted on yann.lecun.com that has moved under CloudFlare protection
# Reference: https://github.com/pytorch/vision/issues/1938

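from six.moves import urllib
opener = urllib.request.build_opener()
# Send a browser-like User-agent header so the download is not rejected
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)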

import torch
from torch import nn
import torch.nn.functional as F
from torchvision import datasets, transforms

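# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                ])

# Download and load the training data
# (the data directory is an assumption; the batch size of 64 matches the resize_(64, 784) call used later)
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Note: If you haven't seen nn.Sequential yet, please finish the end of the Part 2 notebook.

# Build a feed-forward network
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10))

# Define the loss
criterion = nn.CrossEntropyLoss()

# Get our data
dataiter = iter(trainloader)
images, labels = next(dataiter)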

# Flatten images
images = images.view(images.shape[0], -1)

# Forward pass, get our logits
logits = model(images)
# Calculate the loss with the logits and the labels
loss = criterion(logits, labels)

print(loss)

# TODO: Build a feed-forward network
model =

# TODO: Define the loss
criterion =

### Run this to check your work
# Get our data

images, labels = next(dataiter)

# Flatten images
images = images.view(images.shape[0], -1)

# Forward pass, get our logits
logits = model(images)
# Calculate the loss with the logits and the labels
loss = criterion(logits, labels)

print(loss)

In my experience it’s more convenient to build the model with a log-softmax output using nn.LogSoftmax or F.log_softmax. Then you can get the actual probabilities by taking the exponential torch.exp(output). With a log-softmax output, you want to use the negative log likelihood loss, nn.NLLLoss.

Exercise: Build a model that returns the log-softmax as the output and calculate the loss using the negative log likelihood loss. Note that for nn.LogSoftmax and F.log_softmax you’ll need to set the dim keyword argument appropriately. dim=0 calculates softmax across the rows, so each column sums to 1, while dim=1 calculates across the columns so each row sums to 1. Think about what you want the output to be and choose dim appropriately.
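To see what the dim argument does, this sketch applies F.log_softmax along each dimension of a small made-up score matrix and checks which way the resulting probabilities sum to one. The 3 by 5 shape is an arbitrary assumption.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical scores: 3 examples (rows), 5 classes (columns)
scores = torch.randn(3, 5)

# Exponentiate the log-softmax output to recover probabilities
probs_dim0 = torch.exp(F.log_softmax(scores, dim=0))
probs_dim1 = torch.exp(F.log_softmax(scores, dim=1))

print(probs_dim0.sum(dim=0))  # each column sums to 1
print(probs_dim1.sum(dim=1))  # each row sums to 1
```

For a batch laid out as one example per row, the second form gives one probability distribution per example.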

Now that we know how to calculate a loss, how do we use it to perform backpropagation? Torch provides a module, autograd, for automatically calculating the gradients of tensors. We can use it to calculate the gradients of the loss with respect to all our parameters. Autograd works by keeping track of operations performed on tensors, then going backwards through those operations, calculating gradients along the way. To make sure PyTorch keeps track of operations on a tensor and calculates the gradients, you need to set requires_grad = True on a tensor. You can do this at creation with the requires_grad keyword, or at any time with x.requires_grad_(True).

You can turn off gradients for a block of code with the torch.no_grad() context:

x = torch.zeros(1, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False

Also, you can turn on or off gradients altogether with torch.set_grad_enabled(True|False).
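The two switches mentioned above, requires_grad_ and torch.set_grad_enabled, can be tried out on a throwaway tensor. This is only a sketch; the tensor names are arbitrary.

```python
import torch

a = torch.ones(2, 2)            # created without gradient tracking
print(a.requires_grad)

a.requires_grad_(True)          # switch tracking on in place
print(a.requires_grad)

torch.set_grad_enabled(False)   # turn autograd off globally
b = a * 2
print(b.requires_grad)

torch.set_grad_enabled(True)    # turn it back on
c = a * 2
print(c.requires_grad)
```

Results computed while gradients are disabled are not tracked, even when their inputs have requires_grad set.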

The gradients are computed with respect to some variable z with z.backward(). This does a backward pass through the operations that created z.

x = torch.randn(2,2, requires_grad=True)
print(x)

"""
tensor([[-1.5539,  0.3890],
"""

y = x**2
print(y)

"""
tensor([[2.4146, 0.1513],
"""

# Below we can see the operation that created y, a power operation PowBackward0.

## grad_fn shows the function that generated this variable
print(y.grad_fn)

"""
<PowBackward0 object at 0x0000015D72A12148>
"""

z = y.mean()
print(z)

"""
"""

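# You can check the gradients for x and y but they are empty currently.
print(x.grad)

"""
None
"""

# Run the backward pass, then compare x.grad with the analytic result x/2
z.backward()
print(x.grad)
print(x/2)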

"""
tensor([[-0.7769,  0.1945],
[ 0.2544, -0.1241]])
tensor([[-0.7769,  0.1945],
"""



The autograd module keeps track of these operations and knows how to calculate the gradient for each one. In this way, it’s able to calculate the gradients for a chain of operations, with respect to any one tensor. Let’s reduce the tensor y to a scalar value, the mean.

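To calculate the gradients, you need to run the .backward method on a tensor, z for example. This will calculate the gradient for z with respect to x

$$
\frac{\partial z}{\partial x} = \frac{\partial}{\partial x}\left[\frac{1}{n}\sum_{i=1}^{n} x_i^2\right] = \frac{x}{2}
$$

with $n = 4$ elements here.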

These gradient calculations are particularly useful for neural networks. For training we need the gradients of the cost with respect to the weights. With PyTorch, we run data forward through the network to calculate the loss, then go backwards to calculate the gradients of the loss with respect to the weights. Once we have the gradients we can make a gradient descent step.

When we create a network with PyTorch, all of the parameters are initialized with requires_grad = True. This means that when we calculate the loss and call loss.backward(), the gradients for the parameters are calculated. These gradients are used to update the weights with gradient descent. Below you can see an example of calculating the gradients using a backwards pass.

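# Build a feed-forward network
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()
images, labels = next(dataiter)
images = images.view(images.shape[0], -1)

# The model ends in LogSoftmax, so its outputs are log-probabilities
logps = model(images)
loss = criterion(logps, labels)

print('Before backward pass: \n', model[0].weight.grad)

loss.backward()

print('After backward pass: \n', model[0].weight.grad)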

"""
Before backward pass:
None
After backward pass:
tensor([[-0.0007, -0.0007, -0.0007,  ..., -0.0007, -0.0007, -0.0007],
[-0.0043, -0.0043, -0.0043,  ..., -0.0043, -0.0043, -0.0043],
[ 0.0006,  0.0006,  0.0006,  ...,  0.0006,  0.0006,  0.0006],
...,
[ 0.0023,  0.0023,  0.0023,  ...,  0.0023,  0.0023,  0.0023],
[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
[ 0.0028,  0.0028,  0.0028,  ...,  0.0028,  0.0028,  0.0028]])
"""

## Training the network!

There’s one last piece we need to start training, an optimizer that we’ll use to update the weights with the gradients. We get these from PyTorch’s optim package. For example we can use stochastic gradient descent with optim.SGD. You can see how to define an optimizer below.

Now we know how to use all the individual parts so it’s time to see how they work together. Let’s consider just one learning step before looping through all the data. The general process with PyTorch:

• Make a forward pass through the network
• Use the network output to calculate the loss
• Perform a backward pass through the network with loss.backward() to calculate the gradients
• Take a step with the optimizer to update the weights

Below I’ll go through one training step and print out the weights and gradients so you can see how they change. Note that I have a line of code optimizer.zero_grad(). When you do multiple backwards passes with the same parameters, the gradients are accumulated. This means that you need to zero the gradients on each training pass or you’ll retain gradients from previous training batches.
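The accumulation behaviour is easy to see on a tiny stand-in model. The sketch below uses a single nn.Linear layer and random inputs, both assumptions made for illustration. Depending on the PyTorch version, optimizer.zero_grad() either fills the gradients with zeros or resets them to None; in both cases the next backward pass starts fresh.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

model = nn.Linear(3, 1)                          # tiny stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(4, 3)

# First backward pass
model(x).sum().backward()
first = model.weight.grad.clone()

# Second backward pass without zeroing: gradients are added together
model(x).sum().backward()
second = model.weight.grad.clone()
print(torch.allclose(second, 2 * first))

# Zero the gradients, then the next backward pass starts from scratch
optimizer.zero_grad()
model(x).sum().backward()
print(torch.allclose(model.weight.grad, first))
```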

### Training for real

Now we’ll put this algorithm into a loop so we can go through all the images. Some nomenclature: one pass through the entire dataset is called an epoch. So here we’re going to loop through trainloader to get our training batches. For each batch, we’ll do a training pass where we calculate the loss, do a backwards pass, and update the weights.

Exercise: Implement the training pass for our network. If you implemented it correctly, you should see the training loss drop with each epoch.
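If you want to see the overall shape of such a loop before attempting the exercise, here is one possible sketch. It runs on a small synthetic dataset rather than MNIST so that it needs no download; the dataset, layer sizes, and learning rate are all assumptions and not the notebook's values.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for MNIST: 512 examples, 20 features, 3 classes
features = torch.randn(512, 20)
labels = (features @ torch.randn(20, 3)).argmax(dim=1)
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(20, 32),
                      nn.ReLU(),
                      nn.Linear(32, 3),
                      nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(5):
    running_loss = 0.0
    for batch, targets in loader:
        optimizer.zero_grad()               # clear gradients from the previous batch
        output = model(batch)               # forward pass
        loss = criterion(output, targets)   # calculate the loss
        loss.backward()                     # backward pass
        optimizer.step()                    # update the weights
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(loader))
    print(f"Training loss: {epoch_losses[-1]:.4f}")
```

The same five steps inside the inner loop carry over to the MNIST network, with the images flattened to 784-long vectors first.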

from torch import optim

# Optimizers require the parameters to optimize and a learning rate
optimizer = optim.SGD(model.parameters(), lr=0.01)

print('Initial weights - ', model[0].weight)

images, labels = next(dataiter)
images.resize_(64, 784)

# Clear the gradients, because gradients are accumulated across backward passes
optimizer.zero_grad()

# Forward pass, then backward pass, then update weights
output = model(images)
loss = criterion(output, labels)
loss.backward()
print('Gradient -', model[0].weight.grad)

# Take an update step and view the new weights
optimizer.step()
print('Updated weights - ', model[0].weight)

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)

        # TODO: Training pass

        loss =

        running_loss += loss.item()
    else:
        # Runs once the inner loop has gone through every batch
        print(f"Training loss: {running_loss/len(trainloader)}")

# With the network trained, we can check out its predictions.

import helper

images, labels = next(dataiter)

img = images[0].view(1, 784)
# Turn off gradients to speed up this part