FastAI : Training a Digit Classifier - [Chapter-4] - Part II

A metric drives human understanding, while the loss drives automated learning.

Stochastic Gradient Descent –

Arthur Samuel described machine learning as follows:

Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.

Instead of trying to find the similarity between an image and an “ideal image” (as shown in Part I), we could look at each individual pixel and come up with a set of weights for each, such that the highest weights are associated with those pixels most likely to be black for a particular category.

For example, pixels towards the bottom right are not very likely to be activated for a 7, so they should have a low weight for a 7, but they are likely to be activated for an 8, so they should have a high weight for an 8.

This can be represented as a function and set of weight values for each possible category—for instance the probability of being the number 8:

def pr_eight(x,w): 
    return (x*w).sum()

Here we are assuming that x is the image, represented as a vector (in other words, with all of the rows stacked up end to end into a single long line), and that the weights are a vector w. If we have this function, then we just need some way to update the weights to make them a little bit better. With such an approach, we can repeat that step a number of times, making the weights better and better, until they are as good as we can make them. We want to find the specific values for the vector w that cause the result of our function to be high for those images that are actually 8s, and low for those images that are not. Searching for the best vector w is a way to search for the best function for recognising 8s.
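
As a small illustration of the weighted sum described above, the sketch below flattens a made-up 3x3 "image" into a vector and scores it against a hand-picked weight vector. Plain Python lists stand in for tensors here, so the function uses `sum` and `zip` rather than `(x*w).sum()`. The pixel and weight values are assumptions chosen only for the example.

```python
def pr_eight(x, w):
    # Weighted sum of pixel values: a list-based stand-in for (x*w).sum()
    return sum(xi * wi for xi, wi in zip(x, w))

# Hypothetical 3x3 image (1 = dark pixel, 0 = blank)
image = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

# Stack the rows end to end into a single long vector
x = [pixel for row in image for pixel in row]

# Hypothetical weights: the bottom-right pixel counts most towards "8"
w = [0, 0, 0, 0, 1, 0, 0, 0, 2]

score = pr_eight(x, w)
print(score)
```

A higher score means the dark pixels line up with the heavily weighted positions. The score is a raw number, not yet a probability.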


The loss (or cost) function quantifies the distance between the real and predicted values of the target. For a regression problem, one of the most popular loss functions is the sum of squared errors.
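
A minimal sketch of a squared-error loss for a regression problem, in plain Python. The function names and the sample predictions and targets are made up for illustration; the mean variant is included because it does not grow with the number of samples.

```python
def sum_squared_error(preds, targets):
    # Add up the squared distance between each prediction and its target
    return sum((p - t) ** 2 for p, t in zip(preds, targets))

def mean_squared_error(preds, targets):
    # Same quantity, averaged over the number of samples
    return sum_squared_error(preds, targets) / len(targets)

preds = [2.0, 4.0, 7.0]     # hypothetical model outputs
targets = [1.0, 4.0, 5.0]   # hypothetical true values

sse = sum_squared_error(preds, targets)    # 1 + 0 + 4
mse = mean_squared_error(preds, targets)   # the sum divided by 3
print(sse, mse)
```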

Gradient descent is an iterative optimization algorithm for finding a local minimum of a function. The main idea is to climb down a hill, step by step, until a local or global minimum of the cost is reached.

The gradient evaluated at any point gives the direction of steepest ascent up this hilly terrain. So we update the parameters in the negative direction of the gradient, hence the name gradient descent.

From Khan Academy document

So, the goal of gradient descent is to minimize the given loss/cost function by performing the following two steps iteratively:

  1. Compute the gradient (slope)
  2. Take a step in the direction opposite to the gradient
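
The two steps above can be sketched on a one-parameter function. The loss `(w - 3)**2`, the starting point, the learning rate and the number of iterations are all assumptions chosen so the minimum (at `w = 3`) is easy to check by hand.

```python
def loss(w):
    # Hypothetical loss with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0     # arbitrary starting point
lr = 0.1    # step size

for _ in range(100):
    g = gradient(w)   # 1. compute the gradient (slope)
    w -= lr * g       # 2. step in the direction opposite to the gradient

print(w, loss(w))
```

Each iteration moves `w` a fraction of the way towards 3, so the steps shrink naturally as the slope flattens out.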


Batch Gradient Descent – In this method, the weight update is calculated from all samples in the training set. With a very large dataset containing millions of data points, running batch gradient descent can be computationally very costly.

Stochastic Gradient Descent – Here the weights are updated incrementally for each training sample. SGD picks a random instance in the training set at every step and computes the gradients based only on that single instance. It typically reaches convergence faster because of the more frequent weight updates. The cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. This randomness helps the algorithm escape local minima, but it also means the algorithm never settles at the minimum. One way to address this is to gradually reduce the learning rate: the steps start out large and then get smaller and smaller, allowing the algorithm to settle close to the minimum.

So, going back to Figure 1, the steps are –

  1. Initialize random weights
  2. For each image, use these weights to predict whether it appears to be a 3 or a 7
  3. Based on these predictions, calculate how good the model is (its loss)
  4. Calculate the gradient, which measures for each weight how changing that weight would change the loss
  5. Step (change) all the weights based on that calculation
  6. Go back to step 2 and repeat the process
  7. Iterate until you decide to stop the training process.

In machine learning, the function has lots of weights that need to be adjusted, so when we calculate the derivative we won’t get back one number but lots of them: a gradient for every weight.
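
To make "a gradient for every weight" concrete, the sketch below approximates each partial derivative numerically by nudging one weight at a time (central finite differences). This is only an illustration in plain Python, not how an automatic differentiation library computes gradients; the loss function and its target values are made up.

```python
def loss(weights):
    # Hypothetical loss with its minimum at weights = [1.0, -2.0, 0.5]
    targets = [1.0, -2.0, 0.5]
    return sum((w - t) ** 2 for w, t in zip(weights, targets))

def numerical_gradient(f, weights, eps=1e-6):
    # Estimate d(loss)/d(weight_i) for every weight, one at a time
    grads = []
    for i in range(len(weights)):
        bumped_up = list(weights)
        bumped_down = list(weights)
        bumped_up[i] += eps
        bumped_down[i] -= eps
        grads.append((f(bumped_up) - f(bumped_down)) / (2 * eps))
    return grads

weights = [0.0, 0.0, 0.0]
grads = numerical_gradient(loss, weights)
print(grads)   # one number per weight
```

For this loss the exact gradient is `2 * (w - t)` per weight, so the estimates should come out close to -2, 4 and -1.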

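PyTorch can compute derivatives automatically. Calling the “requires_grad_()” method on a tensor tells PyTorch to track the operations performed on it, and calling the “backward()” method on the result then computes the gradients. The name “backward” refers to backpropagation, which is the process of calculating the derivative of each layer.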

The gradients tell us only the slope of our function; they don’t tell us exactly how far to adjust the parameters. But they do give us some intuition: if the slope is very large, that may suggest that we have more adjustments to do, whereas if the slope is very small, that may suggest that we are close to the optimal value.

Learning Rate –

Although the gradient gives us the direction in which we should go, the magnitude of the raw gradient might make us overshoot the minimum in the error landscape or approach it extremely slowly. The learning rate determines the size of the steps taken when updating the weights. If the learning rate is too small, the algorithm will need many iterations to reach the minimum of the loss function, which takes longer. If it is too high, you might keep jumping across the valley instead of converging, or even diverge.

To update the weights based on the learning rate –

w -= w.grad * lr 
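
The effect of the learning rate can be seen on a toy problem. The sketch below applies the same update rule to the function `w**2` (minimum at 0) with three different learning rates. The function, the starting point, the step count and the specific rates are assumptions picked for illustration, and `gradient(w)` stands in for `w.grad`.

```python
def gradient(w):
    # Derivative of the toy loss w**2
    return 2.0 * w

def run(lr, steps=20, w=1.0):
    # Apply the update rule w -= gradient * lr a fixed number of times
    for _ in range(steps):
        w -= gradient(w) * lr
    return w

too_small = run(lr=0.001)   # barely moves away from the start
reasonable = run(lr=0.1)    # gets close to the minimum at 0
too_large = run(lr=1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the largest rate each step lands on the opposite side of the valley, further away than before, so the distance from the minimum grows instead of shrinking.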

An End-to-End SGD Example –

Step 1: Initialize the weights/parameters

Step 2: Calculate the predictions

Step 3: Calculate the loss

Step 4: Calculate the gradients

Step 5: Step (update) the weights

Step 6: Repeat the process

Step 7: Stop
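
The seven steps can be sketched end to end on a tiny regression problem. This is a plain-Python illustration under stated assumptions, not the notebook from the original post: the data follow a made-up line `y = 2x + 1`, the model is `w*x + b`, the loss is the mean squared error, and the gradients are written out by hand instead of being computed by PyTorch.

```python
import random

random.seed(0)

# Hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
n = len(xs)

# Step 1: initialize the parameters randomly
w, b = random.random(), random.random()
lr = 0.05

for epoch in range(2000):
    # Step 2: calculate the predictions
    preds = [w * x + b for x in xs]

    # Step 3: calculate the loss (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n

    # Step 7: stop once the loss is small enough
    if loss < 1e-10:
        break

    # Step 4: calculate the gradients of the loss w.r.t. w and b
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n

    # Step 5: step (update) the weights
    w -= lr * grad_w
    b -= lr * grad_b
    # Step 6: the loop repeats the process

print(w, b)
```

The stopping rule here is a loss threshold with a cap on the number of epochs. In practice the decision to stop is usually based on how the loss and metrics evolve during training.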


Sigmoid –

The sigmoid function always outputs a number between 0 and 1. It is used for binary classification; for multi-class classification you can use softmax.

Sigmoid – Outputs P(target class | x) in [0, 1]

import torch

def sigmoid(x):
    return 1/(1+torch.exp(-x))
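
To contrast sigmoid with softmax, here is a scalar sketch using only the standard library. `sigmoid_scalar` and `softmax` are hypothetical helper names for this example and the input scores are made up. It also shows that softmax over the two scores `[x, 0]` gives the same probability as sigmoid applied to `x`.

```python
import math

def sigmoid_scalar(x):
    # Same formula as sigmoid above, for a single Python float
    return 1 / (1 + math.exp(-x))

def softmax(scores):
    # Subtract the maximum first for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p_binary = sigmoid_scalar(2.0)        # one probability for the target class
p_multi = softmax([2.0, 0.0, -1.0])   # one probability per class, summing to 1
p_two = softmax([2.0, 0.0])           # two-class softmax

print(p_binary, p_multi, p_two)
```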

Neural Network Model –

Example of a 3-layer neural network with 2 layers of hidden units

Activation Functions –

[Figure: activation functions such as Sigmoid, ReLU and Tanh]

The Loss Function –

The loss function is calculated for each item in our dataset, and then at the end of an epoch, the loss values are all averaged and the overall mean is reported for the epoch.

Metrics are the values printed at the end of each epoch that tell us how our model is doing. It is important to focus on these metrics, rather than on the loss, when judging the performance of a model.

SGD and Mini-Batches –

As mentioned earlier, we could calculate the loss for the whole dataset and take its average (batch gradient descent), or we could calculate it for a single data item (stochastic gradient descent), but neither approach is ideal. So instead we can calculate the average loss for a few data items at a time. This is called a mini-batch. The number of data items in the mini-batch is called the batch size. A larger batch size means that you will get a more accurate and stable estimate of your dataset’s gradients from the loss function, but each step will take longer and you will process fewer mini-batches per epoch.
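
A minimal sketch of how a dataset can be split into shuffled mini-batches for one epoch, using only the standard library. `mini_batches` is a hypothetical helper written for this example; in fastai or PyTorch this job is normally done by a data loader.

```python
import random

def mini_batches(items, batch_size, seed=0):
    # Shuffle the indices once per epoch, then slice them into batches
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [items[i] for i in indices[start:start + batch_size]]

data = list(range(10))   # stand-in for 10 training items

batches = list(mini_batches(data, batch_size=4))
bigger_batches = list(mini_batches(data, batch_size=5))

print([len(b) for b in batches])          # the last batch may be smaller
print([len(b) for b in bigger_batches])   # larger batches, fewer per epoch
```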

Adding Nonlinearity –

Stacking multiple linear layers on their own won’t improve the performance of a model. That is because when we multiply things together and add them up multiple times, as a sequence of linear layers does, the result is equivalent to a single linear layer with different parameters. Adding a nonlinear function (also called an activation function, for example Sigmoid, ReLU or Tanh) between linear layers decouples the layers and enables each layer to learn on its own.
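
The collapse of stacked linear layers can be checked with one-dimensional "layers". In this sketch the coefficients are arbitrary and `linear` is a hypothetical helper. Two linear functions applied in sequence match a single linear function exactly, while putting a ReLU between them breaks that equivalence.

```python
def linear(a, b):
    # A one-dimensional "linear layer": x -> a*x + b
    return lambda x: a * x + b

def relu(x):
    # Returns 0 for negative numbers, leaves positive numbers unchanged
    return max(0, x)

layer1 = linear(2, 1)     # 2x + 1
layer2 = linear(-3, 4)    # -3x + 4

def stacked(x):
    # Two linear layers with nothing in between
    return layer2(layer1(x))

# The same function as one layer: -3*(2x + 1) + 4 = -6x + 1
collapsed = linear(-3 * 2, -3 * 1 + 4)

def with_relu(x):
    # A nonlinearity between the layers
    return layer2(relu(layer1(x)))

print([stacked(x) for x in range(-3, 4)])
print([collapsed(x) for x in range(-3, 4)])
print([with_relu(x) for x in range(-3, 4)])
```

The first two printed lists are identical. The third differs for inputs where the first layer's output is negative, so it can no longer be written as a single linear layer.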

Why Deep –

Why do we use deeper models?

The reason is performance. Experiments have shown that smaller matrices with more layers tend to give better results than larger matrices with fewer layers.


ReLU – Function that returns 0 for negative numbers and doesn’t change positive numbers.
Mini-batch – A small group of inputs and labels gathered together in two arrays. A gradient descent step is updated on this batch (rather than a whole epoch).
Forward pass – Applying the model to some input and computing the predictions.
Loss – A value that represents how well (or badly) our model is doing.
Gradient – The derivative of the loss with respect to some parameter of the model.
Backward pass – Computing the gradients of the loss with respect to all model parameters.
Gradient descent – Taking a step in the direction opposite to the gradients to make the model parameters a little bit better.
Learning rate – The size of the step we take when applying SGD to update the parameters of the model.

Simple Neural Network Model for Full MNIST dataset

