Explaining Neural Network as Simple as Possible 2 — Gradient Descent

Slope, Gradients, Jacobian, Loss Function and Gradient Descent

Alex Punnen · Better ML · Mar 9, 2024

We saw in the previous article how the Perceptron, the predecessor of the modern neural network, worked. If you have not read it, please do so first: https://medium.com/@alexcpn/explaining-neural-network-as-simple-as-possible-the-perceptron-a0c33278b97f

We need a few more concepts to understand a modern neural network. A neural network is conceptually simple: it is a set of weight matrices arranged as layers, forming a computation chain that transforms an input matrix into the desired output matrix. This is called the Forward Pass of the network, or the inference flow. This is what runs when we interact with a neural network.

Neural Network Forward Pass View
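
To make this concrete, here is a minimal NumPy sketch of a forward pass. The two-layer shape, sigmoid activation and random weights are illustrative choices of mine, picked to match the toy network that appears later in this article:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative: one sample with 3 features flowing through two layers
x = np.array([[0.5, 0.1, 0.9]])      # 1 sample, 3 features
weight1 = np.random.random((3, 4))   # layer 1: 3 inputs -> 4 outputs
weight2 = np.random.random((4, 1))   # layer 2: 4 inputs -> 1 output

# Forward pass: matrix multiply, then a non-linear activation, per layer
a1 = sigmoid(x.dot(weight1))
output = sigmoid(a1.dot(weight2))
print(output)                        # the network's prediction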

However, the magic of neural networks lies in training them so that the forward pass gives the results we expect. This is where gradient descent comes in.

The beauty of a neural network is this: given a set of known inputs and their expected outputs (called the training set — for example, a set of cat and dog photos, each with a text label saying whether it is a cat or a dog), the weight values in each layer can be adjusted, using the simple mathematical concept of a gradient, so that the network output matches the expected output for most of the inputs. This is called training the neural network.

To iteratively adjust the weights in each layer during training using gradient information we use the gradient descent algorithm.

Neural Network Training — Calculate Loss
Neural Network Training — Calculate Gradient
Neural Network Training — Adjust Weights via Gradient Descent

Let’s see this in more detail.

Gradient Descent Algorithm

This algorithm iteratively finds the weights of a neural network, using the gradients calculated by backpropagation, to minimise the loss function. We will see each of these terms one by one. Compared to backpropagation, gradient descent is a simple but important concept. However, it is tied together with gradients and the loss function, and you need to understand a bit of all three to see how gradient descent works in a deep-learning context.

In the context of neural networks, the 1986 article in Nature by David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams introduced backpropagation. It references Yann LeCun's 1986 paper, Learning Process in an Asymmetric Threshold Network, which describes gradient descent — meaning that they used the gradient descent technique together with backpropagation. LeNet (1998) by Yann LeCun, and later AlexNet (2012) by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, cemented this approach firmly.

Take 1: Rolling down the hill

Well, think of it this way. You are on top of a hill on a foggy day and must reach the bottom. What would you do? You intuitively step down the slope. Or, if a ball has to find its way to the bottom, you roll it down the hill; it rolls all the way down, assuming there are no large hollows or pits along the path it can fall into. And if you have ever raced toy cars down a slope with your friends, you would have noticed that the steeper the slope, the faster the car reaches the bottom.

All these words — hollows (local minima), steep slope (gradient), etc. — are characteristics of your particular slope, and they influence this method.

Now if we transfer this physical picture to the maths realm — that is, if we assume the hill is a function and the bottom of the hill is the optimum/minimum we need to find — we can intuitively explain gradient descent in isolation. Let us first see where it is applied.

The Optimisation Problem and Gradient Descent Overview

The basic form of the optimisation problem is to find the minimum (or maximum) of a function f(x). Gradient descent is a computationally inexpensive way of finding it, using something called the gradient, and it can often find a minimum even when the function is not convex.

The gradient is a prerequisite for gradient descent, and it is the backpropagation algorithm that computes the gradient in the first place. More on this later.

The usual chart in explanations of gradient descent is something like the one below: a function similar to y = f(x) = x² − 16, which is geometrically a parabola.

Gradient descent — take 1

Almost intuitively, we know from this simple equation that f(x) is at its minimum at x = 0, and the value of y at x = 0 is the minimum of the function. This is because x is squared: substitute any value, negative or positive, and y only increases. So zero is where the minimum lies.

However, we do not need gradient descent for such a simple function of one variable/parameter. A neural network's loss is a function of very many parameters, there is no guarantee it is convex like the parabola, and in N dimensions its shape cannot be visualised easily — hence gradient descent.

Below is a slightly more realistic, complex function where gradient descent can still be used.

Gradient descent on a more complex function
Gradient descent converges irrespective of the initial starting point, here x = 0.01 (as long as x does not start exactly at zero, where the slope is zero)

Finding the second derivative is computationally expensive, and gradient descent does not use it; it relies only on the first derivative, which is why it is called a first-order method. For example, the graph above plots the function (x² − 16)². Its second derivative is 12x² − 64, so there is a range of x where the second derivative is negative (because of the −64 term); that is, the function is not convex for all values of x. Still, we can see empirically in the graphs above that gradient descent finds the minima — though there is no guarantee.
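
A quick way to check this empirically is to run the plain first-order update x ← x − αf′(x) on f(x) = (x² − 16)² from a few starting points. The starting points and learning rate below are arbitrary choices of mine, not taken from the original graphs:

# f(x) = (x**2 - 16)**2 has minima at x = 4 and x = -4; f''(x) = 12x**2 - 64
# is negative for |x| < ~2.31, so the function is not convex everywhere.
def df(x):
    return 4 * x * (x**2 - 16)      # first derivative of (x**2 - 16)**2

for x0 in [0.01, -0.5, 6.0]:        # arbitrary starting points (x = 0 would stall)
    x = x0
    for _ in range(10000):
        x = x - 0.001 * df(x)       # first-order update: no second derivative used
    print(x0, "->", round(x, 4))    # each run lands on 4.0 or -4.0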

Before we go further, we need to understand the concept of a gradient in a little more detail.

Gradient

The Calculus-based Explanation of Gradient

If you know calculus, you can arrive at the above "intuitive" result a little more rigorously. The first derivative at a point (more accurately, over an infinitesimally small delta) is the rate of change of the function at that point, or geometrically, the slope of the function there. If the slope dy/dx = 0, there is no change in y for even a small change in x, meaning the tangent line to the function at that point is horizontal.

The first derivative of the function is f′(x) = d/dx (x² − 16) = 2x. Equating this to zero (2x = 0) and solving, we arrive at x = 0.

This condition f’(x)=0 is necessary for a point to be considered a critical point, which could be a minimum, maximum, or saddle point.

You then use your knowledge of the second derivative — the rate of change of the first derivative, i.e. the rate of change of the slope, which is geometrically the curvature — to check whether the function is convex. If the second derivative/curvature is positive, the critical point is indeed a minimum, as the function is convex there.

For our example function, f″(x) = d²y/dx² = 2, which is positive, so the function is convex and x = 0 is the minimum; putting this value of x into the original function gives y = −16 as the minimum of the function.
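
If you want to verify these steps symbolically, a short SymPy sketch of mine (not from the article; it assumes SymPy is installed) can compute the derivative, solve f′(x) = 0, and check the second derivative:

import sympy as sp

x = sp.Symbol('x')
f = x**2 - 16
f1 = sp.diff(f, x)              # first derivative: 2*x
critical = sp.solve(f1, x)      # solve 2*x = 0  ->  [0]
f2 = sp.diff(f, x, 2)           # second derivative: 2 (positive => convex)
print(critical, f2, f.subs(x, critical[0]))   # [0] 2 -16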

If you do not know calculus and the above paragraph is unclear, let me try to explain the terms more simply.

We need a property similar to gravity to apply what we did in the physical space (rolling down the hill) in the maths realm. The slope of the function, given by dy/dx = ∆y/∆x (rise over run), is what we use here. It is the measure of steepness.

Wikipedia — slope

Let's take small incremental steps (scaled by the learning rate) in the direction opposite to the slope — towards the negative of the slope. We then move towards a local extremum, which in the context of optimisation is typically a minimum, assuming the function is convex like a parabola.

The gradient is the calculus counterpart of slope; it generalises the idea of slope to multiple dimensions.

Gradient for complex functions — Vector Calculus and the Jacobian

Calculus has two parts here: scalar calculus and vector calculus.

For single-parameter functions like x² − 16, the gradient is simply the first derivative of the function, f′(x) = dy/dx.

For multi-parameter functions like f(x, y) = x² + 3xy, it is the vector of partial derivatives of the function with respect to each of its parameters, written ∇f. For the function above, ∇f = (2x + 3y, 3x).
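
As a quick check, the same partial derivatives can be computed symbolically (again a SymPy sketch of mine, not from the article):

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 3*x*y
grad = [sp.diff(f, v) for v in (x, y)]   # partial derivatives wrt x and y
print(grad)                              # [2*x + 3*y, 3*x]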

All the above are examples of scalar-valued functions — functions that take one or more inputs and give a single output. The calculus that deals with these is scalar calculus.

Since neural networks have multiple layers, each composed of multiple weight vectors, scalar calculus is not enough. The same concepts have to be extended to this multi-variable, vector-valued domain.

A vector-valued function is a function that takes multiple inputs and gives multiple outputs, for example F(x, y) = (x² + y², 2xy). This function takes two inputs (x, y) and outputs a two-dimensional vector.

To get the gradient of such a function, we cannot take the derivative with respect to all variables in one go; scalar calculus alone will not do.

The concept of partial derivatives comes into play here: we take the partial derivative of the function with respect to each variable in turn. This is vector calculus — in the neural-network context it is mostly the same concepts, written out as matrices.

The gradient of such a function is the Jacobian matrix J(F). For our example function F(x, y) = (x² + y², 2xy), the Jacobian is

J(F) = | ∂(x² + y²)/∂x   ∂(x² + y²)/∂y |   =   | 2x  2y |
       | ∂(2xy)/∂x       ∂(2xy)/∂y     |       | 2y  2x |

Top Row: the partial derivatives of the first component of F(x, y):

  • ∂(x² + y²)/∂x and ∂(x² + y²)/∂y, giving [2x 2y]

Bottom Row: the partial derivatives of the second component of F(x, y):

  • ∂(2xy)/∂x and ∂(2xy)/∂y, giving [2y 2x]
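
One way to convince yourself of this is to approximate the Jacobian numerically with finite differences and compare it with the analytic answer. This is a sketch of mine; the jacobian helper and the step size h are illustrative choices, not from the article:

import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2, 2*x*y])

def jacobian(F, v, h=1e-6):
    # Finite-difference approximation: perturb each input, observe each output
    n, m = len(v), len(F(v))
    J = np.zeros((m, n))
    for j in range(n):
        dv = np.zeros(n)
        dv[j] = h
        J[:, j] = (F(v + dv) - F(v - dv)) / (2 * h)
    return J

print(jacobian(F, np.array([1.0, 2.0])))
# ~[[2. 4.]    i.e.  [[2x 2y]
#    [4. 2.]]         [2y 2x]]  at (x, y) = (1, 2)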

The Jacobians for all the layers are computed via the backpropagation algorithm and then used in gradient descent.

There is another concept we need to understand — the Loss or Cost Function.

Gradient Descent to find minima of the Loss or Cost Function

We have seen that gradient descent is used to solve an optimisation problem for a function. In neural networks, the function being optimised is the Loss or Cost Function.

If our neural network is supposed to give y_expected for an input x, but it gives y_output, the loss in its simplest form is just the difference y_expected − y_output, which we want to drive to zero.

To remove negatives we square this difference, and taking the mean over all samples gives us the Mean Squared Error loss: mean((y_expected − y_output)²).

More formally, with indices, the loss function can be written as

MSE = (1/N) Σᵢ (yᵢ − f(xᵢ))²,  i = 1 … N

where yᵢ is the expected output and f(xᵢ) is the calculated output for the i-th observation, and N is the total number of observations. MSE stands for Mean Squared Error.
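
In code, this loss is essentially a one-liner. The numbers below are made up purely for illustration:

import numpy as np

y_expected = np.array([1.0, 0.0, 1.0, 1.0])   # made-up target values
y_output   = np.array([0.9, 0.2, 0.8, 0.6])   # made-up network outputs
mse = np.mean((y_expected - y_output) ** 2)   # Mean Squared Error
print(mse)                                    # 0.0625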

Given the gradient at a point, gradient descent is the algorithm that uses this gradient information to find the parameters that minimise the loss function (the optimum being the parameters that reduce the loss to its minimum).

The parameters in the context of Neural Networks are weights (and biases).

So gradient descent optimises the weight values to minimise the loss function.

So gradient descent is rather simple, but it is tied up with a set of other things. It is as simple as rolling down a parabolic curve to its minimum; in the context of neural networks, the curve is the geometry of the N-dimensional loss function.

Let's see the simple, theoretical explanation first.

Algorithm

  1. Initialization: Start with an initial guess for x.
  2. Gradient Calculation: Compute the gradient ∇f(x) (or f′(x)) at the current x. (Note that gradient descent does NOT calculate the gradient in neural networks — that is done via backpropagation.)
  3. Update: Update x by taking a step in the direction opposite to the gradient: x = x − αf′(x), where α is the learning rate, a positive scalar determining the step size.
  4. Iteration: Repeat the gradient calculation and update steps until x converges to a minimum, based on a predefined convergence criterion or after a set number of iterations.

The step size / learning rate matters. In the theoretical example below, we will find the minimum of g(x) = x² − 16 using the above gradient descent algorithm.

Note that the derivative of x² − 16 is 2x.

import numpy as np
import matplotlib.pyplot as plt

# Define the function g(x) and its derivative dg(x)
def g(x):
    return x**2 - 16

def dg(x):
    return 2 * x

# Gradient descent algorithm
def gradient_descent(dg, x_start, learning_rate, tol=1e-6, max_iter=1000):
    x = x_start
    xs = [x]
    for _ in range(max_iter):
        grad = dg(x)
        x_new = x - learning_rate * grad   # step against the gradient
        xs.append(x_new)
        if abs(x_new - x) < tol:           # stop when updates become tiny
            break
        x = x_new
    return x, xs

# Initial guess and learning rate
x_start = 6
learning_rate = 0.001

# Run gradient descent
minimum, xs = gradient_descent(dg, x_start, learning_rate)

# Plotting
x = np.linspace(-7, 7, 400)
y = g(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='g(x) = $x^2 - 16$')
plt.axhline(0, color='black', lw=0.7)
plt.scatter(xs, g(np.array(xs)), color='red', zorder=5)
plt.plot(xs, g(np.array(xs)), color='red', linestyle='--', label='Iterations')
plt.title('Gradient Descent')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.legend()
plt.grid(True)
plt.show()
print(f"Minimum found at: {minimum}")

Output of the above code. (Go ahead, copy and paste it into Colab, and try changing the learning rate, max_iter, etc., to see if it converges better.)

This is simple code that demonstrates the algorithm. However, it is hard to understand how gradient descent is used in neural networks without actually seeing it in practice in a code snippet.

Gradient Descent in Neural Networks

Here is the short code of a two-layer neural network taken from the first part of my Colab notebook. The network is simple enough that you can understand it at a high level from the code below. Since we are focusing only on the gradient descent part, you need to understand nothing else but the for loop and its last part.


import numpy as np

# --- Defined here for completeness; the notebook defines these earlier ---
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def derv_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

# Illustrative toy training data (assumed, not from the notebook):
# 4 samples with 3 features each, and the expected output y
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])
# --------------------------------------------------------------------------

# set learning rate as 1 for this toy example
learningRate = 1

# Randomly initialised weights
weight1 = np.random.random((3, 4))
weight2 = np.random.random((4, 1))

# Activation to layer 0 is taken as input x
a0 = x

iterations = 10000
for iter in range(0, iterations):

    # The forward pass
    z1 = np.dot(x, weight1)
    a1 = sigmoid(z1)
    z2 = np.dot(a1, weight2)
    a2 = sigmoid(z2)
    if iter == 0:
        print("Initial Output\n", a2)

    # Backward Pass - Backpropagation
    delta2 = (a2 - y)
    # ---------------------------------------------------------------
    # Calculating change of Cost/Loss wrt weight of last layer
    # Eq (A) ---> dC_dw2
    # ---------------------------------------------------------------
    dC_dw2_1 = delta2 * derv_sigmoid(z2)
    dC_dw2 = a1.T.dot(dC_dw2_1)

    # ---------------------------------------------------------------
    # Calculating change of Cost/Loss wrt weight of 2nd-last layer
    # Eq (B) ---> dC_dw1 = derv_sigmoid(z1)*delta2*derv_sigmoid(z2)*weight2*a0.T
    # ---------------------------------------------------------------
    dC_dw1 = np.multiply(dC_dw2_1, weight2.T) * derv_sigmoid(z1)
    dC_dw1 = a0.T.dot(dC_dw1)

    # ---------------------------------------------------------------
    # Gradient descent
    # ---------------------------------------------------------------
    weight2 = weight2 - learningRate * dC_dw2
    weight1 = weight1 - learningRate * dC_dw1

print("New output\n", a2)

# ---------------------------------------------------------------
# Training is done; weight1 and weight2 are primed for output y
# ---------------------------------------------------------------

Backpropagation calculates the gradient for each layer, and gradient descent updates the weights for each layer at the end of the loop:

# ---------------------------------------------------------------
# Gradient descent
# ---------------------------------------------------------------
weight2 = weight2 - learningRate * dC_dw2
weight1 = weight1 - learningRate * dC_dw1

The most important parts here are dC_dw2 — the gradient of the cost function with respect to the weights in layer 2 — and dC_dw1, the gradient of the cost function with respect to the weights in layer 1.

We are adjusting the weight vectors in each layer along the path of steepest descent to minimise the loss function (to get the output of the neural net as close to the target output as possible).

In a bit more detail: we adjust the weights in each layer in proportion to how much that layer's weights contribute to the final loss. This proportionality is the gradient of the cost with respect to each layer's weights (dC_dw2 is how a small change in the layer-2 weights affects the cost, dC_dw1 is how a small change in the layer-1 weights affects the cost, and so on).

It is a great leap forward if you can understand this much. Take a break.

Alternative explanations

Gradient descent can also be explained theoretically as approximating a function by a straight line at a particular point and then following the slope of that line. The Taylor series expands a complex function into a sum of linear components built from its derivatives (1st, 2nd, 3rd, … to infinity), and truncating after the first derivative gives exactly this linear approximation. Most academic treatments explain it this way first.
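
In symbols (a short derivation of mine, keeping only the first-order Taylor term around the current point x):

f(x + ∆x) ≈ f(x) + f′(x)·∆x

Choosing the step against the slope, ∆x = −α·f′(x) with a small α > 0, gives

f(x − α·f′(x)) ≈ f(x) − α·f′(x)² ≤ f(x)

so each update can only decrease f, as long as the step is small enough for the linear approximation to hold.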

Another way to explain gradient descent is by contrasting it with the Newton-Raphson method, which is a second-order method. Newton's method can be used, for example, to find the square root of a number. Assume we need to find √16; this is the same as solving the equation x² − 16 = 0. We start with an approximate value and iteratively improve it. Let x₀ be the initial value. The derivative of the function at x₀ is the slope of the tangent at that point, and the main intuition is to take the point where this tangent line intercepts the x-axis as the next guess x₁, repeating the process iteratively.

We can see in the figure below this tangent touching the x-axis at x₁ = 13.04 for x₀ = 0.626.

So here we are doing two things: we take the slope/gradient/first derivative at a point, and then we take the x-intercept of that tangent line as the next point. Algebraically, the update is xₙ₊₁ = xₙ − f(xₙ)/f′(xₙ). This is the geometric interpretation.

Newton Raphson Method

The Newton-Raphson method itself is of no direct interest to us. But it shows that the function at a point can be approximated as a straight line, whose x-intercept is taken as the next estimate. How does finding a root help in optimisation? Because the maxima, minima and inflexion points of f(x) occur where f′(x) = 0, as we saw earlier. So we can use the same method, but instead of solving f(x) = 0 we solve f′(x) = 0. Substituting f′ for f in the update above, we get

xₙ₊₁ = xₙ − f′(xₙ)/f″(xₙ)

That is, in Newton's method applied to optimisation, we need the second derivative of the function. This makes it computationally intensive.

Note that Newton's method is precise. However, because it depends on the second derivative (curvature), it works reliably only for specific types of functions — convex functions.
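
For contrast, here is the Newton-Raphson root-finding iteration for √16 as a short sketch (the starting guess is an arbitrary choice of mine):

# Newton-Raphson for f(x) = x**2 - 16:  x_new = x - f(x) / f'(x)
x = 6.0                              # arbitrary starting guess
for _ in range(6):
    x = x - (x**2 - 16) / (2 * x)    # jump to the tangent's x-intercept
    print(x)                         # converges to 4.0 in a handful of steps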

The gradient descent method is much simpler than Newton's method.

Gradient descent needs only the first derivative, and hence it is fast.

Contrasted with Newton's method, which uses the more precise second derivative, gradient descent is more of an approximation. It depends heavily on the initial guess, on a guessed step size, and on there being no local minima (where the slope is zero) that could be mistaken for the optimum. Empirically, however, it works well in practice with a few adaptations, like adaptive learning rates and tracking momentum. This is well covered in this article: https://www.ruder.io/optimizing-gradient-descent/
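
To give a flavour of one such adaptation, here is the plain momentum update next to the vanilla one, applied to our earlier g(x) = x² − 16. This is a sketch of mine; β = 0.9 and the learning rate are typical values I have assumed, not taken from the article:

# Vanilla gradient descent:  x = x - lr * grad
# Momentum: a velocity term accumulates past gradients and smooths the descent
beta, lr, x = 0.9, 0.001, 6.0           # typical values (assumed)
velocity = 0.0
for _ in range(1000):
    grad = 2 * x                        # gradient of g(x) = x**2 - 16
    velocity = beta * velocity + grad   # exponentially weighted gradient history
    x = x - lr * velocity
print(x)                                # approaches the minimum at x = 0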

You can see how this affects network training in practice in my notebook, where I faced this problem with a slightly more complex network (than the toy above) and had to use the Adam optimizer for gradient descent (along with Xavier/Glorot initialisation of the weights) to get a good set of gradients so that gradient descent works properly. Note that the code related to Adam and a few other parts of my notebook were generated with the help of ChatGPT-4/Gemini Pro, and I used it to learn. In practice, you would just use Adam from the PyTorch or TensorFlow libraries.
