Mathematics of Deep Learning

Jan 16, 2018


In the previous post we looked at the basic reasons behind the success of Deep Learning and covered the commonly used activation functions. In this post we’ll look at the mathematics behind Deep Learning. Most beginners in the field know (Deep Learning - Mathematics), that is, Deep Learning minus the mathematics. This is probably because they never picked up the basic intuition behind the mathematical part of Deep Learning. I’ll try to cover most of it in this post. So, without wasting any more time, let’s begin!

Derivative: The derivative of a function of a real variable measures the sensitivity to change of the function value (output value) with respect to a change in its argument (input value).

For example, if Position = x(t), then Speed = dx(t)/dt, where Speed gives us the rate of change of the position function x(t) w.r.t. time.
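To make the position and speed example concrete, here is a minimal Python sketch. The position function x(t) = 3t² and the helper names `position` and `speed` are assumptions chosen purely for illustration; the derivative is approximated numerically with a small change h in the input.

```python
def position(t):
    # Hypothetical position function x(t) = 3 * t^2 (an assumption for illustration).
    return 3.0 * t ** 2


def speed(t, h=1e-6):
    # Central-difference approximation of dx/dt:
    # how much the output changes for a small change in the input.
    return (position(t + h) - position(t - h)) / (2.0 * h)


# The exact derivative of 3 * t^2 is 6 * t, so this should be close to 12.
print(speed(2.0))
```

The smaller h is, the closer this ratio gets to the true derivative, up to floating-point rounding.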

Partial Derivative: When our function has more than one input, we need to specify which variable we are differentiating with respect to, while holding the others fixed. Let’s look at the most common example used to explain gradient descent (we’ll look into it again later), i.e., the Mountain Descent example.

Let’s say that elevation(y) at any point is a function of North(n) and East(e). (Obviously South = -n and West = -e)

Therefore, y = f(n, e). The rates of change of elevation along n and e are ∂y/∂n and ∂y/∂e respectively. But the fastest change in the downward direction (downward, as we are considering Mountain Descent) won’t necessarily be along either of these two axes, right?
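As a rough sketch of what ∂y/∂n and ∂y/∂e mean, the Python snippet below uses a made-up hill, y = 100 - n² - 2e², which is an assumption and not taken from any real terrain. Each partial derivative varies one input while keeping the other fixed.

```python
def elevation(n, e):
    # Hypothetical hill y = f(n, e) = 100 - n^2 - 2 * e^2, with its peak at (0, 0).
    return 100.0 - n ** 2 - 2.0 * e ** 2


def partial_n(n, e, h=1e-6):
    # Change n slightly while holding e fixed.
    return (elevation(n + h, e) - elevation(n - h, e)) / (2.0 * h)


def partial_e(n, e, h=1e-6):
    # Change e slightly while holding n fixed.
    return (elevation(n, e + h) - elevation(n, e - h)) / (2.0 * h)


# Exact values at (n, e) = (1, 2) are -2 * n = -2 and -4 * e = -8.
print(partial_n(1.0, 2.0))
print(partial_e(1.0, 2.0))
```

At this point the elevation drops four times faster when walking East than when walking North, yet neither axis alone is the steepest way down.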

It’ll be in whichever direction the hill descends most steeply in the 2-D plane (our n-e plane). Look at the image below for a better understanding.

Direction of Steepest Descent (Green)

In the 2-D plane of n and e, the direction of the most abrupt change is a 2-D vector whose components are the partial derivatives w.r.t. each variable.

We call this vector the Gradient (written ∇, the del operator).

The magnitude of the gradient is the value of this steepest slope. The gradient is an operation that takes a function of multiple variables and returns a vector. How is the gradient different from the derivative? The gradient is vector-valued, whereas the derivative of a single-variable function is scalar-valued.

The components of this vector are all the partial derivatives of the function, i.e., ∇y = (∂y/∂n, ∂y/∂e).

Thus if we want to go downhill, all we have to do is walk in the direction opposite to the gradient.
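The sketch below puts these pieces together in Python for a hypothetical hill, y = 100 - n² - 2e². The function and the helper name `gradient` are assumptions for illustration. It builds the gradient from the two partial derivatives, takes its magnitude as the steepest slope, and flips its sign to get the downhill direction.

```python
import math


def elevation(n, e):
    # Hypothetical hill with its peak at (0, 0); an assumption for illustration.
    return 100.0 - n ** 2 - 2.0 * e ** 2


def gradient(f, n, e, h=1e-6):
    # Numerical gradient: one partial derivative per input variable.
    dn = (f(n + h, e) - f(n - h, e)) / (2.0 * h)
    de = (f(n, e + h) - f(n, e - h)) / (2.0 * h)
    return (dn, de)


grad = gradient(elevation, 1.0, 2.0)   # roughly (-2, -8): points uphill, toward the peak
steepness = math.hypot(*grad)          # magnitude of the gradient = steepest slope
downhill = (-grad[0], -grad[1])        # opposite direction = steepest descent

print(grad, steepness, downhill)
```

Here the gradient points back toward the peak at (0, 0), so its negative points away from the peak, which is the way down.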

Let us take a function f of a single variable w. Consider a value of w, say w1, and its corresponding output f(w1). To find a point w2 that is closer to the minimum of f(w), we follow the path indicated by the derivative. (Similarly, in the case of multiple variables, this path is indicated by the Gradient.)


Thus, w2 = w1 - (∂f/∂w), with the derivative evaluated at w1 (the negative sign is there because we are trying to go down the function). Suppose w1 lies to the left of the minimum: the slope there is negative, so w2 > w1, and, provided the step is not too large, we get a lower value of f(w). We reach the same conclusion if we try this on the right side of the minimum. Do try it; you’ll be sure that you’ve understood the concept clearly!
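Here is a small Python sketch of this update rule, repeated many times. The function f(w) = (w - 3)² and the `step` factor are assumptions for illustration; the text's update corresponds to step = 1, while a smaller step is used here so each move stays small enough to keep going down.

```python
def f(w):
    # Hypothetical function with its minimum at w = 3.
    return (w - 3.0) ** 2


def df(w):
    # Exact derivative of f with respect to w.
    return 2.0 * (w - 3.0)


step = 0.1  # assumed step size, not part of the text's formula

# One update from w1, which lies to the left of the minimum (negative slope).
w1 = 0.0
w2 = w1 - step * df(w1)
print(w1, f(w1))
print(w2, f(w2))  # w2 > w1 and f(w2) < f(w1)

# Repeating the same update walks w toward the minimum at w = 3.
w = w1
for _ in range(50):
    w = w - step * df(w)
print(w, f(w))
```

Starting from the right side instead (for example w1 = 6.0) gives a positive slope, so the same update moves w to the left, again toward the minimum.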

This is the basic idea behind Gradient Descent: we try to find the minimum of a function. Gradient Descent goes hand in hand with Backpropagation, which computes the gradients it needs (we’ll look into it later). Bear with me, as by the end of this series we’ll understand and relate everything very clearly.

Hopefully this post gave you the intuition behind some of the mathematics of Deep Learning. We’ll look into a basic model of a Neural Network in the next post and relate this mathematics to it. We’ll cover more mathematical concepts as we progress in this series. As always, feedback and contributions are highly appreciated.
