Success of Deep Learning.

Nov 3, 2017


In the last post, I gave you an overview of Deep Learning. In this post we’ll focus on the main reason behind the success of Deep Learning, i.e., Neural Networks.

Almost all the major companies use Deep Learning for tasks involving speech recognition, speech synthesis and machine translation. As far as machine translation is concerned, since deep learning encodes language as vector representations, it is even possible to learn translations between pairs of languages on which the model has not been trained. In other words, if the model knows how to translate from German to English and from English to French, it can also translate directly from German to French without going through English as an intermediate language.

Deep Learning has won almost all the recent competitions on image classification, and on some benchmarks it now matches or exceeds human performance at recognizing objects in an image.

It has also been used successfully for image captioning, which means generating a textual description of the content of the image.

Industrial applications of Deep Learning range from controlling the temperature of data centers to managing crops and agricultural planning, and even to autonomous vehicles.

In fact, self-driving car research is no longer limited to large companies. Startups are entering the market thanks to how cheap and powerful deep learning is for such systems.

I’d like to emphasize here that the main reason behind this amazing success of Deep Learning models over traditional Machine Learning models is the advancement in Neural Networks.

Artificial Neural Networks (ANNs) take their name and inspiration from biology. Brain tissue is composed of cells called neurons. Neurons exchange signals with one another and form a very complex and dense network. ANNs are simpler than the neural networks found in the brain, but they do share a couple of key components in the architecture.

Of all the branches of a neuron, only one carries an output signal, whereas there may be multiple inputs.

Now I’m taking a leap of faith and hoping that you are familiar with the basics of Machine Learning. So let’s look at how you’d represent the operation of Linear Regression as an ANN like the one in this figure. This network has only one node, the output node. This node is connected to the input X by the weight W. A second edge, carrying the value ‘b’ called the bias, is also connected to the node. The output of the node is y, calculated by the formula given in the figure.
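To make the single-node picture concrete, here is a minimal sketch in Python. The figure’s formula is not reproduced here, so I’m assuming the usual form y = X·W + b; the function name and the numbers are made up for illustration.

```python
import numpy as np

def linear_node(X, W, b):
    """Single output node: weighted sum of the inputs plus a bias.

    Assumes the usual linear regression form y = X . W + b.
    """
    return np.dot(X, W) + b

# Hypothetical values, chosen only for illustration.
X = np.array([1.0, 2.0, 3.0])   # three input features
W = np.array([0.5, -1.0, 2.0])  # one weight per input
b = 0.5                         # bias

y = linear_node(X, W, b)
print(y)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```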

As X and W are vectors with n components, we can easily perform Linear Regression for multiple inputs. Thus, we can now visually represent Linear Regression with as many inputs as we like. Now my question for you is: how would you extend this graph to allow for a binary output? If we can do that, we can also represent Logistic Regression. I’m sure you guessed it right: we pass the output of this node through a Sigmoid function, which maps it to the interval 0–1 and thus predicts the probability of a binary outcome. The Sigmoid function is just a special case of what is called an Activation Function.
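As a rough sketch of that extension, the same node can be wrapped in a sigmoid so that its output lands between 0 and 1. The names `sigmoid` and `logistic_node` and the example values are my own, used only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_node(X, W, b):
    """Linear node followed by a sigmoid activation."""
    return sigmoid(np.dot(X, W) + b)

# Hypothetical values, chosen so that the weighted sum is exactly 0.
X = np.array([1.0, 2.0])
W = np.array([0.5, -0.25])
b = 0.0

p = logistic_node(X, W, b)
print(p)  # the weighted sum is 0, so the predicted probability is 0.5
```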

The first ANN to implement binary classification used a different activation function and is called the Perceptron. The perceptron is a binary classifier that uses a Step Function instead of a Sigmoid function.
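Here is a small sketch of a perceptron with hand-picked weights that behaves like a logical AND. The weights, the bias, and the choice of returning 1 when the weighted sum is exactly zero are my assumptions for the example, not something prescribed above.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0 (convention at 0 is a choice)."""
    return np.where(z >= 0, 1, 0)

def perceptron(X, W, b):
    """Weighted sum followed by a step function."""
    return step(np.dot(X, W) + b)

# Hand-picked (not learned) parameters that implement logical AND.
W = np.array([1.0, 1.0])
b = -1.5

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(perceptron(inputs, W, b))  # [0 0 0 1]
```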

“Too much talk on activation functions buddy. But WHAT ARE ACTIVATION FUNCTIONS?”

Alright, alright. Let’s look into activation functions!

Activation functions are non-linear functions applied to the output of a node before it is passed to the next node. They are a key ingredient of neural networks and are what makes them so versatile. An activation function acts as a decision-making function that determines the presence of a particular neural feature. With the sigmoid, for example, the output is mapped between 0 and 1, where zero means the feature is absent and one means it is present.

“Umm.. So.. Why should activation function be non-linear?”

Good question! Find the answer here.
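One way to see it for yourself: without a non-linearity in between, two stacked layers collapse into a single linear layer. The sketch below checks this numerically with random matrices; the shapes and variable names are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random input and two "layers" with no activation function between them.
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as ONE linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

So however many purely linear layers you stack, the network can only represent a linear function of its input.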

Let’s now quickly look into commonly used activation functions.


We use the sigmoid when defining Logistic Regression. All we need to worry about is the equation and the graph of the sigmoid function. Now it’s your task to write a function for the sigmoid. If you are unable to do so, refer to this link.
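If you want to check your attempt, here is one possible version using the standard definition 1 / (1 + e^(-x)). It is a minimal sketch and makes no attempt to guard against overflow for very large negative inputs.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))  # close to 0, exactly 0.5, close to 1
```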


The step function gives a result similar to the sigmoid for very large positive and very large negative values of x, snapping the positive values to 1 and the negative values to zero. It does so with a very sharp jump at x=0. As already mentioned above, the step function is the activation used in the perceptron. Again, try to write your own step function in Python. You may refer to this link.
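One possible version is sketched below. What the function returns at exactly x = 0 is a matter of convention; here I assume 1.

```python
import numpy as np

def step(x):
    """Step function: 1.0 for x >= 0, 0.0 otherwise (value at 0 is a convention)."""
    return np.where(x >= 0, 1.0, 0.0)

x = np.array([-5.0, -0.1, 0.1, 5.0])
print(step(x))  # [0. 0. 1. 1.]
```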


Tanh, or Hyperbolic Tangent, is similar to the sigmoid function. It varies from -1 to 1, so negative values of x map to negative outputs instead of values close to zero. Code here.
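NumPy already ships `np.tanh`, so a sketch is short. To show how close tanh is to the sigmoid, the snippet also checks the identity tanh(x) = 2·sigmoid(2x) − 1; the helper names are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, output in (-1, 1)."""
    return np.tanh(x)

x = np.linspace(-3.0, 3.0, 7)

# tanh is a rescaled, shifted sigmoid.
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```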


ReLU (Rectified Linear Unit, or Rectifier) is often more effective than sigmoid and tanh in a neural network. Why? The original motivation came from biology, and I don’t know of a complete theoretical explanation. One commonly cited practical reason is that its gradient does not shrink toward zero for large positive inputs, unlike sigmoid and tanh. If you know of other reasons, please leave them in the comments below. It is the most popular activation function for deep neural networks. As you can infer from the graph, ReLU is not bounded on the positive side. This turns out to be useful for improving the training speed.
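A minimal sketch of the rectifier, using the standard definition max(0, x):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: keeps positive values, zeroes out negatives."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))  # [0. 0. 3.]
```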


The Softplus function is a smoother version of the ReLU. It is defined as log(1+e^x).
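A possible sketch of softplus is below. Instead of writing log(1 + e^x) directly, I use `np.logaddexp(0, x)`, which computes the same quantity, log(e^0 + e^x), but avoids overflow for large x; that choice is mine.

```python
import numpy as np

def softplus(x):
    """Softplus: log(1 + e^x), computed as log(e^0 + e^x) for stability."""
    return np.logaddexp(0.0, x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-20.0, 0.0, 20.0])
print(softplus(x))  # close to ReLU far from 0, log(2) at 0
print(relu(x))
```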

We may use any of these functions to connect the output of one layer to the input of the next layer in order to make the neural network non-linear. This is the secret power of Neural Networks. With non-linearities at each layer, they are able to approximate very complex functions and deal with all sorts of inputs and outputs.
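As a small illustration of what a non-linearity between layers buys you, here is a tiny two-layer network with hand-picked (not learned) weights that computes XOR, something a single linear node cannot do. All names and values are my own choices for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hand-picked parameters: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def tiny_network(X):
    """Linear layer, ReLU, then a second linear layer."""
    hidden = relu(X @ W1 + b1)
    return hidden @ W2 + b2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(tiny_network(X))  # [0. 1. 1. 0.]
```

Remove the `relu` call and the network falls back to a single linear function, which cannot produce this output.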

In the next post we’ll look more into the mathematics part of Deep Learning.
As always, feedback and contributions are highly appreciated.
