A Deep Dive into Activation Functions: A Comprehensive Guide for Neural Network Beginners
If you're just getting started with neural networks, activation functions might be difficult to understand at first. But trust me when I say that understanding them is important if you want to build powerful neural networks.
But before we dive in, let’s quickly go over the basic elements of a neural network architecture. If you’re already familiar with how neural networks work, feel free to skip ahead to the next section.
Neural Networks Architecture
A neural network consists of layers of linked nodes called neurons, which process and transmit information through weighted connections known as synapses.
Each neuron takes inputs from the neurons in the preceding layer, applies an activation function to the weighted sum of those inputs, and then passes the output on to the next layer.
But wait, there’s more to a neural network than just neurons! There are also these other elements called Input Layer, Hidden Layer, and Output Layer. So, what do these do?
The Input Layer just takes in raw data from the domain. There's no computation done here; the nodes simply pass the information (also known as features) on to the next layer, the Hidden Layer.
The Hidden Layer is where all the computation happens. It takes in the features from the input layer and does all sorts of fancy math on them before passing the results on to the Output Layer.
The Output Layer is the network’s last layer. It uses all of the information gained from the Hidden Layer and produces the final value.
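To make this concrete, here is a tiny sketch (my own illustration, not from the original post) of a single neuron in Python with NumPy: it computes the weighted sum of its inputs, adds a bias, and passes the result through an activation function.

```python
import numpy as np

def neuron_forward(inputs, weights, bias, activation):
    """One neuron: weighted sum of its inputs plus a bias, then an activation."""
    z = np.dot(weights, inputs) + bias    # linear combination of the inputs
    return activation(z)                  # non-linear transformation

# Example: three input features passed through one neuron that uses ReLU
x = np.array([0.5, -1.2, 3.0])            # made-up feature values
w = np.array([0.4, 0.1, -0.6])            # made-up weights
print(neuron_forward(x, w, bias=0.2, activation=lambda z: np.maximum(0.0, z)))
```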
You may be asking why an activation function is required in the first place. Why can't the neurons simply do their computations and pass the results on to the next neuron? What's the point of an activation function anyway?
The Role of Activation Functions in Neural Networks
Each neuron in the network receives input from other neurons, and then it does some math with that input to generate an output. A neuron’s output can then be utilized as input for other neurons in the network.
Yet without activation functions, the neurons would just be doing boring linear math with the inputs. This means that no matter how many layers of neurons we add to the network, it would still be limited in what it can learn because the output would always be a simple linear combination of the inputs.
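To see this concretely, here is a small NumPy sketch (my own illustration, with random placeholder weights) showing that two stacked linear layers are exactly equivalent to a single linear layer, and that inserting a non-linearity breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights of a first "layer"
W2 = rng.normal(size=(2, 4))   # weights of a second "layer"
x = rng.normal(size=3)         # an arbitrary input vector

# Two stacked linear layers with no activation in between...
two_linear = W2 @ (W1 @ x)
# ...compute exactly the same thing as one linear layer with weights W2 @ W1.
single_linear = (W2 @ W1) @ x
print(np.allclose(two_linear, single_linear))   # True: extra depth adds nothing here

# With a ReLU between the layers, the computation is no longer a single
# matrix multiplication, so the network can represent non-linear mappings.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(two_linear, with_relu)                    # generally different outputs
```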
Activation functions come to the rescue by introducing non-linearity into the network. This means that a neuron's output can be more complex than a simple linear sum of its inputs. By adding non-linearity, the network can model more complex relationships between the inputs and outputs, allowing it to discover more interesting and valuable patterns.
So, in short, activation functions are like the secret sauce that makes neural networks more powerful by introducing non-linearity and allowing them to learn complex patterns.
Breaking Down the Math: Understanding the Different Types of Activation Functions
Now, let's talk about activation functions. We can categorize these functions into three types: binary, linear, and non-linear.
Binary functions are basic and can only output one of two possible values, whereas linear functions return values based on a linear equation.
Non-linear functions, such as the sigmoid function, Tanh, ReLU and ELUs, provide results that are not proportional to the input. As a result, each type of activation function has its own unique characteristics that can be useful in different scenarios.
Sigmoid / Logistic Activation Function
Let me break down the Sigmoid activation function for you. This function takes any number as input and gives us an output between 0 and 1. The more positive the input, the closer the output is to 1; the more negative the input, the closer the output is to 0.
It has an S-shaped curve, making it well suited to binary classification problems. For example, if we're building a model to predict whether or not an email is spam, we could use the Sigmoid function to produce a probability score between 0 and 1. If the score is above 0.5, we classify the email as spam; if it's below 0.5, we say it's not spam.
The function is defined as follows:
f(x) = 1 / (1 + e^(-x))
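If you prefer code to formulas, here is a minimal NumPy sketch of the sigmoid (my own example, with made-up scores), including the 0.5 spam threshold mentioned above:

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-4.0, -0.5, 0.0, 2.0, 6.0])   # made-up raw scores
probs = sigmoid(scores)
print(probs)         # close to 0 for negative inputs, close to 1 for positive ones
print(probs > 0.5)   # a simple "spam / not spam" decision at the 0.5 threshold
```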
Now, there is a drawback to the Sigmoid function: it suffers from the vanishing gradient problem. When the input becomes very large or very small, the gradient of the function becomes tiny, which slows down learning in deep neural networks. Still, the Sigmoid function is used in certain places, such as the output layer for binary classification, or for multi-label classification where we want an independent probability for each class.
Tanh Function (Hyperbolic Tangent)
So, the Tanh function, also known as the hyperbolic tangent function, is another type of activation function used in neural networks. It takes any real number as input and outputs a value between -1 and 1.
Here's the thing: the Tanh function is very similar to the Sigmoid function, but its output is centered around zero. That means when the input is close to zero, the output will be close to zero as well, and negative inputs produce negative outputs. This zero-centering can be useful when dealing with data that has both negative and positive values, because it helps the network learn more effectively.
The function is defined as follows:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
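A quick NumPy sketch (my own example values) showing the zero-centered output range of tanh:

```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
# Output is symmetric around 0: roughly [-0.995, -0.462, 0.0, 0.462, 0.995]
print(np.tanh(x))
```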
However, like the Sigmoid function, the Tanh function can also suffer from the vanishing gradient problem as the input becomes very large or very small. Yet, the Tanh function is still commonly used in neural networks, especially in the hidden layers of the network.
Rectified Linear Unit / ReLU Function
Rectified Linear Unit, or ReLU, is a common activation function that is both simple and powerful. It takes any input value and returns it if it is positive or 0 if it is negative. In other words, ReLU sets all negative values to 0 and keeps all positive values as they are.
The function is defined as follows:
f(x) = max(0, x)
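In code, ReLU is a one-liner; here is a minimal NumPy sketch with example values of my own:

```python
import numpy as np

def relu(x):
    """Keep positive values, clamp negative values to 0."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.1, 0.0, 0.1, 2.0])))   # -> [0, 0, 0, 0.1, 2]
```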
One of the benefits of using ReLU is that it is computationally efficient and simple to implement. It is also known for helping to mitigate the vanishing gradient problem that can occur in deep neural networks.
However, ReLU can suffer from a problem known as the "dying ReLU" problem. If a neuron's input ends up negative for nearly every example, it always outputs 0, and because the gradient is also 0 in that region, its weights stop being updated. The neuron effectively "dies" and stops learning.
Leaky ReLU Function
The Leaky ReLU function is an extension of the ReLU function that attempts to solve the "dying ReLU" problem. Instead of setting all negative values to 0, Leaky ReLU multiplies them by a small slope (for example, 0.01 times the input value), producing a small non-zero output. This guarantees that even when a neuron receives negative input, it still has a gradient and can keep learning.
The function is defined as follows:
f(x) = x if x > 0, and f(x) = 0.01x if x ≤ 0 (where 0.01 is the chosen slope for negative inputs)
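Here is a minimal NumPy sketch of Leaky ReLU (my own example, using 0.01 as the slope):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs are scaled by a small slope instead of zeroed."""
    return np.where(x > 0, x, slope * x)

print(leaky_relu(np.array([-2.0, -0.1, 0.0, 0.1, 2.0])))   # -> [-0.02, -0.001, 0, 0.1, 2]
```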
Leaky ReLU has been shown to work well in many different types of problems and is a popular choice among deep learning practitioners.
Parametric ReLU Function / PReLU
The Parametric ReLU (PReLU) function is another extension of the ReLU function. Instead of using a fixed value for the negative part of the function, PReLU makes it a learnable parameter. This means that during training, the network can learn the optimal value for the negative portion of the function.
The function is defined as follows:
f(x) = x if x > 0, and f(x) = a·x if x ≤ 0
where "a" is the learnable slope parameter for negative values.
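Below is a sketch of just the forward computation (my own illustration); in a real framework, "a" would be a trainable parameter updated by backpropagation rather than a fixed argument:

```python
import numpy as np

def prelu(x, a):
    """PReLU forward pass: `a` is the learnable slope for negative inputs."""
    return np.where(x > 0, x, a * x)

# During training, `a` starts from some initial value (commonly 0.25) and is
# adjusted by gradient descent just like the weights.
print(prelu(np.array([-2.0, 0.5, 3.0]), a=0.25))   # -> [-0.5, 0.5, 3]
```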
PReLU has been shown to work well in some types of problems, particularly in image recognition tasks.
Exponential Linear Units (ELUs) Function
Another activation function that has gained prominence in recent years is the Exponential Linear Unit, or ELU. Like ReLU, ELUs aim to address the vanishing gradient problem, and because they have a non-zero gradient for negative inputs, they also help prevent the "dying ReLU" problem.
The formula for Exponential Linear Units (ELUs) is:
f(x) = x if x > 0, and f(x) = α(e^x - 1) if x ≤ 0
where α (alpha) is a hyperparameter that controls the degree of negative saturation.
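A minimal NumPy sketch of ELU (my own example, using the common default α = 1.0):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Identity for positive inputs; smooth saturation towards -alpha for negative inputs."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))   # negative inputs saturate towards -1.0
```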
ELUs have been shown to improve both training and test accuracy compared to other activation functions like ReLU and tanh. They are particularly useful in deep neural networks that require a high level of accuracy.
Softmax Function
The softmax function is often used as the activation function in the output layer of a neural network that needs to classify inputs into multiple categories. It takes as input a vector of real numbers and returns a probability distribution that represents the likelihood of each category.
The formula for softmax is:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
where x is the input vector and i and j are indices that range from 1 to the number of categories.
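Here is a minimal NumPy sketch of softmax (my own example scores); subtracting the maximum score first is a standard trick for numerical stability:

```python
import numpy as np

def softmax(x):
    """Turn a vector of real-valued scores into a probability distribution."""
    shifted = x - np.max(x)          # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # made-up raw scores for three classes
probs = softmax(logits)
print(probs, probs.sum())             # roughly [0.659, 0.242, 0.099], sums to 1.0
```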
Softmax is useful for multi-class classification problems because it ensures that the output probabilities sum to 1, making it easy to interpret the results. It is also differentiable, which allows it to be used in backpropagation during training.
Swish
The Swish function is a relatively new activation function that has gained attention in the deep learning community for its improved performance over other activation functions like ReLU.
The formula for Swish is:
f(x) = x · sigmoid(βx) = x / (1 + e^(-βx))
where β (beta) is a hyperparameter that controls the degree of saturation.
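A minimal NumPy sketch of Swish (my own example, with β = 1, in which case the function is also known as SiLU):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

print(swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
# Unlike ReLU, small negative inputs give small non-zero outputs.
```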
Swish is similar to ReLU in that it is a simple function that can be computed efficiently. However, it has a smooth curve, which helps prevent the "dying ReLU" problem. Swish has been shown to outperform ReLU on a variety of deep learning tasks.
Which One to Choose for Your AI Model?
First things first: you need to match your activation function to the type of prediction problem you're solving, that is, to the type of variable you're predicting. A good default is to start with the ReLU activation function and move on to other activation functions if you don't get the results you want.
Here are some guidelines to keep in mind:
- The ReLU activation function should generally be used only in the hidden layers.
- Sigmoid/Logistic and Tanh functions are best avoided in the hidden layers of deep networks, because their saturating gradients can slow down training (the vanishing gradient problem).
- The Swish function has been reported to help in very deep networks (roughly 40 layers or more).
The activation function for your output layer is determined by the type of prediction problem you're solving. Here are some ground rules to remember:
- Regression — Linear Activation Function
- Binary Classification — Sigmoid/Logistic Activation Function
- Multiclass Classification — Softmax
- Multilabel Classification — Sigmoid
The activation function used in the hidden layers is typically chosen based on the type of neural network architecture (a small end-to-end sketch follows this list). For example:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.
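Putting these rules of thumb together, here is a minimal sketch of how the choices might look in practice, assuming you're using TensorFlow's Keras API; the layer sizes and the 10-class, 20-feature setup are placeholders of my own:

```python
import tensorflow as tf

# A small classifier for a 10-class problem with 20 input features (placeholder sizes).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),     # ReLU in the hidden layers
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # softmax output for multiclass classification
])

# For binary classification you would instead end with:
#   tf.keras.layers.Dense(1, activation="sigmoid")
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```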
Remember, choosing the right activation function can make all the difference in the accuracy of your predictions. So, go ahead and choose the right activation function for your neural network, and see the magic happen!
This is the end of today’s post. Thanks for reading!
Never miss a great story again by following me. Happy learning to everyone!