Activation Functions
Posted: 07 Jul 2025.
Last modified: 07 Jul 2025.
Activation functions are mathematical operations that determine the output of a neural network node, introducing
non-linearity and enabling neural networks to learn complex patterns.
Wikipedia has a list of common activation functions.
I’ve taken some notes on the activation functions, along with their equations and key
characteristics:
Sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
- Squashes inputs to the range (0, 1).
- Historically used for binary classification (output layer).
- Suffers from vanishing gradients for extreme inputs.
Tanh: \(\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\)
- Outputs range (-1, 1) (zero-centered).
- Still prone to vanishing gradients, but often preferred over sigmoid for hidden layers.
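To make the shapes concrete, here is a minimal NumPy sketch of sigmoid and tanh; the sample inputs are arbitrary, chosen just to show the saturation at the extremes:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))   # values approach 0 and 1 at the extremes
print(np.tanh(x))   # zero-centered, values approach -1 and 1
```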
ReLU: \(\mathrm{ReLU}(x) = \max(0, x)\)
- Most widely used for hidden layers.
- Computationally efficient and avoids vanishing gradients for \(x > 0\).
- Dying ReLU problem: neurons can become permanently inactive for \(x \leq 0\).
Leaky ReLU: \(f(x) = x\) for \(x > 0\), \(\alpha x\) for \(x \leq 0\)
- Introduces a small slope \(\alpha\) for \(x \leq 0\) to mitigate dying neurons.
- \(\alpha\) is typically a small fixed value (commonly 0.01).
PReLU (Parametric ReLU): \(f(x) = x\) for \(x > 0\), \(\alpha x\) for \(x \leq 0\)
- Similar to Leaky ReLU, but \(\alpha\) is learned during training.
ELU: \(f(x) = x\) for \(x > 0\), \(\alpha (e^{x} - 1)\) for \(x \leq 0\)
- Smooth gradient for \(x \leq 0\) (uses \(\alpha\), often set to 1).
- Reduces bias shift and can outperform ReLU in deeper networks.
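A small NumPy sketch of the ReLU family follows; the defaults \(\alpha = 0.01\) for Leaky ReLU and \(\alpha = 1\) for ELU are common choices, not anything prescribed by the notes above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small fixed slope for x <= 0; in PReLU, alpha would be a learned parameter.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth negative branch alpha * (e^x - 1) instead of a straight line.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```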
Softmax: \(\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\)
- Used in output layers for multi-class classification.
- Converts logits to probabilities (sums to 1 across classes).
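A minimal softmax sketch; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the output:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the result is unchanged.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities that sum to 1
```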
Swish: \(f(x) = x \cdot \sigma(\beta x)\), where \(\sigma\) is the sigmoid
- Smooth, non-monotonic function (outperforms ReLU in some deep networks).
- \(\beta\) can be learned or fixed.
GELU: \(f(x) = x \cdot \Phi(x)\), where \(\Phi\) is the standard normal CDF
- Approximated as \(x \cdot \sigma(1.702x)\).
- Used in models like Transformers (e.g., GPT).
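A sketch of Swish and the sigmoid-based GELU approximation, fixing \(\beta = 1\) for Swish (a common choice when it is not learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta can also be treated as a learnable parameter.
    return x * sigmoid(beta * x)

def gelu_approx(x):
    # Sigmoid approximation of GELU: x * sigmoid(1.702 * x).
    return x * sigmoid(1.702 * x)

x = np.linspace(-3, 3, 7)
print(swish(x))
print(gelu_approx(x))
```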
Linear (identity): \(f(x) = x\)
- Used in regression tasks (output layer).
- Applying it to hidden layers reduces the network to a linear model.
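The collapse to a linear model is easy to verify: two stacked layers with an identity activation are equivalent to a single linear layer (the random weights below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)        # identity activation between layers
one_layer = (W2 @ W1) @ x         # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True
```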
Softplus: \(f(x) = \ln(1 + e^{x})\)
- Smooth approximation of ReLU.
- Differentiable everywhere, but less computationally efficient.
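Softplus in a couple of lines; `np.logaddexp(0, x)` evaluates \(\ln(1 + e^{x})\) in a numerically stable way:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably via logaddexp.
    return np.logaddexp(0.0, x)

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))  # close to 0 for very negative x, close to x for large x
```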
SELU: \(f(x) = \lambda x\) for \(x > 0\), \(\lambda \alpha (e^{x} - 1)\) for \(x \leq 0\)
- Self-normalizing properties for deep networks (when paired with proper initialization).
- Pushes activations toward zero mean and unit variance across layers.
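A sketch of SELU using the fixed constants \(\alpha \approx 1.6733\) and \(\lambda \approx 1.0507\) from the original SELU formulation:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805  # fixed SELU constants

def selu(x):
    # Scaled ELU: lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-2.0, 0.0, 2.0])))
```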
Usage Notes:
- ReLU variants (Leaky ReLU, PReLU, ELU) are popular for hidden layers.
- Softmax is standard for classification outputs.
- Swish and GELU are increasingly used in modern architectures (e.g., Transformers).
- SELU requires careful weight initialization (e.g., LeCun normal) to maintain self-normalization.
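As a rough illustration of that last point, the sketch below stacks a few SELU layers with LeCun normal initialization (standard deviation \(1/\sqrt{\text{fan-in}}\)) and checks that the activation statistics stay close to zero mean and unit variance; the depth and layer width are arbitrary:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    # Same definition as in the sketch above, repeated so this snippet runs on its own.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(42)
x = rng.standard_normal((10_000, 256))  # batch of standardized inputs

for layer in range(8):
    fan_in = x.shape[1]
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, 256))  # LeCun normal
    x = selu(x @ W)
    print(f"layer {layer}: mean={x.mean():+.3f}, std={x.std():.3f}")
```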