Activation Functions
Posted: 07 Jul 2025.
Last modified: 07 Jul 2025.
Activation functions are mathematical operations that determine the output of a neural network node, introducing
non-linearity and enabling neural networks to learn complex patterns.
Wikipedia has a list of common activation functions.
I’ve taken some notes on the activation functions, along with their equations and key
characteristics:
Sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
- Squashes inputs to the range (0, 1).
- Historically used for binary classification (output layer).
- Suffers from vanishing gradients for extreme inputs.
Tanh: \(\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\)
- Outputs range (-1, 1) (zero-centered).
- Still prone to vanishing gradients, but often preferred over sigmoid for hidden layers.
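To make the shapes concrete, here is a minimal NumPy sketch of sigmoid and tanh; the sample inputs are arbitrary, chosen just to show the saturation at the extremes:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))   # values approach 0 and 1 at the extremes
print(np.tanh(x))   # zero-centered, values approach -1 and 1
```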
ReLU: \(\mathrm{ReLU}(x) = \max(0, x)\)
- Most widely used for hidden layers.
- Computationally efficient and avoids vanishing gradients for \(x > 0\).
- Dying ReLU problem: neurons can become permanently inactive for \(x \leq 0\).
Leaky ReLU: \(f(x) = x\) for \(x > 0\), \(\alpha x\) for \(x \leq 0\)
- Introduces a small slope \(\alpha\) for \(x \leq 0\) to mitigate dying neurons.
- \(\alpha\) is typically a small fixed value (commonly 0.01).
PReLU (Parametric ReLU): \(f(x) = x\) for \(x > 0\), \(\alpha x\) for \(x \leq 0\)
- Similar to Leaky ReLU, but \(\alpha\) is learned during training.
ELU: \(f(x) = x\) for \(x > 0\), \(\alpha (e^{x} - 1)\) for \(x \leq 0\)
- Smooth gradient for \(x \leq 0\) (uses \(\alpha\), often set to 1).
- Reduces bias shift and can outperform ReLU in deeper networks.
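A small NumPy sketch of the ReLU family follows; the defaults \(\alpha = 0.01\) for Leaky ReLU and \(\alpha = 1\) for ELU are common choices, not anything prescribed by the notes above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small fixed slope for x <= 0; in PReLU, alpha would be a learned parameter.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth negative branch alpha * (e^x - 1) instead of a straight line.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```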
Softmax: \(\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\)
- Used in output layers for multi-class classification.
- Converts logits to probabilities (sums to 1 across classes).
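A minimal softmax sketch; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the output:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the result is unchanged.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities that sum to 1
```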
Swish: \(f(x) = x \cdot \sigma(\beta x)\), where \(\sigma\) is the sigmoid
- Smooth, non-monotonic function (outperforms ReLU in some deep networks).
- \(\beta\) can be learned or fixed.
GELU: \(f(x) = x \cdot \Phi(x)\), where \(\Phi\) is the standard normal CDF
- Approximated as \(x \cdot \sigma(1.702x)\).
- Used in models like Transformers (e.g., GPT).
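A sketch of Swish and the sigmoid-based GELU approximation, fixing \(\beta = 1\) for Swish (a common choice when it is not learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta can also be treated as a learnable parameter.
    return x * sigmoid(beta * x)

def gelu_approx(x):
    # Sigmoid approximation of GELU: x * sigmoid(1.702 * x).
    return x * sigmoid(1.702 * x)

x = np.linspace(-3, 3, 7)
print(swish(x))
print(gelu_approx(x))
```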
Linear (identity): \(f(x) = x\)
- Used in regression tasks (output layer).
- Applying it to hidden layers reduces the network to a linear model.
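The collapse to a linear model is easy to verify: two stacked layers with an identity activation are equivalent to a single linear layer (the random weights below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)        # identity activation between layers
one_layer = (W2 @ W1) @ x         # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True
```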
Softplus: \(f(x) = \ln(1 + e^{x})\)
- Smooth approximation of ReLU.
- Differentiable everywhere, but less computationally efficient.
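Softplus in a couple of lines; `np.logaddexp(0, x)` evaluates \(\ln(1 + e^{x})\) in a numerically stable way:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably via logaddexp.
    return np.logaddexp(0.0, x)

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))  # close to 0 for very negative x, close to x for large x
```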
SELU: \(f(x) = \lambda x\) for \(x > 0\), \(\lambda \alpha (e^{x} - 1)\) for \(x \leq 0\)
- Self-normalizing properties for deep networks (when paired with proper initialization).
- Pushes activations toward zero mean and unit variance across layers.
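A sketch of SELU using the fixed constants \(\alpha \approx 1.6733\) and \(\lambda \approx 1.0507\) from the original SELU formulation:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805  # fixed SELU constants

def selu(x):
    # Scaled ELU: lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-2.0, 0.0, 2.0])))
```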
Usage Notes:
- ReLU variants (Leaky ReLU, PReLU, ELU) are popular for hidden layers.
- Softmax is standard for classification outputs.
- Swish and GELU are increasingly used in modern architectures (e.g., Transformers).
- SELU requires careful weight initialization (e.g., LeCun normal) to maintain self-normalization.
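As a rough illustration of that last point, the sketch below stacks a few SELU layers with LeCun normal initialization (standard deviation \(1/\sqrt{\text{fan-in}}\)) and checks that the activation statistics stay close to zero mean and unit variance; the depth and layer width are arbitrary:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    # Same definition as in the sketch above, repeated so this snippet runs on its own.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(42)
x = rng.standard_normal((10_000, 256))  # batch of standardized inputs

for layer in range(8):
    fan_in = x.shape[1]
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, 256))  # LeCun normal
    x = selu(x @ W)
    print(f"layer {layer}: mean={x.mean():+.3f}, std={x.std():.3f}")
```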