Activation Functions

🧠 1. Why Activation Functions Matter

Without an activation function, a two-layer network computes:

\[f(x) = W_2(W_1 x)\]

This simplifies to:

\[f(x) = (W_2 W_1)x\]

❌ Still linear: stacking layers adds no expressive power.
❌ Cannot model non-linear patterns.

With an activation function $a$ applied between the layers:

\[f(x) = W_2 a(W_1 x)\]

✅ Introduces non-linearity
✅ Lets stacked layers represent functions that a single linear map cannot
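The collapse can be checked numerically. The sketch below uses plain Python lists; the weights `W1`, `W2`, the input `x`, and the helper names are hypothetical and chosen only for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]

def relu_vec(v):
    """Apply ReLU element-wise."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
x = [1.0, 2.0]
neg_x = [-xi for xi in x]

stacked = matvec(W2, matvec(W1, x))      # W2 (W1 x)
collapsed = matvec(matmul(W2, W1), x)    # (W2 W1) x
print(stacked, collapsed)                # identical: two linear layers act as one

def net(v):
    """Two layers with ReLU in between: W2 ReLU(W1 v)."""
    return matvec(W2, relu_vec(matvec(W1, v)))

# Any linear map satisfies f(-x) = -f(x); the ReLU network breaks this.
print(net(x), net(neg_x))
```

Here both linear versions give `[-2.0]`, while the ReLU network gives `[1.0]` for `x` and `[3.0]` for `-x`, which no single matrix could produce.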


🔵 2. Sigmoid Function

📌 Definition

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Range:

\[0 < \sigma(x) < 1\]

🧮 Full Derivative Derivation

Start:

\[\sigma(x) = (1 + e^{-x})^{-1}\]

Differentiate:

\[\frac{d}{dx}\sigma(x) = -1(1+e^{-x})^{-2}(-e^{-x})\]

\[= \frac{e^{-x}}{(1+e^{-x})^2}\]

Rewrite, using $e^{-x} = (1+e^{-x}) - 1$:

\[= \frac{1}{1+e^{-x}} \cdot \frac{(1+e^{-x}) - 1}{1+e^{-x}}\]

\[= \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right)\]

Therefore:

\[\sigma'(x) = \sigma(x)(1-\sigma(x))\]
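One way to sanity-check the closed form is to compare it with a finite-difference estimate. This is a minimal sketch; the helper name `central_difference` and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed form derived above: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative estimate (hypothetical helper, h is an assumption)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, 0.0, 0.5, 4.0):
    analytic = sigmoid_grad(x)
    numeric = central_difference(sigmoid, x)
    print(f"x={x:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to several decimal places, and the derivative peaks at 0.25 when $x = 0$.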

⚠️ Vanishing Gradient

If $x \to +\infty$:

\[\sigma'(x) \to 0\]

If $x \to -\infty$:

\[\sigma'(x) \to 0\]

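🚨 Since $\sigma'(x) \le \tfrac{1}{4}$ everywhere, backpropagation through many sigmoid layers multiplies many small factors, so gradients vanish in deep networks.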
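To get a feel for the scale, the sketch below multiplies sigmoid derivatives along a chain of layers. It is a simplification: it assumes every weight equals 1 and every unit sees the same pre-activation, and the function name `chained_gradient` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_gradient(num_layers, pre_activation=0.0):
    """Product of sigmoid derivatives across `num_layers` units.

    Assumes all weights are 1 and all units share one pre-activation,
    so this shows only the activation's contribution.
    """
    return sigmoid_grad(pre_activation) ** num_layers

for n in (1, 5, 10, 20):
    best_case = chained_gradient(n)          # x = 0, where the derivative peaks
    saturated = chained_gradient(n, 4.0)     # a moderately saturated unit
    print(f"layers={n:2d}  best_case={best_case:.3e}  saturated={saturated:.3e}")
```

Even in the best case the factor after 10 layers is $0.25^{10}$, a little under $10^{-6}$.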


⚠️ Not Zero-Centered

\[\sigma(x) > 0\]

Every output is positive, so the gradients for all weights feeding a downstream neuron share one sign → inefficient zig-zag gradient updates.
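The sign argument can be made concrete. For a neuron $z = \sum_i w_i a_i$, the gradient with respect to $w_i$ is $\frac{\partial L}{\partial z} a_i$. The sketch below uses hypothetical pre-activations and a hypothetical upstream gradient `delta` to compare sigmoid inputs with tanh inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weight_gradients(inputs, upstream_delta):
    """dL/dw_i for z = sum_i w_i * inputs[i], given upstream_delta = dL/dz."""
    return [upstream_delta * a for a in inputs]

# Hypothetical values, chosen only for illustration.
prev_pre_activations = [-2.0, 0.3, 1.5]
delta = -0.7

sigmoid_inputs = [sigmoid(z) for z in prev_pre_activations]
tanh_inputs = [math.tanh(z) for z in prev_pre_activations]

grads_sigmoid = weight_gradients(sigmoid_inputs, delta)
grads_tanh = weight_gradients(tanh_inputs, delta)

print("sigmoid inputs:", [f"{g:+.3f}" for g in grads_sigmoid])  # all one sign
print("tanh inputs:   ", [f"{g:+.3f}" for g in grads_tanh])     # mixed signs
```

With sigmoid inputs every weight is pushed in the same direction on a given step, which is what forces the zig-zag path.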


🟢 3. Tanh Function

📌 Definition

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

Relation:

\[\tanh(x) = 2\sigma(2x) - 1\]

Range:

\[-1 < \tanh(x) < 1\]

🧮 Derivative

\[\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)\]

✔ Zero-centered
❌ Still saturates for large $|x|$
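Both the sigmoid relation and the saturation claim can be checked with the standard library. A minimal sketch; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed through the relation 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative 1 - tanh^2(x)."""
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.0, 0.5, 3.0, 5.0):
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.6f}  "
          f"via_sigmoid={tanh_via_sigmoid(x):+.6f}  grad={tanh_grad(x):.6f}")
```

The gradient equals 1 at the origin (four times sigmoid's peak of 0.25) but still shrinks toward 0 as $|x|$ grows.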


🔴 4. ReLU (Rectified Linear Unit)

📌 Definition

\[\text{ReLU}(x) = \max(0,x)\]

Piecewise:

\[\text{ReLU}(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}\]

🧮 Derivative

\[\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}\]

The derivative is undefined at $x = 0$; implementations typically use 0 there.

✔ No saturation for $x>0$
✔ Fast computation
✔ Sparse activation
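A minimal scalar sketch of ReLU and its derivative, with a small hypothetical batch of pre-activations to show the sparsity. Returning 0 for the derivative at $x = 0$ is a convention assumed here.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activations, chosen only for illustration.
pre_activations = [-1.5, -0.2, 0.0, 0.7, 3.0]

outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]
sparsity = sum(1 for y in outputs if y == 0.0) / len(outputs)

print("outputs: ", outputs)
print("grads:   ", grads)
print("fraction of zero outputs:", sparsity)
```

Positive inputs pass through with gradient exactly 1, and three of the five units output exactly zero.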


⚠️ Dead ReLU

If a neuron's pre-activation is negative for every input, it always outputs 0 and its gradient is:

\[\nabla = 0\]

The neuron's weights stop updating ❌
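The sketch below illustrates a dead unit for a single-input neuron $y = \text{ReLU}(wx + b)$. The dataset, the weight, and the strongly negative bias are hypothetical values picked so that the pre-activation is negative for every input.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def neuron_gradients(w, b, x, upstream=1.0):
    """Gradients of y = ReLU(w * x + b) with respect to (w, b), scaled by `upstream`."""
    z = w * x + b
    local = relu_grad(z) * upstream
    return local * x, local

# Hypothetical data and parameters, chosen only for illustration.
inputs = [0.1, 0.5, 1.0, 2.0]
w, dead_bias, healthy_bias = 0.5, -10.0, 0.0

dead_dw = sum(neuron_gradients(w, dead_bias, x)[0] for x in inputs)
dead_db = sum(neuron_gradients(w, dead_bias, x)[1] for x in inputs)
healthy_dw = sum(neuron_gradients(w, healthy_bias, x)[0] for x in inputs)

print("dead neuron:    dw =", dead_dw, " db =", dead_db)
print("healthy neuron: dw =", healthy_dw)
```

Because both gradients are exactly zero over the whole dataset, no gradient step can move the dead neuron back into the active region.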


🟡 5. Leaky ReLU

📌 Definition

\[f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}\]

Here $\alpha$ is a small positive constant (for example, 0.01).

🧮 Derivative

\[f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}\]

✔ Prevents dead neurons
✔ Keeps small gradient for negative region
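A scalar sketch of Leaky ReLU and its derivative. The default `alpha=0.01` is an assumption (a commonly used value), not something fixed by the definition above.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU; alpha=0.01 is an assumed, commonly used default."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-10.0, -2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  f(x)={leaky_relu(x):+.4f}  f'(x)={leaky_relu_grad(x):.4f}")
```

However negative the input gets, the gradient stays at `alpha` rather than dropping to zero, so the unit can still recover.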


🟣 6. ELU (Exponential Linear Unit)

📌 Definition

\[\text{ELU}(x) = \begin{cases} x & x \ge 0 \\ \alpha(e^x - 1) & x < 0 \end{cases}\]

Here $\alpha > 0$ sets the value $-\alpha$ that the output approaches for large negative inputs.

🧮 Derivative

For $x \ge 0$:

\[\frac{d}{dx}\text{ELU}(x) = 1\]

For $x < 0$:

\[\frac{d}{dx}\text{ELU}(x) = \alpha e^x\]

✔ Outputs closer to zero-centered, since negative values are allowed
✔ Smooth negative region
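A scalar sketch of ELU and its derivative. The default `alpha=1.0` is an assumption, and `math.expm1` is used for $e^x - 1$ because it is more accurate near zero.

```python
import math

def elu(x, alpha=1.0):
    """ELU; alpha=1.0 is an assumed default."""
    return x if x >= 0 else alpha * math.expm1(x)  # expm1(x) = e^x - 1

def elu_grad(x, alpha=1.0):
    """Derivative: 1 for x >= 0, alpha * e^x for x < 0."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-50.0, -1.0, -1e-6, 0.0, 2.0):
    print(f"x={x:+.6f}  elu={elu(x):+.6f}  grad={elu_grad(x):.6f}")
```

With `alpha=1.0` the gradient just left of zero is close to 1, so the two pieces join smoothly, and the output levels off near `-alpha` for very negative inputs.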


📊 7. Comparison Table

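| Activation | Saturation | Zero-Centered | Vanishing Gradient |
|------------|------------|---------------|--------------------|
| Sigmoid    | Yes        | No            | Severe             |
| Tanh       | Yes        | Yes           | Moderate           |
| ReLU       | No (+side) | No            | Partial            |
| Leaky ReLU | No         | No            | Minimal            |
| ELU        | Mild       | Closer        | Minimal            |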
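The comparison can be probed numerically by evaluating each derivative at a few sample points. This is a rough sketch: the sample points, `alpha = 0.01` for Leaky ReLU, and `alpha = 1.0` for ELU are all assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Derivative of each activation, using the formulas from the sections above.
GRADIENTS = {
    "Sigmoid": lambda x: sigmoid(x) * (1.0 - sigmoid(x)),
    "Tanh": lambda x: 1.0 - math.tanh(x) ** 2,
    "ReLU": lambda x: 1.0 if x > 0 else 0.0,
    "Leaky ReLU": lambda x: 1.0 if x > 0 else 0.01,    # alpha = 0.01 assumed
    "ELU": lambda x: 1.0 if x >= 0 else math.exp(x),   # alpha = 1.0 assumed
}

points = (-5.0, -1.0, 1.0, 5.0)  # hypothetical sample inputs

print(f"{'':<12}" + "  ".join(f"x={x:+.0f}".rjust(7) for x in points))
for name, grad in GRADIENTS.items():
    row = "  ".join(f"{grad(x):7.4f}" for x in points)
    print(f"{name:<12}{row}")
```

Sigmoid and tanh gradients shrink on both sides, while the ReLU family keeps a gradient of 1 for every positive input.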


🎯 8. Final Practical Advice

✅ Use ReLU by default
✅ Try Leaky ReLU / ELU if ReLU units die or training stalls
❌ Avoid Sigmoid/Tanh in deep hidden layers


🚀 Core Insight

Activation functions control:

\[\text{Non-linearity}\]

\[\text{Gradient flow}\]

\[\text{Optimization stability}\]

They are fundamental to deep learning success.

This post is licensed under CC BY 4.0 by the author.