Weight Initialization

🔥 Weight Initialization in Deep Neural Networks


1️⃣ Small Gaussian Random Initialization

Initialize weights as:

\[W_{ij} \sim \mathcal{N}(0, \sigma^2)\]

with

\[\sigma = 0.01\]

In practice:

import numpy as np
W = 0.01 * np.random.randn(d_in, d_out)  # stored as (d_in, d_out); a layer then computes y = x @ W

❗ Problem in Deep Networks

Let:

\[y = Wx\]

where

\[x \in \mathbb{R}^{d_{in}}, \quad W \in \mathbb{R}^{d_{out} \times d_{in}}\]

Each output unit:

\[y_i = \sum_{j=1}^{d_{in}} W_{ij} x_j\]

If weights are very small:

\[W_{ij} \approx 0\]

Then:

\[y_i \approx 0\]

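For tanh activation, the function is approximately linear near zero:

\[\tanh(z) \approx z, \qquad \tanh(0) = 0\]

so small pre-activations pass through as small activations. The local derivative is not the bottleneck, since it is close to 1 near zero:

\[\frac{\partial \tanh(z)}{\partial z} = 1 - \tanh^2(z) \approx 1 \quad \text{for } z \approx 0\]

The problem is that each layer scales the signal by roughly \(\sigma\sqrt{d_{in}}\) (for example, \(0.01 \cdot \sqrt{500} \approx 0.22\)), so activations shrink layer by layer:

\[x^{(l)} \to 0\]

The weight gradient of a layer is proportional to that layer's input:

\[\frac{\partial L}{\partial W^{(l+1)}} \propto x^{(l)}\]

So:

\[\frac{\partial L}{\partial W^{(l+1)}} \to 0\]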

🚨 No learning (vanishing gradients)
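To see the shrinkage numerically, here is a minimal NumPy sketch. The width (500), depth (10), batch size, and the helper name `tanh_forward_stds` are illustrative assumptions; only σ = 0.01 and the tanh activation come from the text above.

```python
import numpy as np

def tanh_forward_stds(sigma, d=500, depth=10, batch=1000, seed=0):
    """Std of the activations after each tanh layer, with weights ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, d))          # unit-variance inputs
    stds = []
    for _ in range(depth):
        W = sigma * rng.standard_normal((d, d))  # shape (d_in, d_out), applied as x @ W
        x = np.tanh(x @ W)
        stds.append(x.std())
    return stds

small_stds = tanh_forward_stds(sigma=0.01)
for layer, s in enumerate(small_stds, start=1):
    print(f"layer {layer:2d}: activation std = {s:.2e}")
```

With these settings each layer multiplies the signal scale by roughly 0.01 · √500 ≈ 0.22, so the printed standard deviations should drop by several orders of magnitude over ten layers.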


2️⃣ Large Gaussian Initialization

Now increase scale:

W = 5 * np.random.randn(d_in, d_out)  # std 5 instead of 0.01

Then:

\[y_i = \sum_{j=1}^{d_{in}} W_{ij} x_j\]

becomes very large in magnitude.

For tanh:

\[\tanh(z) \to \pm 1\]

Derivative:

\[\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)\]

If:

\[\tanh^2(z) \approx 1\]

Then:

\[1 - \tanh^2(z) \approx 0\]

The backpropagated gradient is multiplied by this near-zero derivative at every layer, so again:

\[\frac{\partial L}{\partial W} \to 0\]

🚨 Saturation → No learning
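The saturation can be checked on a single layer. This is a rough sketch, assuming unit-variance Gaussian inputs and a layer width of 500 (both are assumptions chosen for illustration); the 0.99 threshold for calling a unit "saturated" is also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 500, 500, 1000          # illustrative sizes (assumptions)

x = rng.standard_normal((batch, d_in))       # unit-variance inputs
W = 5 * rng.standard_normal((d_in, d_out))   # large-scale Gaussian init
z = x @ W                                    # pre-activations, std about 5 * sqrt(d_in)
a = np.tanh(z)

local_grad = 1 - a**2                        # tanh'(z) at every unit
saturated_fraction = np.mean(np.abs(a) > 0.99)

print(f"pre-activation std:      {z.std():.1f}")
print(f"fraction with |tanh|>.99: {saturated_fraction:.3f}")
print(f"mean local derivative:    {local_grad.mean():.4f}")
```

Almost every unit lands on the flat part of tanh, so the mean local derivative is close to zero even after one layer.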


3️⃣ Xavier Initialization (Glorot)

Goal:

Maintain

\[\text{Var}(y) = \text{Var}(x)\]

📌 Derivation

Start from:

\[y_i = \sum_{j=1}^{d_{in}} W_{ij} x_j\]

Variance:

\[\text{Var}(y_i) = \text{Var}\left(\sum_{j=1}^{d_{in}} W_{ij} x_j \right)\]

Assuming the terms are independent:

\[= \sum_{j=1}^{d_{in}} \text{Var}(W_{ij} x_j)\]

For independent, zero-mean \(W_{ij}\) and \(x_j\):

\[\text{Var}(W_{ij} x_j) = \text{Var}(W_{ij})\,\text{Var}(x_j)\]

and with all terms i.i.d.:

\[\text{Var}(y_i) = d_{in} \cdot \text{Var}(W) \cdot \text{Var}(x)\]

To preserve variance:

\[\text{Var}(y) = \text{Var}(x)\]

So:

\[d_{in} \cdot \text{Var}(W) = 1\]

Thus:

\[\text{Var}(W) = \frac{1}{d_{in}}\]

Standard deviation:

\[\sigma = \frac{1}{\sqrt{d_{in}}}\]

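Implementation (this is the fan-in variant; the original Glorot formula also accounts for the backward pass and uses \(\text{Var}(W) = 2 / (d_{in} + d_{out})\)):

W = np.random.randn(d_in, d_out) / np.sqrt(d_in)  # std = 1 / sqrt(d_in)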

✅ Keeps activation variance roughly constant across layers (in expectation for a linear layer, approximately for tanh near zero)
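The variance-preservation result can be verified empirically for a single linear layer. The sketch below assumes zero-mean, unit-variance Gaussian inputs; the batch size, output width, and the three fan-in values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_out = 2000, 256                  # illustrative sizes (assumptions)

ratios = {}
for d_in in (64, 256, 1024):
    x = rng.standard_normal((batch, d_in))                  # Var(x) close to 1
    W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # Var(W) = 1 / d_in
    y = x @ W                                               # linear layer, no activation
    ratios[d_in] = y.var() / x.var()
    print(f"d_in = {d_in:4d}: Var(y) / Var(x) = {ratios[d_in]:.3f}")
```

The ratio should stay close to 1 regardless of the fan-in, which is exactly what dividing by √d_in is meant to achieve.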


4️⃣ Why Xavier Fails for ReLU

ReLU:

\[\text{ReLU}(z) = \max(0, z)\]

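For zero-mean, symmetric pre-activations, it zeroes the negative half of the distribution.

Thus the second moment (mean square) of the output, which is what the next layer's variance depends on, is halved:

\[\mathbb{E}\left[x_{after}^2\right] = \frac{1}{2}\text{Var}(x_{before})\]

So with Xavier scaling, the signal's mean square shrinks by a factor of 1/2 at each layer.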
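A quick numerical check of the halving, assuming Gaussian pre-activations with an arbitrary standard deviation of 3. It reports two quantities: the mean square of the ReLU output (the quantity that feeds the next layer's variance) and its centered variance, which is smaller because the ReLU output has a nonzero mean.

```python
import numpy as np

rng = np.random.default_rng(0)
z = 3.0 * rng.standard_normal(1_000_000)   # zero-mean, symmetric pre-activations
a = np.maximum(0.0, z)                     # ReLU

second_moment_ratio = np.mean(a**2) / np.var(z)   # theory: 1/2
variance_ratio = np.var(a) / np.var(z)            # theory: 1/2 - 1/(2*pi)

print(f"E[ReLU(z)^2] / Var(z) = {second_moment_ratio:.3f}")
print(f"Var(ReLU(z)) / Var(z) = {variance_ratio:.3f}")
```

The factor of 1/2 applies to the mean square; for a Gaussian input the centered variance ratio is 1/2 − 1/(2π), roughly 0.34.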


5️⃣ Kaiming / MSRA Initialization (He Initialization)

Correct for ReLU by doubling the weight variance, which compensates for the factor of 1/2.

We want:

\[d_{in} \cdot \text{Var}(W) \cdot \frac{1}{2} = 1\]

Thus:

\[\text{Var}(W) = \frac{2}{d_{in}}\]

Standard deviation:

\[\sigma = \sqrt{\frac{2}{d_{in}}}\]

Implementation:

W = np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)  # std = sqrt(2 / d_in)

✅ Works well for deep ReLU networks
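The difference between the two scalings shows up clearly in a deep ReLU stack. This is a minimal sketch: the width (500), depth (20), batch size, and the helper name `relu_forward_rms` are assumptions for illustration, and `gain` is a hypothetical parameter meaning Var(W) = gain / d_in (1 for Xavier fan-in, 2 for Kaiming).

```python
import numpy as np

def relu_forward_rms(gain, d=500, depth=20, batch=500, seed=0):
    """RMS of the activations after each ReLU layer, with Var(W) = gain / d_in."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, d))                     # unit-variance inputs
    rms = []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * np.sqrt(gain / d)
        x = np.maximum(0.0, x @ W)                          # ReLU layer
        rms.append(np.sqrt(np.mean(x**2)))
    return rms

xavier_rms = relu_forward_rms(gain=1.0)   # Var(W) = 1 / d_in
he_rms = relu_forward_rms(gain=2.0)       # Var(W) = 2 / d_in

for layer in (1, 5, 10, 20):
    print(f"layer {layer:2d}: Xavier rms = {xavier_rms[layer - 1]:.2e}, "
          f"Kaiming rms = {he_rms[layer - 1]:.2e}")
```

With the Xavier scale the RMS loses a factor of about √2 per layer, while the Kaiming scale keeps it near its initial value.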


📊 Summary

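| Initialization | Var(W) | Effect in deep networks |
|---|---|---|
| Small Gaussian (σ = 0.01) | 0.0001 (too small) | Activations and gradients vanish |
| Large Gaussian (σ = 5) | 25 (too large) | tanh saturates, gradients vanish |
| Xavier | 1/d_in | Stable for tanh |
| Kaiming | 2/d_in | Stable for ReLU |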

🚀 Final Recommendation

  • tanh → Xavier
  • ReLU → Kaiming (He)
  • Deep networks → Always scale by fan-in
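
The recommendation can be folded into one small helper. This is only a sketch: `init_weights` and its `activation` argument are hypothetical names introduced here, and it covers just the two activations discussed in this post.

```python
import numpy as np

def init_weights(d_in, d_out, activation="relu", rng=None):
    """Fan-in scaled Gaussian init: Var(W) = 2/d_in for ReLU, 1/d_in for tanh."""
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        std = np.sqrt(2.0 / d_in)    # Kaiming (He)
    elif activation == "tanh":
        std = np.sqrt(1.0 / d_in)    # Xavier, fan-in variant
    else:
        raise ValueError(f"unsupported activation: {activation!r}")
    return std * rng.standard_normal((d_in, d_out))  # shape (d_in, d_out), used as x @ W

W_relu = init_weights(400, 300, "relu", rng=np.random.default_rng(0))
W_tanh = init_weights(400, 300, "tanh", rng=np.random.default_rng(0))
print(W_relu.std(), W_tanh.std())
```

Deep learning frameworks ship their own initializers, so in practice a helper like this is mainly useful for experiments in plain NumPy.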
This post is licensed under CC BY 4.0 by the author.