Learning Rate

🚀 Learning Rate & Scheduling — Complete Notes


1️⃣ How to Choose a Learning Rate

All optimizers (SGD, Momentum, AdaGrad, RMSProp, Adam) require a learning rate $\alpha$ as a hyperparameter.

Update rule for plain gradient descent, where $\alpha_t$ is the learning rate at step $t$:

\[\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta \mathcal{L}(\theta_t)\]

Too High Learning Rate

If $\alpha$ is too large:

\[\theta_{t+1} \text{ overshoots the minimum}\]

The loss may oscillate or diverge instead of decreasing.


Too Low Learning Rate

If $\alpha$ is too small:

\[\theta_{t+1} \approx \theta_t\]

Training becomes extremely slow.
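
To make both failure modes concrete, here is a minimal sketch that runs the update rule on the toy loss $\mathcal{L}(\theta) = \theta^2$. The toy loss, the function name `gradient_descent`, and the specific values of $\alpha$ are illustrative assumptions, not part of the original notes.

```python
def gradient_descent(alpha, theta0=1.0, steps=20):
    """Run plain gradient descent on L(theta) = theta**2 and return the final theta."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta            # dL/dtheta for L = theta^2
        theta = theta - alpha * grad  # theta_{t+1} = theta_t - alpha * grad
    return theta


# Each step multiplies theta by (1 - 2 * alpha), so the behaviour is easy to predict.
too_high = gradient_descent(alpha=1.1)    # factor -1.2: |theta| grows every step
too_low = gradient_descent(alpha=0.001)   # factor 0.998: theta barely moves
moderate = gradient_descent(alpha=0.1)    # factor 0.8: theta shrinks steadily toward 0

print(f"too high: {too_high:.3f}, too low: {too_low:.3f}, moderate: {moderate:.3f}")
```

On this toy problem the minimum is at $\theta = 0$: the large rate moves away from it, the tiny rate stays close to the starting point, and the moderate rate converges.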


2️⃣ Why Use Learning Rate Decay?

Initially:

  • A large $\alpha$ makes fast progress toward a good region of the loss landscape.

Near the optimum:

  • A smaller $\alpha$ is needed for fine convergence, since large steps keep overshooting the minimum.

3️⃣ Step Decay

Reduce the learning rate by a constant factor at fixed milestones.

Example:

\[\alpha \leftarrow 0.1 \alpha\]

at 50% and 75% of training.
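
A minimal sketch of this schedule as a function of the epoch is shown below. The function name `step_decay` and its parameters are hypothetical; the milestones and the factor of 0.1 follow the example above.

```python
def step_decay(alpha0, epoch, total_epochs, milestones=(0.5, 0.75), factor=0.1):
    """Return the learning rate for `epoch` under step decay.

    `milestones` are fractions of training at which the rate is multiplied by `factor`.
    """
    alpha = alpha0
    for m in milestones:
        if epoch >= m * total_epochs:
            alpha *= factor  # apply one decay per milestone already passed
    return alpha


# Learning rate at a few epochs of a 100-epoch run starting from 0.1.
for epoch in (0, 49, 50, 75, 99):
    print(epoch, step_decay(0.1, epoch, total_epochs=100))
```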


4️⃣ Cosine Decay

\[\alpha_t = \frac{1}{2} \alpha_0 \left(1 + \cos\left(\frac{\pi t}{T}\right)\right)\]

Where:

  • $\alpha_0$ = initial learning rate
  • $t$ = current epoch
  • $T$ = total epochs
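
The cosine formula can be sketched directly in code. The function name `cosine_decay` is an assumption made for illustration; the arguments mirror $\alpha_0$, $t$, and $T$ as defined above.

```python
import math


def cosine_decay(alpha0, t, T):
    """Cosine decay: starts at alpha0 when t = 0 and reaches 0 when t = T."""
    return 0.5 * alpha0 * (1.0 + math.cos(math.pi * t / T))


# The rate falls slowly at first, fastest around the midpoint, then flattens out.
for t in (0, 25, 50, 75, 100):
    print(t, round(cosine_decay(0.1, t, T=100), 5))
```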

5️⃣ Linear Decay

\[\alpha_t = \alpha_0 \left(1 - \frac{t}{T}\right)\]
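
As a small sketch, the linear schedule is a one-line function of the current epoch. The name `linear_decay` is hypothetical; $t$ and $T$ are assumed to have the same meaning as in the cosine schedule.

```python
def linear_decay(alpha0, t, T):
    """Linear decay: alpha0 at t = 0, dropping by the same amount each epoch to 0 at t = T."""
    return alpha0 * (1.0 - t / T)


# The rate drops by alpha0 / T per epoch.
for t in (0, 25, 50, 100):
    print(t, linear_decay(0.1, t, T=100))
```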

6️⃣ Inverse Square Root Decay

\[\alpha_t = \frac{\alpha_0}{\sqrt{t}}\]

The formula is defined for $t \ge 1$. Common in Transformer training, typically after a warmup phase.
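
A minimal sketch of inverse square root decay follows. The function name `inverse_sqrt_decay` and the choice to reject $t < 1$ (where the formula divides by zero) are assumptions for illustration.

```python
import math


def inverse_sqrt_decay(alpha0, t):
    """Inverse square root decay; t is a step counter that starts at 1."""
    if t < 1:
        raise ValueError("t must be >= 1, since alpha0 / sqrt(0) is undefined")
    return alpha0 / math.sqrt(t)


# The rate halves each time t grows by a factor of 4, and never reaches zero.
for t in (1, 4, 16, 100):
    print(t, inverse_sqrt_decay(0.1, t))
```

Unlike the cosine and linear schedules, this one needs no total step count $T$.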


7️⃣ Warmup Strategy

A large initial $\alpha$ can make early training unstable or cause the loss to diverge.

Warmup gradually increases the learning rate from near zero to $\alpha_0$:

\[\alpha_t = \alpha_0 \frac{t}{T_{warmup}}\]

for $t \le T_{warmup}$

Then normal decay begins.
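
The sketch below combines linear warmup with a decay phase. Cosine decay over the remaining steps is one possible choice, picked here as an assumption; the function name `warmup_then_cosine` and its parameters are hypothetical.

```python
import math


def warmup_then_cosine(alpha0, t, warmup_steps, total_steps):
    """Linear warmup to alpha0, then cosine decay to 0 over the remaining steps."""
    if t <= warmup_steps:
        # Warmup phase: alpha_t = alpha0 * t / T_warmup
        return alpha0 * t / warmup_steps
    # Decay phase: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * alpha0 * (1.0 + math.cos(math.pi * progress))


# Rate rises for the first 10 steps, peaks at alpha0, then decays.
for t in (0, 5, 10, 55, 100):
    print(t, round(warmup_then_cosine(0.1, t, warmup_steps=10, total_steps=100), 5))
```

Any of the other decay schedules could replace the cosine branch, as long as it starts from $\alpha_0$ at the end of warmup.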


8️⃣ Summary

| Schedule | Formula |
|---|---|
| Step | $\alpha \leftarrow 0.1 \alpha$ |
| Cosine | $\frac{1}{2}\alpha_0(1+\cos(\pi t/T))$ |
| Linear | $\alpha_0(1 - t/T)$ |
| Inverse Sqrt | $\alpha_0 / \sqrt{t}$ |
| Warmup | $\alpha_0 t/T_{warmup}$ |

✅ Practical Rule:

  • Start large
  • Decay gradually
  • Use warmup for deep networks
This post is licensed under CC BY 4.0 by the author.