Post

Optimization

Optimization

🚀 Optimization in Deep Learning

From SGD → Momentum → AdaGrad → RMSProp → Adam → Second-Order


🎯 Why Optimization Matters

We want to minimize:

\[J(\theta)\]

by iteratively updating parameters:

\[\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)\]

where:

  • $\alpha$ = learning rate
  • $\nabla J(\theta)$ = gradient

1️⃣ Problems with SGD

📌 SGD Update

\[\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)\]

⚠️ Problem 1: Jittering (Slow Progress)

When curvature differs across directions:

  • Oscillation in steep direction
  • Slow movement in flat direction

This relates to the condition number of the Hessian:

\[\kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}}\]

Large $\kappa$ ⇒ slow convergence.


⚠️ Problem 2: Local Minima & Saddle Points

At stationary points:

\[\nabla J(\theta) = 0\]

Then:

\[\theta_{t+1} = \theta_t\]

No update occurs 😵

In high-dimensional space, saddle points are common.


⚠️ Problem 3: Noisy Gradient (Mini-batch)

SGD uses approximation:

\[\nabla J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla L_i(\theta)\]

Small batch → high variance
Large batch → expensive

Tradeoff between noise & computation ⚖️


2️⃣ SGD + Momentum 🏎️

💡 Idea

Like physics momentum:

\[\text{momentum} = \text{mass} \times \text{velocity}\]

We accumulate gradient history.


🔄 Update Rule

Velocity:

\[v_{t+1} = \rho v_t - \alpha \nabla J(\theta_t)\]

Parameter:

\[\theta_{t+1} = \theta_t + v_{t+1}\]

where:

  • $\rho \approx 0.9$

✅ Benefits

  • Reduces oscillation
  • Accelerates convergence
  • Escapes shallow saddle regions

3️⃣ AdaGrad 📊

💡 Idea

Adapt learning rate per parameter.

Accumulate squared gradients:

\[G_t = \sum_{i=1}^{t} g_i^2\]

Update:

\[\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} g_t\]

❌ Weakness

$G_t$ keeps growing:

\[\frac{\alpha}{\sqrt{G_t}} \to 0\]

Learning rate shrinks too much 📉


4️⃣ RMSProp 🔄

Fixes AdaGrad by exponential moving average.

Running average:

\[s_t = \beta s_{t-1} + (1 - \beta) g_t^2\]

Update:

\[\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_t + \epsilon}} g_t\]

Typical:

  • $\beta = 0.9$

5️⃣ Adam 🔥

Adaptive Moment Estimation
Combines Momentum + RMSProp.


📌 First Moment (Mean)

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\]

📌 Second Moment (Variance)

\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\]

Bias correction:

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\] \[\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

Final update:

\[\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]

🔢 Typical Hyperparameters

ParameterValue
$\beta_1$0.9
$\beta_2$0.999
$\alpha$1e-3

6️⃣ First vs Second Order Optimization


🟢 First-Order Methods

Use gradient only:

\[\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)\]

Examples:

  • SGD
  • Momentum
  • RMSProp
  • Adam

✅ Cheap
✅ Scalable


🔵 Second-Order Methods

Use Hessian:

\[H = \nabla^2 J(\theta)\]

Newton update:

\[\theta^* = \theta_0 - H^{-1} \nabla J(\theta_0)\]

📐 Taylor Expansion

\[J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H (\theta - \theta_0)\]

More accurate curvature modeling 🎯


❌ Why Rare in Deep Learning?

If dimension = $d$

  • Hessian size: $O(d^2)$
  • Inversion cost: $O(d^3)$

Modern networks:

\[d \sim 10^7 \text{ to } 10^9\]

Too expensive 💸


🧠 Summary Table

MethodKey IdeaStrengthWeakness
SGDRaw gradientSimpleSlow, oscillation
MomentumVelocityFasterNeeds tuning
AdaGradPer-dim LRGood for sparseLR shrinks too much
RMSPropEMA scalingStableExtra hyperparameter
AdamMomentum + RMSPropDefault choiceSometimes overfits

🎓 Practical Tips

  • 🔥 Start with Adam
  • 🏎️ Try SGD + Momentum for better generalization
  • 📉 Always use LR scheduling
  • 📊 Monitor validation metric, not just loss

🎉 Final Insight

Optimization is not just about minimizing loss.

It determines:

  • Convergence speed 🚀
  • Stability ⚖️
  • Generalization 📈

Choosing optimizer = choosing training dynamics.

This post is licensed under CC BY 4.0 by the author.