Stochastic Gradient Descent

⚡ Stochastic Gradient Descent (SGD)

🎯 Covers core idea, variants, batch size effects, and practical insights


🚀 Core Idea

Instead of computing gradients using all training examples, SGD computes gradients using a randomly sampled subset of the data.

Each update is therefore much cheaper than a full pass over the data, which is what makes training on large datasets practical.
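
As a sketch of the idea in symbols (the notation here is an assumption for illustration, not taken from the post): let \(\theta_t\) be the parameters at step \(t\), \(\eta\) the learning rate, \(\ell_i\) the loss on training example \(i\), and \(B_t\) the randomly sampled subset used at step \(t\). One update can then be written as

\[
\hat{g}_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \ell_i(\theta_t),
\qquad
\theta_{t+1} = \theta_t - \eta \, \hat{g}_t
\]

If \(B_t\) is drawn uniformly at random, \(\hat{g}_t\) equals the full-data gradient on average, so each step is a noisy but unbiased estimate of the full gradient descent step.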


🔀 Variants of SGD

1️⃣ Pure Stochastic Gradient Descent

  • Updates parameters using a single sample at a time
  • Each update is very cheap ⚡
  • But gradients are very noisy 🌪️
  • With a constant learning rate, tends to oscillate around the optimum instead of settling on it
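
The single-sample update can be sketched as below. This is a minimal illustration and not a reference implementation: it assumes a linear regression model with squared error, synthetic data, and a hypothetical function name `sgd_single_sample`.

```python
import numpy as np

def sgd_single_sample(X, y, lr=0.01, epochs=20, seed=0):
    """Fit y ≈ X @ w, updating w after every single sample (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):   # visit the samples in a fresh random order
            err = X[i] @ w - y[i]      # residual on one sample
            grad = err * X[i]          # gradient of 0.5 * err**2 with respect to w
            w -= lr * grad             # one noisy update per sample
    return w

# Synthetic data (an assumption for this sketch)
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * data_rng.normal(size=200)

w_hat = sgd_single_sample(X, y)
print(w_hat)
```

With a constant learning rate the estimate ends up close to `true_w` but keeps jittering slightly from step to step, which is the oscillation mentioned above.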

2️⃣ Mini‑Batch Gradient Descent (Most Common)

  • Uses a mini-batch of samples to estimate the gradient
  • Typical batch sizes:
\[32,\; 64,\; 128,\; 256,\; \dots,\; 8192\]

The optimal batch size depends on:

  • Problem characteristics 🧠
  • Dataset size 📊
  • Hardware constraints (especially memory) 💻
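
A mini-batch loop could look roughly like this. It is a sketch under assumptions: linear regression with squared error, synthetic data, and a hypothetical function name `minibatch_sgd`. The `batch_size` argument is the knob discussed above.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, epochs=50, seed=0):
    """Fit y ≈ X @ w using the mean gradient over each mini-batch (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # the last batch may be smaller
            err = X[idx] @ w - y[idx]
            grad = X[idx].T @ err / len(idx)       # gradient averaged over the batch
            w -= lr * grad
    return w

# Synthetic data (an assumption for this sketch)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.3])
y = X @ true_w + 0.1 * data_rng.normal(size=512)

w_hat = minibatch_sgd(X, y, batch_size=32)
print(w_hat)
```

Setting `batch_size=1` recovers the single-sample variant, and `batch_size=len(X)` recovers full-batch gradient descent.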

📉 Effect of Batch Size

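Small Mini‑Batch

  • For independently sampled examples, gradient noise (standard deviation) shrinks roughly as 1/√B for batch size B
  • Doubling a small batch (e.g. 8 → 16) removes a large amount of noise in absolute terms
  • Noise reduced → smoother convergence

Large Mini‑Batch

  • The same doubling (e.g. 4096 → 8192) removes far less noise in absolute terms, because the noise is already small
  • Computation cost per update still increases roughly linearly with batch size
  • Known as diminishing returns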
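
The effect of batch size can be checked empirically. The sketch below assumes a synthetic linear regression problem and a hypothetical helper `grad_noise`; it measures how far mini-batch gradients fall from the full gradient at a fixed parameter vector, for several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 5
X = rng.normal(size=(n, d))                   # synthetic data (assumption)
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)                               # evaluate gradients at a fixed point

per_sample_grads = (X @ w - y)[:, None] * X   # one gradient per example, shape (n, d)
full_grad = per_sample_grads.mean(axis=0)

def grad_noise(batch_size, trials=2000):
    """RMS distance between a mini-batch gradient and the full gradient."""
    sq_errs = []
    for _ in range(trials):
        idx = rng.integers(0, n, size=batch_size)   # sample with replacement
        g = per_sample_grads[idx].mean(axis=0)
        sq_errs.append(np.sum((g - full_grad) ** 2))
    return np.sqrt(np.mean(sq_errs))

noise = {b: grad_noise(b) for b in (8, 16, 32, 64, 128, 256)}
for b, v in noise.items():
    print(f"batch={b:4d}  rms gradient error={v:.4f}")
```

Each doubling should shrink the error by a factor of about √2, so the absolute gain from 8 → 16 is several times larger than the gain from 128 → 256, while the work per batch doubles in both cases.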

✅ Advantages of SGD

  • Much cheaper per update than full-batch gradient descent on large datasets ⚡
  • Scales well to massive data 🌍
  • Noise can help escape shallow local minima and saddle points 🧗

🛠️ Practical Notes

Mini‑batch SGD is the standard optimization method in deep learning.

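The plain update is often extended with, or replaced by:

  • Momentum 🚀
  • Adam ⚙️ (adaptive per-parameter step sizes)
  • RMSProp 📈 (adaptive per-parameter step sizes)
  • Learning rate scheduling ⏱️

All of these still work from mini-batch gradients; they typically improve convergence speed and stability.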
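
As one illustration, momentum and a simple learning rate schedule can be added to a mini-batch loop as follows. This is a sketch under assumptions: the heavy-ball form of momentum, a step-decay schedule, synthetic linear regression data, and a hypothetical function name `sgd_momentum`. Deep learning libraries may define these updates slightly differently.

```python
import numpy as np

def sgd_momentum(X, y, batch_size=32, lr0=0.05, beta=0.9,
                 epochs=30, decay=0.5, decay_every=10, seed=0):
    """Mini-batch SGD with heavy-ball momentum and step decay (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    v = np.zeros(d)                                  # running velocity
    for epoch in range(epochs):
        lr = lr0 * decay ** (epoch // decay_every)   # multiply lr by `decay` every few epochs
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            v = beta * v + grad                      # accumulate past gradients
            w -= lr * v                              # step along the velocity
    return w

# Synthetic data (an assumption for this sketch)
data_rng = np.random.default_rng(7)
X = data_rng.normal(size=(512, 3))
true_w = np.array([0.7, -1.2, 2.5])
y = X @ true_w + 0.1 * data_rng.normal(size=512)

w_hat = sgd_momentum(X, y)
print(w_hat)
```

The velocity averages out some of the mini-batch noise, and the shrinking learning rate reduces the remaining jitter late in training.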


🎯 Key Insight

  • SGD trades an exact gradient for speed and scalability
  • Mini‑batch provides a balance between:
    • Stability 📊
    • Speed ⚡
  • Widely used in modern machine learning and deep learning
This post is licensed under CC BY 4.0 by the author.