Stochastic Gradient Descent
⚡ Stochastic Gradient Descent (SGD)
🎯 Covers core idea, variants, batch size effects, and practical insights
🚀 Core Idea
Instead of computing the gradient over all training examples at every step, SGD estimates it from a randomly sampled subset of the data.
Each update is therefore much cheaper, which makes training practical on large datasets.
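To make the idea concrete, here is a minimal NumPy sketch that compares the full gradient with a stochastic estimate on a synthetic least-squares problem. The data, the helper name `mse_gradient`, and the subset size of 64 are assumptions chosen for illustration only.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

# Synthetic regression data (illustrative assumption, not from the post).
rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 5
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

w = np.zeros(n_features)

# Full gradient: touches every training example.
full_grad = mse_gradient(w, X, y)

# Stochastic estimate: touches only a random subset of 64 examples.
batch_idx = rng.choice(n_samples, size=64, replace=False)
sgd_grad = mse_gradient(w, X[batch_idx], y[batch_idx])

# The estimate is noisy but should point in a similar direction.
cosine = full_grad @ sgd_grad / (np.linalg.norm(full_grad) * np.linalg.norm(sgd_grad))
print(f"cosine similarity between the two gradients: {cosine:.3f}")
```

The stochastic estimate costs a small fraction of the full computation here (64 rows instead of 10,000) while pointing in roughly the same direction.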
🔀 Variants of SGD
1️⃣ Pure Stochastic Gradient Descent
- Updates parameters using a single sample at a time
- Each update is very cheap ⚡
- But the gradient estimates are very noisy 🌪️
- With a fixed learning rate, the parameters may keep oscillating around the optimum instead of settling
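The single-sample update can be sketched as follows. This is a minimal illustration on synthetic linear-regression data; the learning rate, epoch count, and data are assumed values, not tuned recommendations.

```python
import numpy as np

# Synthetic regression data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1_000)

w = np.zeros(3)
learning_rate = 0.01  # assumed value, not tuned

for epoch in range(5):
    # Visit the samples in a fresh random order each epoch.
    for i in rng.permutation(len(y)):
        error = X[i] @ w - y[i]            # scalar prediction error for one sample
        w -= learning_rate * error * X[i]  # gradient of 0.5 * error**2 w.r.t. w

print(w)
```

Because every step uses one sample, `w` keeps jittering slightly around the least-squares solution even after it gets close.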
2️⃣ Mini‑Batch Gradient Descent (Most Common)
- Uses a mini-batch of samples to estimate the gradient
- Typical batch sizes are powers of two, such as 32, 64, 128, or 256
The optimal batch size depends on:
- Problem characteristics 🧠
- Dataset size 📊
- Hardware constraints (especially memory) 💻
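A mini-batch training loop might look like the sketch below. The helper `iterate_minibatches`, the batch size of 64, and the learning rate are hypothetical choices for illustration; deep learning frameworks provide their own data loaders for this purpose.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

# Synthetic regression data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=2_000)

w = np.zeros(4)
learning_rate, batch_size = 0.1, 64  # assumed, untuned values

for epoch in range(20):
    for idx in iterate_minibatches(len(y), batch_size, rng):
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / len(idx)  # average gradient over the batch
        w -= learning_rate * grad

print(w)
```

Averaging over the batch keeps the step size comparable across batch sizes, including the smaller final batch of each epoch.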
📉 Effect of Batch Size
Small Mini‑Batch
- Gradient noise shrinks roughly in proportion to 1/√B for batch size B, so doubling a small batch removes a large absolute amount of noise
- Less noise → smoother convergence
Large Mini‑Batch
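- Each further doubling removes less noise in absolute terms, so the improvement becomes smaller
- Computation cost per update still increases roughly linearly with batch size
- This is a case of diminishing returns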
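The diminishing returns can be checked empirically. The sketch below measures how far mini-batch gradients stray from the full gradient at several batch sizes, on synthetic data and with batches sampled with replacement (both are simplifying assumptions). The helper `gradient_noise` is a hypothetical name introduced for this illustration.

```python
import numpy as np

# Synthetic regression data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50_000)
w = np.zeros(5)

# Per-sample gradients of 0.5 * (x @ w - y)**2, one row per training example.
per_sample_grads = (X @ w - y)[:, None] * X
full_grad = per_sample_grads.mean(axis=0)

def gradient_noise(batch_size, n_trials=2_000):
    """Root-mean-square distance between mini-batch gradients and the full gradient."""
    # Draw n_trials batches at once, sampling indices with replacement.
    idx = rng.integers(0, len(y), size=(n_trials, batch_size))
    batch_grads = per_sample_grads[idx].mean(axis=1)
    return np.sqrt(((batch_grads - full_grad) ** 2).sum(axis=1).mean())

noise = {b: gradient_noise(b) for b in (1, 2, 4, 8, 16, 32, 64, 128, 256)}
for b, value in noise.items():
    print(f"batch size {b:>3}: gradient noise {value:.4f}")
```

Under these assumptions, quadrupling the batch size should roughly halve the noise, so going from 1 to 2 samples removes far more noise in absolute terms than going from 128 to 256, even though the second step costs much more computation.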
✅ Advantages of SGD
- Much cheaper per update than full-batch gradient descent on large datasets ⚡
- Scales well to massive data 🌍
- Noise can help escape shallow local minima and saddle points 🧗
🛠️ Practical Notes
Mini‑batch SGD is the standard optimization method in deep learning.
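Often combined with or extended by:
- Momentum 🚀
- Adam ⚙️ (adaptive per-parameter step sizes plus momentum)
- RMSProp 📈 (adaptive per-parameter step sizes)
- Learning rate scheduling ⏱️
These techniques can improve convergence speed and stability.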
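As one illustration, mini-batch SGD with classical momentum and a step-decay learning rate schedule can be sketched as below. The `step_decay` helper, the momentum coefficient of 0.9, and the schedule parameters are assumptions made for this example; Adam and RMSProp are not shown.

```python
import numpy as np

def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` every `every` epochs (one simple schedule among many)."""
    return initial_lr * drop ** (epoch // every)

# Synthetic regression data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=2_000)

w = np.zeros(4)
velocity = np.zeros(4)
momentum, batch_size = 0.9, 64  # assumed, untuned values

for epoch in range(30):
    lr = step_decay(0.05, epoch)
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        # Velocity is an exponentially decaying sum of past gradient steps.
        velocity = momentum * velocity - lr * grad
        w += velocity

print(w)
```

The velocity term smooths out batch-to-batch noise, while the decaying learning rate shrinks the remaining jitter near the optimum.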
🎯 Key Insight
- SGD trades an exact gradient for speed and scalability
- Mini‑batch training balances stability 📊 against speed ⚡
- It is widely used in modern machine learning and deep learning
This post is licensed under CC BY 4.0 by the author.