Stochastic Gradient Descent

Posted Feb 5, 2026 Updated Mar 25, 2026


⚡ Stochastic Gradient Descent (SGD)

🎯 Covers core idea, variants, batch size effects, and practical insights

🚀 Core Idea

Instead of computing gradients using all training examples, SGD computes gradients using a randomly sampled subset of the data.

Each update then costs time proportional to the subset size rather than the full dataset size, which makes training practical on large datasets.
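As a sketch of the update in symbols (the notation here is an assumption for illustration: parameters \(\theta\), learning rate \(\eta\), per-example loss \(\ell_i\), and a randomly drawn index set \(B_t\) at step \(t\)):

\[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta\, \ell_i(\theta_t)\]

Full-batch gradient descent is the special case where \(B_t\) contains every training example.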

🔀 Variants of SGD

1️⃣ Pure Stochastic Gradient Descent

Updates parameters using a single sample at a time
Each update is very cheap ⚡
But gradient estimates are very noisy 🌪️
With a fixed learning rate, may oscillate around the optimum instead of settling on it
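A minimal sketch of the single-sample variant, assuming a least-squares linear model on synthetic data. The function name `sgd_single_sample`, the learning rate, and the data are hypothetical choices for illustration, not something the post prescribes.

```python
import numpy as np

def sgd_single_sample(X, y, lr=0.01, epochs=20, seed=0):
    """Fit y ≈ X @ w by updating on one example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):      # visit samples in a fresh random order
            err = X[i] @ w - y[i]         # residual for a single example
            grad = 2.0 * err * X[i]       # gradient of (x_i . w - y_i)^2 w.r.t. w
            w -= lr * grad                # one parameter update per sample
    return w

# Synthetic, noise-free data (assumed values, for illustration only)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w_hat = sgd_single_sample(X, y)
print(w_hat)
```

Because the data here are noise-free, the iterates can settle on the exact solution; with noisy targets and a fixed learning rate, the estimate would keep jittering around it.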

2️⃣ Mini‑Batch Gradient Descent (Most Common)

Uses a mini-batch of samples to estimate the gradient
Typical batch sizes:

\[32,\; 64,\; 128,\; 256,\; \dots,\; 8192\]

The optimal batch size depends on:

Problem characteristics 🧠
Dataset size 📊
Hardware constraints (especially memory) 💻
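The mini-batch variant can be sketched as follows, again assuming a least-squares linear model. The name `minibatch_sgd`, the default `batch_size=32`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, epochs=50, seed=0):
    """Fit y ≈ X @ w using the average gradient over each mini-batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # mean squared-error gradient
            w -= lr * grad
    return w

# Synthetic data with a little label noise (assumed values)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(1000, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + data_rng.normal(scale=0.1, size=1000)

w_mb = minibatch_sgd(X, y)
print(np.round(w_mb, 2))
```

Setting `batch_size=1` recovers the single-sample variant, and `batch_size=len(X)` recovers full-batch gradient descent.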

📉 Effect of Batch Size

Small Mini‑Batch

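Gradient noise (standard deviation) shrinks roughly as 1/√B for batch size B
Doubling a small batch → large absolute drop in noise, significant gradient stabilization
Noise reduced → smoother convergence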

Large Mini‑Batch

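Each further doubling removes less noise in absolute terms, so the improvement becomes smaller
Computation cost per update increases roughly linearly with batch size
This is a case of diminishing returns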
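The trade-off can be checked empirically. When a mini-batch gradient is an average of independently sampled per-example gradients, its noise falls roughly as 1/√B, so quadrupling the batch size only halves the noise. The sketch below measures this on a synthetic least-squares problem; the helper `gradient_noise` and all data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 4
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(scale=0.5, size=n)
w = np.zeros(d)  # fixed parameter point at which gradients are compared

# Per-example gradients of (x_i . w - y_i)^2, shape (n, d)
per_sample_grads = 2.0 * (X @ w - y)[:, None] * X
full_grad = per_sample_grads.mean(axis=0)

def gradient_noise(batch_size, trials=2000):
    """RMS distance between mini-batch gradients and the full gradient."""
    idx = rng.integers(0, n, size=(trials, batch_size))  # sample with replacement
    batch_grads = per_sample_grads[idx].mean(axis=1)     # shape (trials, d)
    return np.sqrt(((batch_grads - full_grad) ** 2).sum(axis=1).mean())

noise = {b: gradient_noise(b) for b in (8, 16, 32, 64, 128, 256)}
for b, s in noise.items():
    print(f"batch={b:4d}  noise={s:.3f}")
```

Each doubling shrinks the printed noise by a similar ratio, but the absolute drop gets smaller and smaller while the work per update keeps doubling.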

✅ Advantages of SGD

Much cheaper per update than full-batch gradient descent on large datasets ⚡
Scales well to massive data 🌍
Noise can help escape shallow local minima and saddle points 🧗

🛠️ Practical Notes

Mini‑batch SGD is the standard optimization method in deep learning.

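Often paired with momentum or a learning rate schedule, or replaced by adaptive optimizers built on the same mini-batch gradients: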

Momentum 🚀
Adam ⚙️
RMSProp 📈
Learning rate scheduling ⏱️

These techniques improve convergence speed and stability.
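As one illustration, here is a sketch of mini-batch SGD with classical (heavy-ball) momentum and a step-decay learning rate schedule, written in plain NumPy. The function name `sgd_momentum` and every hyperparameter value are assumptions chosen for the example; Adam and RMSProp are not shown.

```python
import numpy as np

def sgd_momentum(X, y, batch_size=64, lr0=0.05, beta=0.9,
                 decay=0.5, decay_every=10, epochs=30, seed=0):
    """Least-squares fit using momentum and a step-decay learning rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    v = np.zeros(d)                                   # velocity (running update direction)
    for epoch in range(epochs):
        lr = lr0 * decay ** (epoch // decay_every)    # halve lr every `decay_every` epochs
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            v = beta * v - lr * grad                  # accumulate past gradients
            w += v
    return w

# Synthetic data (assumed values)
data_rng = np.random.default_rng(2)
X = data_rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + data_rng.normal(scale=0.1, size=1000)

w_mom = sgd_momentum(X, y)
print(np.round(w_mom, 2))
```

Momentum averages successive noisy gradients, which smooths the trajectory, while the decaying learning rate shrinks the residual jitter near the optimum.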

🎯 Key Insight

SGD trades an exact gradient for speed and scalability
Mini‑batch SGD provides a balance between:
- Stability 📊
- Speed ⚡
It is widely used in modern machine learning and deep learning


This post is licensed under CC BY 4.0 by the author.