YOLO & SSD

Posted Feb 23, 2026

🚀 YOLO & SSD: Complete Mathematical & Conceptual Guide

🟢 1. YOLO (You Only Look Once) — Core Philosophy

Unlike the R-CNN family (two-stage detection), YOLO is a:

🎯 Single-stage, end-to-end detector (v1 ends in fully connected layers; later versions are fully convolutional)

It reframes detection as a single regression problem from image pixels to bounding boxes + class probabilities.

🧭 2. YOLO v1: Grid-Based Detection

2.1 Image Partitioning

Input image:

\[I \in \mathbb{R}^{H \times W \times 3}\]

Divide into:

\[S \times S \text{ grid}\]

Original paper:

\[S = 7\]

If an object's center falls inside a grid cell, that cell is responsible for detecting the object.

2.2 Per-Cell Prediction

Each grid cell predicts:

B bounding boxes
C class probabilities

Original paper:

\[B = 2\]

Each bounding box predicts:

\[(x, y, w, h, C)\]

Where:

$(x,y)$ = center (relative to cell)
$(w,h)$ = width & height (relative to image)
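$C$ = box confidence (the same letter also denotes the number of classes in Sections 2.2 and 2.4; context disambiguates)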
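To make the encoding concrete, here is a minimal sketch that maps a pixel-space box to the responsible cell and its $(x, y, w, h)$ targets. The function name `encode_box`, the 448×448 image size, and the clamping of edge cases are illustrative assumptions, not something the paper prescribes in this form.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box (center cx, cy and size w, h in pixels) to YOLO v1 style targets.

    Hypothetical helper for illustration only.
    """
    # Normalize the center to [0, 1] and scale to grid units.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # The integer part picks the responsible cell; clamp so a center lying
    # exactly on the right/bottom edge still maps to the last cell.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Offset inside the cell, and size relative to the whole image.
    return row, col, (gx - col, gy - row, w / img_w, h / img_h)


row, col, target = encode_box(224, 112, 112, 56, img_w=448, img_h=448)
print(row, col, target)  # 1 3 (0.5, 0.75, 0.25, 0.125)
```

The center lands in row 1, column 3 of the 7×7 grid, halfway across the cell horizontally and three quarters of the way down.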

2.3 Confidence Definition

\[C = P(\text{object}) \times IoU(\text{pred}, GT)\]

If no object:

\[P(\text{object}) = 0\]

2.4 Output Tensor Size

Total output dimension:

\[S \times S \times (5B + C)\]

For VOC (C=20):

\[7 \times 7 \times (5 \times 2 + 20) = 7 \times 7 \times 30\]
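As a quick check of the shape arithmetic, the sketch below builds a stand-in output tensor and splits it into box and class parts. The channel layout (boxes first, then classes) is an assumption made for illustration; real implementations may order the channels differently.

```python
import numpy as np

S, B, C = 7, 2, 20                   # grid size, boxes per cell, classes (VOC)
pred = np.zeros((S, S, 5 * B + C))   # stand-in for a network output

# Assumed layout: B blocks of (x, y, w, h, confidence), then C class scores.
boxes = pred[..., :5 * B].reshape(S, S, B, 5)
class_probs = pred[..., 5 * B:]

print(pred.shape, boxes.shape, class_probs.shape)
# (7, 7, 30) (7, 7, 2, 5) (7, 7, 20)
```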

🔥 3. YOLO Inference Pipeline

1️⃣ Predict boxes for all grid cells
2️⃣ Compute class-specific scores:

\[\text{Score} = P(c \mid \text{object}) \times C\]

3️⃣ Apply Non-Max Suppression (NMS)
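Step 2 can be sketched with NumPy broadcasting. The array names and the random stand-in values are assumptions for illustration; the point is only that every one of the $S \times S \times B$ boxes gets one score per class.

```python
import numpy as np

def class_scores(box_conf, class_probs):
    """box_conf: (S, S, B) box confidences.
    class_probs: (S, S, C) conditional class probabilities per cell.
    Returns (S, S, B, C) class-specific scores."""
    return box_conf[..., :, None] * class_probs[..., None, :]


rng = np.random.default_rng(0)
conf = rng.random((7, 7, 2))     # stand-in confidences
probs = rng.random((7, 7, 20))   # stand-in class probabilities
scores = class_scores(conf, probs)
print(scores.shape)  # (7, 7, 2, 20)
```

With the original settings this yields 98 boxes per image, each with 20 scores, which are then thresholded and passed to NMS.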

🧮 4. Non-Max Suppression (NMS)

Algorithm:

Select the box with the highest confidence and keep it
Compute its IoU with all remaining boxes
Remove boxes with:

\[IoU > \theta\]

Repeat on the remaining boxes until none are left
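The loop above can be sketched as a greedy NumPy routine. The corner format `(x1, y1, x2, y2)`, the default threshold of 0.5, and the assumption that no box has zero area are choices made for this example. In practice NMS is usually run separately for each class.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,).
    Returns indices of the kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]   # candidates, best first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection of the selected box with every remaining box.
        w = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                       - np.maximum(boxes[i, 0], boxes[rest, 0]))
        h = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                       - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop heavy overlaps, then repeat on what is left.
        order = rest[iou <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 ≈ 0.68, so it is suppressed; the third box does not overlap and survives.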

4.1 IoU Formula

\[IoU = \frac{Area(B_1 \cap B_2)}{Area(B_1 \cup B_2)}\]

Intersection width:

\[w = \max(0, \min(x_2^r, x_1^r) - \max(x_2^l, x_1^l))\]

Intersection height:

\[h = \max(0, \min(y_2^b, y_1^b) - \max(y_2^t, y_1^t))\]
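A minimal plain-Python sketch of these formulas, assuming boxes are given as `(left, top, right, bottom)` with $y$ growing downward (the convention the $t$/$b$ superscripts suggest):

```python
def iou(box1, box2):
    """IoU of two boxes given as (left, top, right, bottom)."""
    l1, t1, r1, b1 = box1
    l2, t2, r2, b2 = box2
    w = max(0.0, min(r1, r2) - max(l1, l2))   # intersection width
    h = max(0.0, min(b1, b2) - max(t1, t2))   # intersection height
    inter = w * h
    union = (r1 - l1) * (b1 - t1) + (r2 - l2) * (b2 - t2) - inter
    # Guard against two degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1429
```

Here the two 2×2 boxes share a 1×1 square, so the intersection is 1 and the union is 4 + 4 − 1 = 7.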

🧮 5. YOLO v1 Loss Function (Full Expansion)

Full loss:

\[\lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2]\]
\[+ \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} [(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2]\]
\[+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2\]
\[+ \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2\]
\[+ \sum_{i=1}^{S^2} 1_i^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2\]

5.1 Indicator Variables

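\[1_{ij}^{obj} = 1 \text{ if the } j\text{-th box in cell } i \text{ is responsible for an object (highest IoU with the ground truth), else } 0\]
\[1_{ij}^{noobj} = 1 \text{ if the } j\text{-th box in cell } i \text{ is not responsible for any object, else } 0\]
\[1_i^{obj} = 1 \text{ if an object's center falls in cell } i \text{, else } 0\]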

5.2 Why sqrt for width/height?

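Because the slope of the square root shrinks as $w$ grows:

\[\frac{d}{dw} \sqrt{w} = \frac{1}{2\sqrt{w}}\]

The same absolute error therefore costs more on a small box than on a large one, so small objects receive larger gradients.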
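A small numeric sketch of this effect, using made-up widths (0.1 and 0.8 of the image) and the same absolute error of 0.05 on both:

```python
import math

def size_error(true, pred, use_sqrt):
    """Squared error on a width or height, optionally in sqrt space."""
    if use_sqrt:
        return (math.sqrt(true) - math.sqrt(pred)) ** 2
    return (true - pred) ** 2


delta = 0.05  # identical absolute error for the small and the large box
for w in (0.1, 0.8):
    plain = size_error(w, w + delta, use_sqrt=False)
    rooted = size_error(w, w + delta, use_sqrt=True)
    print(f"w={w}: plain={plain:.6f} sqrt={rooted:.6f}")
```

The plain squared error is the same for both boxes, while the square-root version penalizes the small box several times more than the large one.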

5.3 Weight Hyperparameters

Original:

\[\lambda_{coord} = 5\] \[\lambda_{noobj} = 0.5\]
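Putting the five terms and the two weights together, a single-image version of the loss might look like the sketch below. The tensor layout, the boolean `resp` mask, and taking the target confidence straight from `true_box` are assumptions for illustration; a training implementation would also compute which box is responsible and use the IoU as the confidence target.

```python
import numpy as np

def yolo_v1_loss(pred_box, true_box, pred_cls, true_cls, resp,
                 lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLO v1 style loss for one image.

    pred_box, true_box: (S, S, B, 5) holding (x, y, w, h, confidence)
    pred_cls, true_cls: (S, S, C) class probabilities
    resp: (S, S, B) boolean, True where box j of cell i is responsible
    """
    obj = resp.astype(float)                          # 1_ij^obj
    noobj = 1.0 - obj                                 # 1_ij^noobj
    cell_has_obj = resp.any(axis=-1).astype(float)    # 1_i^obj

    xy = ((pred_box[..., :2] - true_box[..., :2]) ** 2).sum(-1)
    # Clip before sqrt because raw predictions may be slightly negative.
    wh = ((np.sqrt(np.clip(pred_box[..., 2:4], 0, None))
           - np.sqrt(true_box[..., 2:4])) ** 2).sum(-1)
    conf = (pred_box[..., 4] - true_box[..., 4]) ** 2
    cls = ((pred_cls - true_cls) ** 2).sum(-1)

    return (lambda_coord * (obj * (xy + wh)).sum()
            + (obj * conf).sum()
            + lambda_noobj * (noobj * conf).sum()
            + (cell_has_obj * cls).sum())


S, B, C = 7, 2, 20
true_box = np.zeros((S, S, B, 5))
true_cls = np.zeros((S, S, C))
resp = np.zeros((S, S, B), dtype=bool)
true_box[3, 3, 0] = [0.5, 0.5, 0.25, 0.25, 1.0]   # one object in cell (3, 3)
true_cls[3, 3, 7] = 1.0
resp[3, 3, 0] = True

loss = yolo_v1_loss(np.zeros_like(true_box), true_box,
                    np.zeros_like(true_cls), true_cls, resp)
print(loss)  # 7.0
```

For the all-zero prediction, the coordinate terms contribute 5 × (0.5 + 0.5), the object confidence term 1, and the class term 1, giving 7.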

📊 6. YOLO Strengths & Weaknesses

✅ Strengths

Real-time (~45 FPS)
Global context
Single forward pass

❌ Weaknesses

Localization errors
Struggles with small objects
Grid constraints

🟣 7. SSD (Single Shot MultiBox Detector)

SSD improves on YOLO v1 with:

Multi-scale feature maps
Anchor boxes at multiple layers

🧠 8. SSD Core Idea

Use multiple feature maps:

\[\text{Conv4\_3},\ \text{Conv7},\ \text{Conv8\_2},\ \text{Conv9\_2},\ \text{Conv10\_2},\ \text{Conv11\_2}\]

Each map predicts class scores and box offsets for a set of default boxes (anchors) at every location.
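To get a feel for how many boxes this produces, the sketch below counts them for the SSD300 configuration. The feature-map sizes and boxes per location are the commonly cited SSD300 settings and are not stated in this post, so treat them as assumptions.

```python
# Assumed SSD300 settings: feature-map side length and number of
# default boxes per location for each prediction layer.
layers = {
    "Conv4_3": (38, 4),
    "Conv7": (19, 6),
    "Conv8_2": (10, 6),
    "Conv9_2": (5, 6),
    "Conv10_2": (3, 4),
    "Conv11_2": (1, 4),
}

total = 0
for name, (side, boxes_per_cell) in layers.items():
    count = side * side * boxes_per_cell
    total += count
    print(f"{name:9s} {side:2d}x{side:<2d} -> {count} boxes")
print("total default boxes:", total)  # 8732
```

Compare this with the 98 boxes of YOLO v1: the early, high-resolution maps contribute most of the boxes and are the ones that cover small objects.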

🧮 9. SSD Bounding Box Parameterization

Given default box:

\[d = (d_x, d_y, d_w, d_h)\]

Ground truth:

\[g = (g_x, g_y, g_w, g_h)\]

Targets:

\[t_x = \frac{g_x - d_x}{d_w}\]
\[t_y = \frac{g_y - d_y}{d_h}\]
\[t_w = \log\frac{g_w}{d_w}\]
\[t_h = \log\frac{g_h}{d_h}\]
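These four formulas, together with their inverse used at inference time, can be sketched as follows. The function names `encode` and `decode` and the example boxes are hypothetical; boxes are `(cx, cy, w, h)` tuples as in the definitions above.

```python
import math

def encode(g, d):
    """Ground-truth box g -> regression targets relative to default box d."""
    gx, gy, gw, gh = g
    dx, dy, dw, dh = d
    return ((gx - dx) / dw, (gy - dy) / dh,
            math.log(gw / dw), math.log(gh / dh))

def decode(t, d):
    """Inverse of encode: offsets t -> absolute box (cx, cy, w, h)."""
    tx, ty, tw, th = t
    dx, dy, dw, dh = d
    return (dx + tx * dw, dy + ty * dh,
            dw * math.exp(tw), dh * math.exp(th))


d = (0.5, 0.5, 0.2, 0.4)     # default box
g = (0.55, 0.45, 0.3, 0.4)   # ground-truth box
t = encode(g, d)
print(t)
print(decode(t, d))          # recovers g up to floating-point error
```

Dividing by the default box size and taking logs keeps the targets on a similar scale whether the default box is large or small.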

🧮 10. SSD Loss Function

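\[L(x,c,l,g) = \frac{1}{N} \left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right)\]

Here $N$ is the number of matched (positive) default boxes, and $\alpha$ weights the localization term (set to 1 in the original paper).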

10.1 Localization Loss

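Smooth L1 between the predicted offsets $l$ and the encoded targets $t$ from Section 9, summed over the matched (positive) boxes:

\[L_{loc} = \sum_{i \in Pos} \sum_{m \in \{x, y, w, h\}} SmoothL1(l_i^m - t_i^m)\]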
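For reference, Smooth L1 is quadratic near zero and linear for large errors. A minimal NumPy sketch, assuming the usual transition point at 1:

```python
import numpy as np

def smooth_l1(x):
    """Elementwise Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)


print(smooth_l1([-2.0, -0.5, 0.0, 0.5, 2.0]))
# values: 1.5, 0.125, 0.0, 0.125, 1.5
```

The linear tail limits the gradient of badly matched boxes, which makes the regression less sensitive to outliers than a plain squared error.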

10.2 Confidence Loss

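Softmax cross-entropy over the positives and the selected negatives:

\[L_{conf} = - \sum_{i \in Pos} x_{ij}^p \log \hat{c}_i^p - \sum_{i \in Neg} \log \hat{c}_i^0\]

where $\hat{c}_i^p$ is the softmax probability of class $p$ for default box $i$, class 0 is background, and $x_{ij}^p = 1$ when box $i$ is matched to ground-truth box $j$ of class $p$.

Hard negative mining keeps only the highest-loss negatives, so that negatives outnumber positives by at most 3:1.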
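A rough sketch of the confidence loss with hard negative mining is shown below. The function name, the flat `(N, num_classes)` layout, and the default `neg_pos_ratio=3` are assumptions for this example; the division by the number of positives is left to the caller.

```python
import numpy as np

def conf_loss_hnm(logits, labels, neg_pos_ratio=3):
    """Softmax cross-entropy over default boxes with hard negative mining.

    logits: (N, num_classes) raw scores; labels: (N,) ints, 0 = background.
    Returns the summed loss over positives and the selected hard negatives.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels]   # per-box cross-entropy

    pos = labels > 0
    num_neg = min(neg_pos_ratio * int(pos.sum()), int((~pos).sum()))
    # Rank the negatives by loss and keep only the hardest ones.
    neg_ce = np.where(pos, -np.inf, ce)
    hard_neg = np.argsort(neg_ce)[::-1][:num_neg]
    return ce[pos].sum() + ce[hard_neg].sum()


logits = np.zeros((10, 3))          # uniform predictions over 3 classes
labels = np.array([1] + [0] * 9)    # one positive, nine background boxes
print(conf_loss_hnm(logits, labels))  # 4 * log(3), about 4.394
```

With one positive box, only three of the nine negatives enter the loss, so the background class cannot dominate training.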

📊 11. SSD vs YOLO vs Faster R-CNN

Model	Type	FPS	Accuracy
Faster R-CNN	Two-stage	5–17	High
YOLO v1	Single-stage	45	Medium
SSD	Single-stage	22–59	High

🧬 12. Evolution Summary

🔵 R-CNN → Accurate but slow
🟢 YOLO → Fast but weak localization
🟣 SSD → Balance of speed + accuracy

🏁 Final Takeaway

YOLO introduced:

End-to-end regression
Grid-based prediction
Real-time detection

SSD introduced:

Multi-scale detection
Anchor-based single-stage detection

Both paved the way for:

YOLOv3–v8
RetinaNet
EfficientDet
Anchor-free models


This post is licensed under CC BY 4.0 by the author.