YOLO & SSD

🚀 YOLO & SSD: Complete Mathematical & Conceptual Guide


🟢 1. YOLO (You Only Look Once): Core Philosophy

Unlike the R-CNN family (two-stage detectors), YOLO is:

🎯 A single-stage, end-to-end detector (v1 ends in fully connected layers; later versions are fully convolutional)

It reframes detection as a single regression problem, mapping image pixels directly to bounding boxes and class probabilities.


🧭 2. YOLO v1: Grid-Based Detection

2.1 Image Partitioning

Input image:

\[I \in \mathbb{R}^{H \times W \times 3}\]

Divide into:

\[S \times S \text{ grid}\]

Original paper:

\[S = 7\]

If an object's center falls inside a grid cell → that cell is responsible for detecting the object.
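As a minimal sketch of the assignment rule, the function below maps a box center in pixels to its grid cell. The name `responsible_cell` and the pixel-coordinate convention are assumptions made for illustration, not part of the original paper's code.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell that contains the center (cx, cy)."""
    # Scale the center into grid units, then clamp so a center lying exactly
    # on the right or bottom image border still maps to the last cell.
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col


print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```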


2.2 Per-Cell Prediction

Each grid cell predicts:

  • B bounding boxes
  • C class probabilities

Original paper:

\[B = 2\]

Each bounding box predicts:

\[(x, y, w, h, C)\]

Where:

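  • $(x,y)$ = box center, as an offset within the grid cell (in $[0,1]$)
  • $(w,h)$ = width and height, as fractions of the whole image
  • $C$ = confidence (a per-box score, distinct from the class count $C$ above)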
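The sketch below shows one way to turn a pixel-space ground-truth box into the $(x, y, w, h)$ regression target described above. The function name, the argument order, and the choice to pass the cell's `(row, col)` explicitly are assumptions made for illustration.

```python
def encode_box_for_cell(cx, cy, bw, bh, row, col, img_w, img_h, S=7):
    """Encode a pixel-space box (center cx, cy; size bw, bh) for grid cell (row, col)."""
    x = cx * S / img_w - col   # horizontal offset inside the cell, in [0, 1]
    y = cy * S / img_h - row   # vertical offset inside the cell, in [0, 1]
    w = bw / img_w             # width as a fraction of the image width
    h = bh / img_h             # height as a fraction of the image height
    return x, y, w, h


# A 112 x 224 box centered at (224, 100) in a 448 x 448 image, cell (row=1, col=3).
print(encode_box_for_cell(224, 100, 112, 224, 1, 3, 448, 448))
```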

2.3 Confidence Definition

\[C = P(\text{object}) \times IoU(\text{pred}, GT)\]

If the cell contains no object:

\[P(\text{object}) = 0\]

so the target confidence is also zero.

2.4 Output Tensor Size

Total output dimension:

\[S \times S \times (5B + C)\]

For VOC (C=20):

\[7 \times 7 \times (5 \cdot 2 + 20) = 7 \times 7 \times 30\]
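To make the tensor layout concrete, here is a small sketch that computes the output shape and splits one cell's vector into its parts. The ordering used (all boxes first, then class probabilities) is an assumption for illustration; real implementations may order the channels differently.

```python
def yolo_v1_output_shape(S=7, B=2, num_classes=20):
    """Shape of the YOLO v1 prediction tensor: S x S x (5B + C)."""
    return (S, S, 5 * B + num_classes)


def split_cell_vector(vec, B=2, num_classes=20):
    """Split one cell's flat prediction into B boxes and the class probabilities.

    Assumed layout: [x, y, w, h, conf] repeated B times, then num_classes values.
    """
    assert len(vec) == 5 * B + num_classes
    boxes = [tuple(vec[5 * j: 5 * j + 5]) for j in range(B)]
    class_probs = list(vec[5 * B:])
    return boxes, class_probs


print(yolo_v1_output_shape())  # (7, 7, 30)
```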

🔥 3. YOLO Inference Pipeline

1️⃣ Predict boxes for all grid cells
2️⃣ Compute class-specific scores:

\[\text{Score} = P(c \mid \text{object}) \times C\]

3️⃣ Apply Non-Max Suppression (NMS)
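Step 2 can be sketched for a single grid cell as follows. The helper name `class_scores` and the toy numbers are made up for illustration.

```python
def class_scores(class_probs, box_confidences):
    """scores[j][c] = P(c | object) * C_j for each box j and class c of one cell."""
    return [[p * conf for p in class_probs] for conf in box_confidences]


# One cell with 3 classes and B = 2 boxes (illustrative values).
scores = class_scores([0.1, 0.7, 0.2], [0.9, 0.3])

# Pick the (box, class) pair with the highest class-specific score.
best_score, best_box, best_class = max(
    (s, j, c) for j, row in enumerate(scores) for c, s in enumerate(row)
)
print(best_box, best_class, best_score)
```

Every (box, class) score above a chosen threshold would then be passed on to NMS, typically one class at a time.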


🧮 4. Non-Max Suppression (NMS)

Algorithm:

  1. Select the box with the highest confidence and add it to the output
  2. Compute its IoU with all remaining boxes
  3. Remove every remaining box with:
\[IoU > \theta\]
  4. Repeat on the remaining boxes until none are left

4.1 IoU Formula

\[IoU = \frac{Area(B_1 \cap B_2)}{Area(B_1 \cup B_2)}\]

Intersection width:

\[w = \max(0, \min(x_2^r, x_1^r) - \max(x_2^l, x_1^l))\]

Intersection height:

\[h = \max(0, \min(y_2^b, y_1^b) - \max(y_2^t, y_1^t))\]
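Putting the IoU formula and the NMS steps together, a minimal pure-Python sketch could look like this. Boxes are assumed to be `(x_left, y_top, x_right, y_bottom)` tuples, and the default threshold of 0.5 is an illustrative choice rather than a value taken from the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x_left, y_top, x_right, y_bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    # Union = area(a) + area(b) - intersection.
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns the indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # step 1: highest-scoring remaining box
        keep.append(best)
        # steps 2-3: drop every remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep                      # step 4: loop until nothing is left


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes overlap with IoU 81/119 (about 0.68), so the lower-scoring one is suppressed, while the third box does not overlap anything and survives.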

🧮 5. YOLO v1 Loss Function (Full Expansion)

Full loss:

\[
\begin{aligned}
L ={} & \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
& + \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right] \\
& + \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
& + \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=1}^{S^2} 1_i^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
\]

5.1 Indicator Variables

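\[1_{ij}^{obj} = 1 \text{ if the } j\text{-th box in cell } i \text{ is responsible for an object, else } 0\]
\[1_{ij}^{noobj} = 1 - 1_{ij}^{obj}\]
\[1_{i}^{obj} = 1 \text{ if an object's center falls in cell } i \text{, else } 0\]

The responsible box is the one among the cell's $B$ predictions with the highest IoU with the ground truth.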

5.2 Why sqrt for width/height?

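Because the slope of the square root shrinks as $w$ grows:

\[\frac{d}{dw} \sqrt{w} = \frac{1}{2\sqrt{w}}\]

The same absolute error in $w$ or $h$ is therefore penalized more for small boxes than for large ones, where a small deviation matters less.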
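A quick numeric check illustrates the effect. The box widths and the deviation `delta` below are arbitrary illustrative values, expressed as fractions of the image width.

```python
import math


def plain_err(w, w_hat):
    """Squared error on the raw width."""
    return (w - w_hat) ** 2


def sqrt_err(w, w_hat):
    """Squared error on the square root of the width, as in the YOLO v1 loss."""
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2


delta = 0.05  # same absolute deviation for both boxes
for w in (0.1, 0.8):  # a small box and a large box
    print(w, plain_err(w, w + delta), sqrt_err(w, w + delta))
```

The raw squared error is identical for both boxes, while the square-root version is several times larger for the small box.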


5.3 Weight Hyperparameters

Original paper:

\[\lambda_{coord} = 5\]
\[\lambda_{noobj} = 0.5\]
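The five terms of the loss can be sketched in plain Python as below. The per-cell dictionary layout, the key names, and the use of a precomputed `iou` value as the confidence target for the responsible box are all assumptions made for illustration; predicted widths and heights are assumed to be non-negative so the square roots are defined.

```python
import math

LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5


def yolo_v1_loss(cells):
    """Sum-of-squares YOLO v1 loss over a list of grid cells.

    Each cell is a dict (hypothetical layout):
      'pred_boxes'   : list of B tuples (x, y, w, h, conf)
      'pred_classes' : list of predicted class probabilities
      'target'       : None if the cell holds no object, else a dict with
                       'box' (x, y, w, h), 'classes' (one-hot list),
                       'responsible' (index j of the responsible box),
                       'iou' (confidence target for that box)
    """
    loss = 0.0
    for cell in cells:
        tgt = cell["target"]
        for j, (x, y, w, h, conf) in enumerate(cell["pred_boxes"]):
            if tgt is not None and j == tgt["responsible"]:
                tx, ty, tw, th = tgt["box"]
                # Terms 1 and 2: center and sqrt-size errors of the responsible box.
                loss += LAMBDA_COORD * ((tx - x) ** 2 + (ty - y) ** 2)
                loss += LAMBDA_COORD * ((math.sqrt(tw) - math.sqrt(w)) ** 2
                                        + (math.sqrt(th) - math.sqrt(h)) ** 2)
                # Term 3: confidence error of the responsible box.
                loss += (tgt["iou"] - conf) ** 2
            else:
                # Term 4: down-weighted confidence error, target confidence 0.
                loss += LAMBDA_NOOBJ * conf ** 2
        if tgt is not None:
            # Term 5: class-probability error, only for cells containing an object.
            loss += sum((p - q) ** 2
                        for p, q in zip(tgt["classes"], cell["pred_classes"]))
    return loss
```

For example, a cell without an object whose two boxes predict confidences 0.2 and 0.4 contributes $0.5 \times (0.2^2 + 0.4^2) = 0.1$.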

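📊 6. YOLO Strengths & Weaknesses

✅ Strengths

  • Real-time (~45 FPS)
  • Global context: the network sees the whole image at once
  • Single forward pass

❌ Weaknesses

  • Localization errors
  • Struggles with small objects, especially in groups
  • Grid constraints: each cell predicts a single set of class probabilities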

🟣 7. SSD (Single Shot MultiBox Detector)

SSD improves on YOLO v1 with:

  • Multi-scale feature maps
  • Anchor boxes at multiple layers

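🧠 8. SSD Core Idea

Use multiple feature maps:

\[\text{Conv4\_3},\ \text{Conv7},\ \text{Conv8\_2},\ \text{Conv9\_2},\ \text{Conv10\_2},\ \text{Conv11\_2}\]

Each feature map predicts class scores and box offsets for a fixed set of default boxes (anchors) at every location.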
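To give a sense of scale, the sketch below counts the default boxes produced by these six layers. The feature-map sizes and boxes-per-location are the commonly cited SSD300 configuration (300 × 300 input); they are not stated in the text above and are included here as an assumption.

```python
# (layer name, feature-map side length, default boxes per location) for SSD300.
SSD300_LAYERS = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def total_default_boxes(layers):
    """Total number of default boxes over all feature maps."""
    return sum(side * side * k for _, side, k in layers)


print(total_default_boxes(SSD300_LAYERS))  # 8732
```

Most of the boxes come from the high-resolution early layers, which is what lets SSD cover small objects better than a single coarse grid.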


🧮 9. SSD Bounding Box Parameterization

Given default box:

\[d = (d_x, d_y, d_w, d_h)\]

Ground truth:

\[g = (g_x, g_y, g_w, g_h)\]

Targets:

\[t_x = \frac{g_x - d_x}{d_w}\]
\[t_y = \frac{g_y - d_y}{d_h}\]
\[t_w = \log\frac{g_w}{d_w}\]
\[t_h = \log\frac{g_h}{d_h}\]
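These targets, and the inverse mapping needed at inference time, can be sketched as follows. Boxes are assumed to be `(center_x, center_y, width, height)` tuples; the function names are illustrative. Many implementations additionally scale the targets by fixed constants, which is omitted here.

```python
import math


def encode(g, d):
    """Regression targets t for ground-truth box g relative to default box d."""
    gx, gy, gw, gh = g
    dx, dy, dw, dh = d
    return ((gx - dx) / dw, (gy - dy) / dh, math.log(gw / dw), math.log(gh / dh))


def decode(t, d):
    """Invert encode(): recover a box from predicted offsets t and default box d."""
    tx, ty, tw, th = t
    dx, dy, dw, dh = d
    return (dx + tx * dw, dy + ty * dh, dw * math.exp(tw), dh * math.exp(th))


d = (0.5, 0.5, 0.2, 0.4)       # default box
g = (0.55, 0.45, 0.3, 0.35)    # ground-truth box
t = encode(g, d)
print(t)
print(decode(t, d))            # recovers g up to floating-point error
```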

🧮 10. SSD Loss Function

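\[L(x,c,l,g) = \frac{1}{N} \left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right)\]

where $N$ is the number of matched default boxes (the loss is set to 0 when $N = 0$) and $\alpha$ weights the localization term.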

10.1 Localization Loss

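Smooth L1 over the matched (positive) default boxes:

\[L_{loc}(x,l,g) = \sum_{i \in Pos} \sum_{m \in \{x,y,w,h\}} x_{ij} \, SmoothL1(l_i^m - t_j^m)\]

where $x_{ij} = 1$ when default box $i$ is matched to ground-truth box $j$, and $t_j$ is the encoded target from section 9 for that match.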
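For reference, Smooth L1 is quadratic near zero and linear elsewhere. The sketch below uses the standard definition with the transition at $|x| = 1$; the helper `loc_loss_for_box` is a made-up name that sums it over the four offsets of one matched box.

```python
def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def loc_loss_for_box(pred_offsets, target_offsets):
    """Localization loss of one matched default box over (x, y, w, h) offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))


print(smooth_l1(0.5), smooth_l1(2.0))  # 0.125 1.5
```

The linear tail keeps a single badly localized box from dominating the gradient, unlike a plain squared error.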

10.2 Confidence Loss

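Softmax cross-entropy over positives and selected negatives:

\[L_{conf}(x,c) = - \sum_{i \in Pos} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0}\]

where $\hat{c}_i^{p}$ is the softmax probability of class $p$ for default box $i$ and class 0 is background.

Hard negative mining keeps only the highest-loss negatives, so that the ratio of negatives to positives is at most 3:1.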
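A minimal sketch of the confidence loss with hard negative mining is shown below. The function names and the list-based data layout are assumptions for illustration; label 0 stands for background, and the result is the unnormalized sum, to be divided by $N$ in the total loss.

```python
import math


def softmax_ce(logits, label):
    """Cross-entropy of a softmax over logits for the given class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def conf_loss_with_hnm(logits_per_box, labels, neg_pos_ratio=3):
    """Unnormalized confidence loss with hard negative mining.

    logits_per_box: one list of class logits per default box
    labels: matched class per default box, 0 meaning background
    """
    losses = [softmax_ce(z, y) for z, y in zip(logits_per_box, labels)]
    pos = [l for l, y in zip(losses, labels) if y != 0]
    # Sort background boxes by loss and keep only the hardest ones.
    neg = sorted((l for l, y in zip(losses, labels) if y == 0), reverse=True)
    kept_neg = neg[: neg_pos_ratio * len(pos)]
    return sum(pos) + sum(kept_neg)
```

With no positive boxes, no negatives are kept either, so the function returns 0, which matches the convention of setting the loss to 0 when $N = 0$.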


📊 11. SSD vs YOLO vs Faster R-CNN

| Model        | Type         | FPS   | Accuracy |
|--------------|--------------|-------|----------|
| Faster R-CNN | Two-stage    | 5–17  | High     |
| YOLO v1      | Single-stage | 45    | Medium   |
| SSD          | Single-stage | 22–59 | High     |

🧬 12. Evolution Summary

🔵 R-CNN → Accurate but slow
🟢 YOLO → Fast but weak localization
🟣 SSD → Balance of speed + accuracy


🏁 Final Takeaway

YOLO introduced:

  • End-to-end regression
  • Grid-based prediction
  • Real-time detection

SSD introduced:

  • Multi-scale detection
  • Anchor-based single-stage detection

Both paved the way for:

  • YOLOv3–v8
  • RetinaNet
  • EfficientDet
  • Anchor-free models

This post is licensed under CC BY 4.0 by the author.