YOLO & SSD
YOLO & SSD: Complete Mathematical & Conceptual Guide
1. YOLO (You Only Look Once): Core Philosophy
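Unlike the R-CNN family, which detects in two stages (region proposals, then classification), YOLO is a single-stage, end-to-end detector. YOLO v1 is a convolutional network that ends in two fully connected layers; later versions are fully convolutional.
It reframes detection as a single regression problem, mapping image pixels directly to bounding boxes and class probabilities.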
2. YOLO v1: Grid-Based Detection
2.1 Image Partitioning
Input image:
\[I \in \mathbb{R}^{H \times W \times 3}\]
The image is divided into an
\[S \times S \text{ grid}\]
The original paper uses:
\[S = 7\]
If an object's center falls inside a grid cell, that cell is responsible for detecting the object.
2.2 Per-Cell Prediction
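Each grid cell predicts:
- $B$ bounding boxes
- $C$ conditional class probabilities $P(\text{class}_c \mid \text{object})$, shared by all $B$ boxes in the cell
The original paper uses:
\[B = 2\]
Each bounding box predicts:
\[(x, y, w, h, C)\]
Where:
- $(x,y)$ = box center, relative to the bounds of the grid cell
- $(w,h)$ = width and height, relative to the whole image
- $C$ = confidence score (the same letter also denotes the number of classes; context disambiguates)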
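The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `encode_yolo_v1`, the pixel corner format `(x_min, y_min, x_max, y_max)`, and the 448×448 image size in the usage line are assumptions made for the example.

```python
def encode_yolo_v1(box, img_w, img_h, S=7):
    """Encode a ground-truth box given as pixel corners (assumed format).

    Returns the (row, col) of the responsible grid cell and the target
    (x, y, w, h): x, y are offsets inside that cell, w, h are fractions
    of the image size.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0 / img_w   # box center, normalized to [0, 1]
    cy = (y_min + y_max) / 2.0 / img_h
    # Clamp so a center exactly on the right/bottom border stays in the grid.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    x = cx * S - col                     # offset within the cell
    y = cy * S - row
    w = (x_max - x_min) / img_w          # size relative to the image
    h = (y_max - y_min) / img_h
    return (row, col), (x, y, w, h)


cell, target = encode_yolo_v1((100, 200, 300, 400), img_w=448, img_h=448)
print(cell)    # (4, 3)
print(target)
```

The box center (200, 300) lands in row 4, column 3 of the 7×7 grid, so only that cell carries a regression target for this object.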
2.3 Confidence Definition
\[C = P(\text{object}) \times IoU(\text{pred}, GT)\]
If no object is present:
\[P(\text{object}) = 0\]
2.4 Output Tensor Size
Total output dimension:
\[S \times S \times (5B + C)\]
For PASCAL VOC ($C = 20$):
\[7 \times 7 \times (5 \cdot 2 + 20) = 7 \times 7 \times 30\]
3. YOLO Inference Pipeline
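1. Predict boxes for all grid cells.
2. Compute class-specific scores by multiplying the cell's conditional class probabilities by each box's confidence:
\[P(\text{class}_c \mid \text{object}) \times P(\text{object}) \times IoU(\text{pred}, GT) = P(\text{class}_c) \times IoU(\text{pred}, GT)\]
3. Apply Non-Max Suppression (NMS).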
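As a rough sketch of step 2, the class-specific scores can be taken to be each box's confidence times the cell's class probabilities, as in the YOLO v1 paper. The flat per-cell layout used here (all box fields first, then the class probabilities) and the name `class_specific_scores` are assumptions for illustration; real implementations may order the tensor differently.

```python
def class_specific_scores(cell_pred, B=2, C=20):
    """Score every (box, class) pair for one grid cell.

    cell_pred is a flat list of length 5*B + C, assumed to be laid out as
    B groups of (x, y, w, h, conf) followed by C class probabilities.
    Returns one list of C scores per box.
    """
    assert len(cell_pred) == 5 * B + C
    class_probs = cell_pred[5 * B:]
    scores = []
    for b in range(B):
        conf = cell_pred[5 * b + 4]          # confidence of box b
        scores.append([conf * p for p in class_probs])
    return scores


# Toy cell with B=2 boxes and C=3 classes.
cell = [0.5, 0.5, 0.2, 0.3, 0.8,   # box 0: x, y, w, h, conf
        0.4, 0.6, 0.1, 0.1, 0.1,   # box 1: x, y, w, h, conf
        0.7, 0.2, 0.1]             # class probabilities
scores = class_specific_scores(cell, B=2, C=3)
```

Box 0 has high confidence, so its scores stay close to the class probabilities, while box 1 is scaled down by its confidence of 0.1 and would typically fall below a score threshold before NMS.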
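4. Non-Max Suppression (NMS)
Algorithm:
- Select the box with the highest confidence
- Compute its IoU with all other boxes
- Remove boxes whose IoU with the selected box is above a threshold $\tau$ (a tunable hyperparameter)
- Repeat on the remaining boxes until none are left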
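The greedy procedure above might look like the following in plain Python. This is a sketch under stated assumptions: boxes are `(left, top, right, bottom)` tuples, the threshold of 0.5 is only an example value, and the `iou` helper follows the formula given in section 4.1.

```python
def iou(a, b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = w * h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)               # highest remaining score
        keep.append(best)
        # Drop every remaining box that overlaps the selected one too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)   # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68) and is suppressed; the third does not overlap and survives. In a detector this is usually run once per class.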
4.1 IoU Formula
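\[IoU = \frac{Area(B_1 \cap B_2)}{Area(B_1 \cup B_2)}\]
Intersection width:
\[w = \max(0, \min(x_2^r, x_1^r) - \max(x_2^l, x_1^l))\]
Intersection height:
\[h = \max(0, \min(y_2^b, y_1^b) - \max(y_2^t, y_1^t))\]
Here $l, r, t, b$ denote the left, right, top, and bottom edges, with $y$ increasing downward. The intersection area is $w \cdot h$, and the union area is $Area(B_1) + Area(B_2) - w \cdot h$.
5. YOLO v1 Loss Function (Full Expansion)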
Full loss:
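\[\lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2]\]
\[+ \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} [(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2]\]
\[+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2\]
\[+ \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2\]
\[+ \sum_{i=1}^{S^2} 1_i^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2\]
5.1 Indicator Variables
\[1_{ij}^{obj} = 1 \text{ if the } j\text{-th box in cell } i \text{ is responsible for an object, else } 0\]
\[1_{ij}^{noobj} = 1 \text{ if the } j\text{-th box in cell } i \text{ is not responsible for any object, else } 0\]
\[1_{i}^{obj} = 1 \text{ if an object appears in cell } i \text{, else } 0\]
Within a cell that contains an object, the responsible box is the one of the $B$ predictions with the highest IoU with the ground truth.
5.2 Why sqrt for width/height?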
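Squared error on raw $w$ and $h$ would penalize the same absolute deviation equally in large and small boxes. Regressing $\sqrt{w}$ and $\sqrt{h}$ partly corrects this, because:
\[\frac{d}{dw} \sqrt{w} = \frac{1}{2\sqrt{w}}\]
The slope grows as $w$ shrinks, so the same absolute error costs more (and yields a larger gradient) for a small box than for a large one.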
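A small numeric check makes the effect concrete. The widths 0.10 and 0.80 and the deviation 0.05 are made-up values chosen only for illustration, and the helper names are hypothetical.

```python
import math


def sq_err(true, pred):
    """Squared error on the raw value."""
    return (true - pred) ** 2


def sqrt_sq_err(true, pred):
    """Squared error on the square roots, as in the width/height loss term."""
    return (math.sqrt(true) - math.sqrt(pred)) ** 2


delta = 0.05                      # same absolute error for both boxes
small_w, large_w = 0.10, 0.80     # widths relative to the image

raw_small = sq_err(small_w, small_w + delta)
raw_large = sq_err(large_w, large_w + delta)
sqrt_small = sqrt_sq_err(small_w, small_w + delta)
sqrt_large = sqrt_sq_err(large_w, large_w + delta)

print(raw_small, raw_large)       # equal up to floating-point rounding
print(sqrt_small, sqrt_large)     # small box penalized several times more
```

With raw widths both boxes incur the same penalty of about 0.0025. With square roots the small box costs about 0.0051 and the large box about 0.00076, so the small-box error weighs several times more.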
5.3 Weight Hyperparameters
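The original paper uses:
\[\lambda_{coord} = 5\]
\[\lambda_{noobj} = 0.5\]
$\lambda_{coord}$ up-weights localization error, while $\lambda_{noobj}$ down-weights the confidence loss from the many boxes that contain no object.
6. YOLO Strengths & Weaknesses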
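Strengths
- Real-time (~45 FPS)
- Global context: the network sees the whole image, which reduces background false positives
- Single forward pass
Weaknesses
- Localization errors
- Struggles with small objects, especially in groups
- Grid constraints: each cell predicts only $B$ boxes and one set of class probabilities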
7. SSD (Single Shot MultiBox Detector)
SSD improves on YOLO v1 with:
- Predictions from multiple feature maps at different scales
- Default (anchor) boxes of several aspect ratios at each of those layers
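8. SSD Core Idea
SSD predicts from several feature maps of decreasing resolution (layer names from the VGG-16-based SSD300):
\[\text{Conv4\_3},\ \text{Conv7},\ \text{Conv8\_2},\ \text{Conv9\_2},\ \text{Conv10\_2},\ \text{Conv11\_2}\]
At every location, each map predicts class scores and offsets for a set of default boxes (anchors).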
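To see how many default boxes this produces, the sketch below tallies them for SSD300. The feature-map sizes, the boxes-per-location counts, and the linear scale rule with `s_min=0.2` and `s_max=0.9` are taken from the SSD paper rather than from this guide, so treat them as assumptions; the function names are hypothetical.

```python
# (layer name, feature-map side length, default boxes per location),
# as reported for SSD300 in the SSD paper (assumed, not stated above).
SSD300_LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]


def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales (fractions of image size) for m layers."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]


def total_default_boxes(layers):
    """Sum of side * side * boxes_per_location over all feature maps."""
    return sum(side * side * k for _, side, k in layers)


scales = default_box_scales(len(SSD300_LAYERS))
total = total_default_boxes(SSD300_LAYERS)
print(total)   # 8732
```

Most of the 8732 boxes come from the high-resolution Conv4_3 map (38 × 38 × 4 = 5776), which is where small objects are matched; the coarse maps contribute few boxes but cover large objects.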
9. SSD Bounding Box Parameterization
Given a default box (center, width, height):
\[d = (d_x, d_y, d_w, d_h)\]
Ground truth:
\[g = (g_x, g_y, g_w, g_h)\]
Regression targets:
\[t_x = \frac{g_x - d_x}{d_w}\]
\[t_y = \frac{g_y - d_y}{d_h}\]
\[t_w = \log\frac{g_w}{d_w}\]
\[t_h = \log\frac{g_h}{d_h}\]
10. SSD Loss Function
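\[L(x,c,l,g) = \frac{1}{N} \left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right)\]
Here $N$ is the number of matched default boxes, $x_{ij} \in \{0,1\}$ indicates that default box $i$ is matched to ground-truth box $j$, and $\alpha$ weights the localization term.
10.1 Localization Loss
Smooth L1 between the predicted offsets $l$ and the encoded targets $t$ of section 9, summed over positive (matched) default boxes:
\[L_{loc} = \sum_{i \in Pos} \sum_{m \in \{x,y,w,h\}} SmoothL1(l_i^m - t_i^m)\]
10.2 Confidence Loss
Softmax cross-entropy over class confidences, with class $p$ for positives and the background class $0$ for negatives:
\[L_{conf} = - \sum_{i \in Pos} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0}\]
Hard negative mining keeps only the highest-loss negatives, so that negatives outnumber positives by at most 3:1.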
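The pieces of sections 9 and 10 can be sketched together in plain Python. This is an illustration under assumptions, not SSD's reference code: boxes are `(cx, cy, w, h)` tuples, Smooth L1 uses the common transition point at 1, the 3:1 negative-to-positive ratio is the value reported in the SSD paper, and all function names are hypothetical.

```python
import math


def encode(g, d):
    """Offsets of ground-truth box g relative to default box d (section 9)."""
    return ((g[0] - d[0]) / d[2],
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),
            math.log(g[3] / d[3]))


def decode(t, d):
    """Inverse of encode: recover (cx, cy, w, h) from offsets t."""
    return (d[0] + t[0] * d[2],
            d[1] + t[1] * d[3],
            d[2] * math.exp(t[2]),
            d[3] * math.exp(t[3]))


def smooth_l1(x):
    """Quadratic near zero, linear beyond |x| = 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def loc_loss(pred_offsets, target_offsets):
    """Smooth L1 summed over the four coordinates of one matched box."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))


def hard_negatives(neg_losses, num_pos, ratio=3):
    """Indices of the highest-loss negatives, at most ratio * num_pos."""
    order = sorted(range(len(neg_losses)),
                   key=lambda i: neg_losses[i], reverse=True)
    return order[: ratio * num_pos]


d = (0.5, 0.5, 0.2, 0.2)          # default box
g = (0.55, 0.48, 0.3, 0.1)        # matched ground truth
t = encode(g, d)                  # regression target for this default box
picked = hard_negatives([0.1, 2.0, 0.5, 1.2, 0.05], num_pos=1)
print(picked)                     # [1, 3, 2]
```

A perfect prediction gives `loc_loss(t, t) == 0`, and `decode(t, d)` recovers the ground-truth box, which is how predicted offsets are turned back into boxes at inference time.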
11. SSD vs YOLO vs Faster R-CNN
| Model | Type | FPS | Accuracy |
|---|---|---|---|
| Faster R-CNN | Two-stage | 5–17 | High |
| YOLO v1 | Single-stage | 45 | Medium |
| SSD | Single-stage | 22–59 | High |
12. Evolution Summary
R-CNN family: accurate but slow
YOLO: fast, but weaker localization
SSD: balances speed and accuracy
Final Takeaway
YOLO introduced:
- End-to-end regression
- Grid-based prediction
- Real-time detection
SSD introduced:
- Multi-scale detection
- Anchor-based single-stage detection
Both paved the way for:
- YOLOv3–v8
- RetinaNet
- EfficientDet
- Anchor-free models