R-CNN

Posted Feb 23, 2026

🚀 Object Detection Evolution: From R-CNN to Faster R-CNN

🧭 1. Object Detection: Formal Problem Definition

Given an image:

\[I \in \mathbb{R}^{H \times W \times 3}\]

Goal: predict a variable-length set:

\[\mathcal{D} = \{(c_i, b_i, s_i)\}_{i=1}^{N}\]

Where:

🎯 $c_i$ : class label
📦 $b_i = (x_i, y_i, w_i, h_i)$ : bounding box (center format)
🔥 $s_i$ : confidence score
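As a rough illustration, one detection $(c_i, b_i, s_i)$ could be held in a small record like the one below. The `Detection` class and the `to_corners` helper are hypothetical names introduced only for this sketch; the conversion assumes $(x, y)$ is the box center, as stated above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Detection:
    label: int                               # c_i: class index
    box: Tuple[float, float, float, float]   # b_i = (x, y, w, h), center format
    score: float                             # s_i: confidence score


def to_corners(box):
    """Convert a center-format box (x, y, w, h) to corners (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


# A detector returns a variable-length list of such records per image.
detections = [Detection(label=3, box=(50.0, 40.0, 20.0, 10.0), score=0.9)]
```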

We must solve simultaneously:

1️⃣ Classification
2️⃣ Localization (Regression)

🧠 2. Detection = Classification + Localization

A CNN backbone $\phi$ maps the image to a feature vector:

\[f = \phi(I)\]

2.1 Classification Head

\[p = \mathrm{softmax}(W_c f)\]

Cross entropy:

\[L_{cls} = - \sum_{k=1}^{C} y_k \log p_k\]
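A minimal sketch of the classification head's softmax and cross-entropy, in plain Python. It assumes the logits $W_c f$ are already computed and that $y$ is one-hot, so the sum reduces to $-\log p_{target}$. The function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy for a one-hot label: -log of the true-class probability."""
    return -math.log(probs[target])


probs = softmax([2.0, 0.5, -1.0])
loss = cross_entropy(probs, target=0)
```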

2.2 Bounding Box Regression

Predict:

\[\hat{b} = W_r f\]

Naive L2 loss:

\[L_{reg} = \sum_{j \in \{x,y,w,h\}} (b_j - \hat{b}_j)^2\]

2.3 Multi‑Task Learning

\[L = L_{cls} + \lambda L_{reg}\]

📌 This multi-task formulation is foundational for all R-CNN family models.
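The combined objective can be sketched as follows, assuming the classification loss has already been computed and boxes are 4-tuples $(x, y, w, h)$. `lam` stands for $\lambda$; the default of 1.0 is an arbitrary choice for this example.

```python
def l2_box_loss(box_true, box_pred):
    """Naive L2 regression loss summed over (x, y, w, h)."""
    return sum((t - p) ** 2 for t, p in zip(box_true, box_pred))


def multitask_loss(cls_loss, box_true, box_pred, lam=1.0):
    """L = L_cls + lambda * L_reg."""
    return cls_loss + lam * l2_box_loss(box_true, box_pred)


total = multitask_loss(0.5, (0, 0, 2, 2), (1, 0, 2, 4), lam=0.5)
```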

🏗 3. R‑CNN (2014)

Architecture

1️⃣ Region proposal (Selective Search)
2️⃣ ~2000 cropped regions
3️⃣ CNN forward pass per region
4️⃣ SVM classifier + box regressor

❌ Computational Bottleneck

If CNN forward cost = F

Total cost:

\[2000F\]

This makes inference extremely slow: tens of seconds per image on a GPU (about 47 s with VGG-16, i.e. roughly 0.02 FPS).

📦 4. Bounding Box Regression (Core Derivation)

Given proposal:

\[p = (p_x, p_y, p_w, p_h)\]

Ground truth:

\[g = (g_x, g_y, g_w, g_h)\]

We define transformation targets:

\[t_x = \frac{g_x - p_x}{p_w}\] \[t_y = \frac{g_y - p_y}{p_h}\] \[t_w = \log \frac{g_w}{p_w}\] \[t_h = \log \frac{g_h}{p_h}\]

🔎 Why log?
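Because widths and heights are positive, and the log makes the target depend only on the ratio $g_w / p_w$: doubling a box is penalized like halving it, regardless of absolute size. Exponentiating the prediction also guarantees a positive width and height.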

Reconstruction:

\[\hat{g}_x = p_x + p_w \hat{t}_x\] \[\hat{g}_y = p_y + p_h \hat{t}_y\] \[\hat{g}_w = p_w \exp(\hat{t}_w)\] \[\hat{g}_h = p_h \exp(\hat{t}_h)\]
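The targets and their inversion can be checked with a short round trip. This is a sketch under the definitions above; `encode` and `decode` are names chosen for illustration, and boxes are center-format tuples $(x, y, w, h)$.

```python
import math


def encode(p, g):
    """Regression targets (t_x, t_y, t_w, t_h) for proposal p and ground truth g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(p, t):
    """Invert encode: apply predicted offsets t to proposal p."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty,
            pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
truth = (55.0, 60.0, 40.0, 20.0)
targets = encode(proposal, truth)       # t_x = 0.25, t_y = 0.25, t_w = log 2, t_h = log 0.5
recovered = decode(proposal, targets)   # matches truth up to floating-point error
```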

📊 5. IoU (Intersection over Union)

\[IoU = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}\]

Expanded:

\[A_{union} = A_p + A_g - A_{intersection}\]

IoU decides which proposals count as positive or negative training samples, usually by thresholding it against the ground-truth boxes.
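A minimal IoU sketch for axis-aligned boxes. For convenience it assumes corner format $(x_1, y_1, x_2, y_2)$ rather than the center format used earlier; convert first if needed.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap extents, clamped at zero when the boxes do not intersect
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, giving IoU = 1 / (4 + 4 − 1) = 1/7.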

⚡ 6. Fast R‑CNN (2015)

🔥 Key Improvement

Single CNN forward pass per image.

Instead of cropping 2000 regions, we:

1️⃣ Compute conv feature map once
2️⃣ Project proposals onto feature map
3️⃣ Apply RoI Pooling
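Step 2 amounts to dividing image coordinates by the backbone's total stride. The sketch below assumes a stride of 16 (an assumption, typical of a VGG-16 style backbone) and rounds to the nearest feature-map cell; `project_roi` is a hypothetical helper name.

```python
import math


def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell indices."""
    # floor(v / stride + 0.5) rounds half up, avoiding Python's banker's rounding
    return tuple(int(math.floor(v / stride + 0.5)) for v in box)


roi_on_features = project_roi((32, 48, 200, 120))
```

The rounding here is exactly the quantization that later motivates RoI Align.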

🧩 RoI Pooling Mathematics

Given region projected to feature map:

Divide into $k \times k$ bins.

Each bin applies:

\[y_{ij} = \max_{(x,y) \in \text{bin}_{ij}} f(x,y)\]

Output: a fixed-size $k \times k$ feature per channel, regardless of the size of the region.
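The binning can be sketched in plain Python for a single-channel feature map. Assumptions for this example: the RoI is given in inclusive integer feature-map coordinates, and bin edges use floor/ceil so every bin is non-empty (bins may overlap by one cell when the RoI size is not divisible by $k$).

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature, roi, k):
    """Max-pool the RoI (x1, y1, x2, y2), inclusive, into a k x k grid."""
    x1, y1, x2, y2 = roi
    w = x2 - x1 + 1
    h = y2 - y1 + 1
    out = []
    for i in range(k):
        r0 = y1 + (i * h) // k                 # first row of bin i
        r1 = y1 + ceil_div((i + 1) * h, k)     # one past the last row
        row = []
        for j in range(k):
            c0 = x1 + (j * w) // k
            c1 = x1 + ceil_div((j + 1) * w, k)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        out.append(row)
    return out


feature = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
pooled = roi_max_pool(feature, (0, 0, 3, 3), k=2)   # [[6, 8], [14, 16]]
```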

🧮 Fast R-CNN Loss

\[L = L_{cls} + \lambda [u \ge 1] L_{reg}\]

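Where $u$ is the ground-truth class, with $u = 0$ reserved for background. The Iverson bracket $[u \ge 1]$ switches the regression term off for background RoIs, which have no ground-truth box.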

📉 Smooth L1 Loss

\[SmoothL1(x) = \begin{cases} 0.5x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}\]

Why?

Quadratic near zero (stable gradients)
Linear for large errors (robust)
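A small sketch of Smooth L1 and of how it plugs into the Fast R-CNN loss. It assumes `probs` is the softmax output over classes (index 0 = background) and that `t_pred` and `t_true` are 4-tuples of box regression targets; the function names are illustrative.

```python
import math


def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def fast_rcnn_loss(probs, u, t_pred, t_true, lam=1.0):
    """L = L_cls + lambda * [u >= 1] * L_reg for a single RoI."""
    cls = -math.log(probs[u])
    # Background RoIs (u == 0) contribute no regression loss
    reg = sum(smooth_l1(a - b) for a, b in zip(t_pred, t_true)) if u >= 1 else 0.0
    return cls + lam * reg
```

With an error of 0.5 the penalty is 0.125 (quadratic branch); with an error of 2 it is 1.5 (linear branch), where plain L2 would give 4.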

🧠 Difference: R‑CNN vs Fast R‑CNN

Feature	R‑CNN	Fast R‑CNN
CNN Passes	~2000 per image	1 per image
Speed	Very slow	80‑200× faster at test time (excluding proposal time)
Memory	High	Much lower
Training	Multi‑stage	Single‑stage (proposals still external)

🚀 7. Faster R‑CNN (2015)

💡 Main Idea

Replace Selective Search with a learnable Region Proposal Network (RPN) that shares convolutional features with the detector.

📌 8. Anchors

At each feature map location:

Define k anchors.

Example:

3 aspect ratios
3 scales

Total:

\[k = 9\]
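Generating the $k = 9$ anchors at one location can be sketched as below. The scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumed here as the commonly cited defaults; each anchor keeps an area of roughly scale², and the ratio is taken as height / width.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors (cx, cy, w, h) centered at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # r = h / w, so w * h stays at s * s
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


anchors = make_anchors(300.0, 200.0)   # 9 anchors sharing one center
```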

🎯 Anchor Classification

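Positive (IoU with any ground-truth box):

\[IoU > 0.7\]

The anchor with the highest IoU for a given ground-truth box is also marked positive, so every object gets at least one anchor.

Negative (IoU with all ground-truth boxes):

\[IoU < 0.3\]

Anchors that fall in between are ignored during training.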
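The thresholding rule can be sketched as a small labeling function. It takes the IoUs of one anchor against every ground-truth box and returns 1 (positive), 0 (negative), or -1 (ignored). The -1 convention and the function name are assumptions of this sketch. It also leaves out the extra rule from the Faster R-CNN paper that marks the highest-IoU anchor for each ground-truth box as positive.

```python
def label_anchor(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Label one anchor from its IoUs with all ground-truth boxes."""
    best = max(ious) if ious else 0.0   # no ground truth: treat as background
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1                           # between thresholds: excluded from the loss
```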

🧮 9. RPN Output Size

Per location:

2k classification scores
4k regression outputs

Total:

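\[2k + 4k = 6k\]

For $k = 9$ this gives 18 classification scores and 36 regression outputs, 54 values per location.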
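The bookkeeping is easy to check numerically. The 40 × 60 feature map below is a hypothetical size chosen only for illustration.

```python
def rpn_output_channels(k):
    """Per-location output sizes: (classification, regression, total)."""
    return 2 * k, 4 * k, 6 * k


def num_anchors(feat_h, feat_w, k):
    """Total anchors over a feat_h x feat_w feature map."""
    return feat_h * feat_w * k


cls_ch, reg_ch, total_ch = rpn_output_channels(9)   # 18, 36, 54
total_anchors = num_anchors(40, 60, 9)              # 21600
```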

🧮 10. RPN Loss (Full Expression)

Let:

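$p_i$ = predicted objectness probability of anchor $i$
$p_i^*$ = ground truth label (1 = object, 0 = background)
$t_i$ = predicted box transform $(t_x, t_y, t_w, t_h)$
$t_i^*$ = ground truth transform relative to the anchor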

Full loss:

\[L = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)\]

Where:

\[L_{cls} = - p_i^* \log p_i - (1 - p_i^*) \log(1 - p_i)\]

The regression term is active only for positive anchors ($p_i^* = 1$).
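A plain-Python sketch of this loss over a mini-batch of anchors. Assumptions: ignored anchors have already been filtered out, $L_{reg}$ is Smooth L1 summed over the four box coordinates, and `n_cls`, `n_reg`, and `lam` are passed in by the caller rather than fixed to any particular values.

```python
import math


def smooth_l1(x):
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0, eps=1e-12):
    """Normalized binary cross-entropy plus Smooth L1 on positive anchors."""
    cls = 0.0
    reg = 0.0
    for pi, pi_star, ti, ti_star in zip(p, p_star, t, t_star):
        # Binary cross-entropy on objectness; eps guards against log(0)
        cls += -(pi_star * math.log(pi + eps)
                 + (1 - pi_star) * math.log(1 - pi + eps))
        if pi_star == 1:   # regression only for positive anchors
            reg += sum(smooth_l1(a - b) for a, b in zip(ti, ti_star))
    return cls / n_cls + lam * reg / n_reg
```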

🧠 11. Faster R‑CNN Training Strategy

4‑Step Alternating Training:

1️⃣ Train RPN
2️⃣ Train Fast R‑CNN detector
3️⃣ Share conv layers & fine‑tune RPN
4️⃣ Fine‑tune detector

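Later implementations simplified this into (approximate) joint training, where the RPN and the detector are optimized together in a single network.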

🎯 12. RoI Align (Mask R‑CNN)

Problem with RoI Pooling:

Quantization error.

Solution:

Use bilinear interpolation.

Given sample point (x,y):

\[f(x,y) = \sum_{i,j} w_{ij} f(x_i, y_j)\]

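The sum runs over the four grid points surrounding $(x, y)$. Weights decrease with distance: each neighbor gets $w_{ij} = (1 - |x - x_i|)(1 - |y - y_j|)$.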

Removes rounding → improves mask accuracy.
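Bilinear sampling at one fractional point can be sketched as follows, for a single-channel feature map stored as a list of rows. The sketch assumes $0 \le x \le W - 1$ and $0 \le y \le H - 1$; a full RoI Align would sample several such points per bin and then pool them.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate feature[row][col] at fractional position (x = col, y = row)."""
    h = len(feature)
    w = len(feature[0])
    x0 = int(math.floor(x))
    y0 = int(math.floor(y))
    x1 = min(x0 + 1, w - 1)   # clamp at the right/bottom border
    y1 = min(y0 + 1, h - 1)
    dx = x - x0
    dy = y - y0
    return (feature[y0][x0] * (1 - dx) * (1 - dy)
            + feature[y0][x1] * dx * (1 - dy)
            + feature[y1][x0] * (1 - dx) * dy
            + feature[y1][x1] * dx * dy)


grid = [[0.0, 10.0],
        [20.0, 30.0]]
center = bilinear_sample(grid, 0.5, 0.5)   # 15.0, the mean of the four corners
```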

📊 13. Speed Comparison

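Model	FPS (approximate, GPU)
R‑CNN	~0.02 (about 47 s per image with VGG‑16)
Fast R‑CNN	~0.5 (including Selective Search)
Faster R‑CNN	5–17 (VGG‑16 to ZF backbone)

The speed gain of Faster R‑CNN comes from replacing the external proposal step with the RPN, which reuses the detector's convolutional features.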

🧬 14. Evolution Summary

🟥 R‑CNN

Accurate
Extremely slow
Not end‑to‑end

🟧 Fast R‑CNN

Shared conv backbone
RoI pooling
Still external proposal

🟩 Faster R‑CNN

RPN integrated
End‑to‑end
Anchor-based detection
Practical speed

🏁 Final Takeaway

The R‑CNN family progressively:

Reduced redundant CNN computation
Integrated proposal generation
Unified classification + regression
Improved robustness with Smooth L1
Introduced anchor parameterization

This evolution laid the foundation for modern detectors like:

YOLO
SSD
RetinaNet
DETR

This post is licensed under CC BY 4.0 by the author.