Detection Transformer (DETR)
1. Motivation
Traditional detectors:
- R-CNN family → region proposals (anchors since Faster R-CNN), NMS required
- YOLO / SSD → dense prediction + NMS
Problems:
- Hand-designed anchors
- IoU thresholds
- Non-Max Suppression
- Heuristic post-processing
DETR removes all of them.
2. Core Idea of DETR
DETR formulates detection as a set prediction problem.
Instead of predicting many boxes and filtering with NMS, it directly predicts a fixed-size set of objects using:
- Transformer encoder-decoder
- Bipartite matching loss
- No anchors
- No NMS
3. Architecture Overview
3.1 Backbone
CNN extracts feature map:
\[F \in \mathbb{R}^{H' \times W' \times C}\]
Flatten spatial dimensions:
\[F \rightarrow X \in \mathbb{R}^{(H'W') \times C}\]
Add positional encoding:
\[X = X + PE\]
3.2 Transformer Encoder
Self-attention:
\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Where:
\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]
Encoder output:
\[Z \in \mathbb{R}^{(H'W') \times C}\]
This encodes global context.
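The flatten, positional-encoding, and self-attention steps above can be sketched in NumPy. This is a minimal single-head illustration, not the actual DETR encoder: the feature map, the positional encoding `PE`, and the projection matrices `W_Q`, `W_K`, `W_V` are random placeholders, and the sizes `H`, `W`, `C` are assumed toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy backbone output: an H' x W' x C feature map (random placeholder values).
H, W, C = 4, 5, 8
F = rng.normal(size=(H, W, C))

# Flatten the spatial dimensions into a sequence of H'*W' tokens.
X = F.reshape(H * W, C)

# Stand-in positional encoding (assumption: random values instead of a real encoding).
PE = rng.normal(size=(H * W, C))
X = X + PE


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


# Random matrices stand in for the learned projections W_Q, W_K, W_V.
W_Q, W_K, W_V = (rng.normal(size=(C, C)) for _ in range(3))
Z, attn = attention(X @ W_Q, X @ W_K, X @ W_V)

print(Z.shape)     # (20, 8): one C-dimensional vector per spatial position
print(attn.shape)  # (20, 20): every position attends to every other position
```

The `(H'W') x (H'W')` attention matrix is what gives each position access to the whole image, and it is also why the cost grows quadratically with the number of spatial positions.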
3.3 Object Queries (Decoder Input)
Instead of sequence tokens, DETR uses N learnable object queries:
\[Q_{obj} \in \mathbb{R}^{N \times C}\]
Typically:
\[N = 100\]
These queries ask:
"Is there an object corresponding to me?"
3.4 Transformer Decoder
Cross-attention:
\[\text{Attention}(Q_{obj}, Z, Z)\]
Output embeddings:
\[E \in \mathbb{R}^{N \times C}\]
Each embedding corresponds to one object prediction.
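As a rough illustration of the cross-attention step, the sketch below lets N object queries attend over the encoder output. It is deliberately simplified: the learned projections, multiple heads, self-attention among the queries, and the feed-forward sublayers of a real decoder layer are omitted, and `Z` and `Q_obj` are random placeholders with assumed toy sizes.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


num_tokens, C, N = 20, 8, 100  # H'*W' encoder tokens, channel width, number of queries

Z = rng.normal(size=(num_tokens, C))  # placeholder encoder output
Q_obj = rng.normal(size=(N, C))       # learned in a real model; random here

# Cross-attention: queries come from Q_obj, keys and values both come from Z.
scores = Q_obj @ Z.T / np.sqrt(C)     # (N, H'W') similarity of each query to each token
weights = softmax(scores, axis=-1)    # each query's distribution over image positions
E = weights @ Z                       # (N, C): one output embedding per query

print(E.shape)  # (100, 8)
```

The output size depends only on N, not on the image resolution, which is what makes the prediction set fixed-size.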
4. Prediction Heads
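Each embedding $e_i$ passes through a feed-forward network (FFN):
Classification (over the object classes plus a "no object" class):
\[\hat{p}_i = \text{softmax}(W_c e_i)\]
Bounding box regression (here $\sigma$ is the sigmoid function):
\[\hat{b}_i = \sigma(W_b e_i)\]
Bounding box format:
\[(x, y, w, h)\]
Here $(x, y)$ is the box center, and all four values are normalized to [0,1] relative to the image size.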
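The two heads can be sketched as single linear layers followed by softmax and sigmoid. This is a simplification under stated assumptions: the weights `W_c` and `W_b` are random, the sizes are toy values, the code multiplies row vectors (`E @ W_c`) rather than column vectors, and `cxcywh_to_xyxy` is a hypothetical helper name for converting center-format boxes to corner format.

```python
import numpy as np

rng = np.random.default_rng(0)

N, C, num_classes = 100, 8, 3  # num_classes excludes the "no object" class

E = rng.normal(size=(N, C))                  # placeholder decoder embeddings
W_c = rng.normal(size=(C, num_classes + 1))  # last column scores "no object"
W_b = rng.normal(size=(C, 4))


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


probs = softmax(E @ W_c)  # (N, num_classes + 1) class probabilities per query
boxes = sigmoid(E @ W_b)  # (N, 4) boxes as (cx, cy, w, h), each value in [0, 1]


def cxcywh_to_xyxy(b):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)


corners = cxcywh_to_xyxy(boxes)
print(probs.shape, boxes.shape, corners.shape)
```

The sigmoid keeps every box coordinate in [0,1], so predictions are independent of the input resolution and only need scaling by the image width and height at the end.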
5. Bipartite Matching
Let:
Predictions: $\{\hat{y}_i\}_{i=1}^N$
Ground truth: $\{y_j\}_{j=1}^M$
Where:
\[M \le N\]
Optimal assignment:
\[\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i=1}^{M} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})\]
This is solved with the Hungarian algorithm.
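To make the assignment concrete, the sketch below finds the optimal matching for a tiny, made-up cost matrix by trying every possibility. This brute-force search is only for illustration and is exponential in cost; the Hungarian algorithm reaches the same optimum in polynomial time. The function name `best_assignment` and the cost values are hypothetical.

```python
from itertools import permutations

# Hypothetical matching costs: cost[i][j] is the cost of assigning
# ground-truth object i to prediction j (M = 2 objects, N = 4 predictions).
cost = [
    [0.9, 0.2, 0.7, 0.8],
    [0.1, 0.3, 0.6, 0.9],
]


def best_assignment(cost):
    """Exhaustively search for the one-to-one assignment with minimal total cost."""
    M, N = len(cost), len(cost[0])
    best_total, best_sigma = float("inf"), None
    # Try every way of giving each ground-truth object a distinct prediction.
    for sigma in permutations(range(N), M):
        total = sum(cost[i][sigma[i]] for i in range(M))
        if total < best_total:
            best_total, best_sigma = total, sigma
    return best_sigma, best_total


sigma, total = best_assignment(cost)
print(sigma)  # (1, 0): object 0 -> prediction 1, object 1 -> prediction 0

# Predictions left without a ground-truth object.
unmatched = sorted(set(range(len(cost[0]))) - set(sigma))
print(unmatched)  # [2, 3]
```

Because the assignment is one-to-one, two predictions can never both be rewarded for the same object, which is what removes the need for NMS.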
6. Matching Cost
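\[\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = - \hat{p}_{\sigma(i)}(c_i) + \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)})\]
The class term uses the predicted probability itself rather than its logarithm, which keeps it on a scale comparable to the box terms.
Box loss:
\[\mathcal{L}_{box} = \lambda_{box} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{GIoU}\]
\[\mathcal{L}_{L1} = \|b_i - \hat{b}_{\sigma(i)}\|_1\]
\[\mathcal{L}_{GIoU} = 1 - GIoU(b_i, \hat{b}_{\sigma(i)})\]
7. Final Loss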
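With the ground-truth set padded with "no object" entries $\varnothing$ up to size N:
\[\mathcal{L} = \sum_{i=1}^{N} \Big[ \mathcal{L}_{cls} + \mathbb{1}_{\{c_i \neq \varnothing\}} \big( \lambda_{box} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{GIoU} \big) \Big]\]
Here $\mathcal{L}_{cls} = -\log \hat{p}_{\hat{\sigma}(i)}(c_i)$, and the box terms apply only to matched pairs.
Unmatched predictions are assigned the "no object" class, so they contribute only a classification term.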
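The box terms in the loss can be sketched in plain Python. This is an illustrative version under assumptions, not a reference implementation: boxes are given in corner format `(x1, y1, x2, y2)` for both terms, `pair_loss` is a hypothetical helper for a single matched pair, and the default weights `lambda_box=5.0` and `lambda_giou=2.0` are example values.

```python
import math


def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both a and b.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def l1(a, b):
    """L1 distance between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pair_loss(p_true_class, gt_box, pred_box, lambda_box=5.0, lambda_giou=2.0):
    """Loss for one matched pair: classification + weighted L1 + weighted GIoU."""
    cls = -math.log(p_true_class)
    box = lambda_box * l1(gt_box, pred_box)
    overlap = lambda_giou * (1.0 - giou(gt_box, pred_box))
    return cls + box + overlap


gt = (0.0, 0.0, 2.0, 2.0)
pred = (1.0, 1.0, 3.0, 3.0)
print(giou(gt, pred))            # IoU is 1/7, enclosing-box penalty is 2/9
print(pair_loss(0.8, gt, pred))
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap at all: it becomes more negative as the boxes move apart, so the loss still provides a useful gradient early in training.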
8. Limitations
- Slow convergence (about 500 training epochs on COCO)
- Weak on small objects
- No multi-scale features in the original model
9. Conclusion
DETR:
- Removes anchors
- Removes NMS
- Uses global set loss
- Performs end-to-end detection with a transformer