Detection Transformer (DETR)


🟢 1. Motivation

Traditional detectors:

  • R-CNN family → region proposals (anchor-based from Faster R-CNN onward), NMS required
  • YOLO / SSD → dense prediction + NMS

❗ Problems:

  • Hand-designed anchors
  • IoU thresholds
  • Non-Max Suppression
  • Heuristic post-processing

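DETR removes anchors, NMS, and the heuristic post-processing around them. IoU survives only as a term inside the loss (GIoU), not as a hand-tuned assignment threshold.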


🧠 2. Core Idea of DETR

DETR formulates detection as a set prediction problem.

Instead of predicting many boxes and filtering with NMS, it directly predicts a fixed-size set of objects using:

  • Transformer encoder-decoder
  • Bipartite matching loss
  • No anchors
  • No NMS

🏗 3. Architecture Overview

3.1 Backbone

A CNN backbone extracts a feature map:

\[F \in \mathbb{R}^{H' \times W' \times C}\]

Flatten spatial dimensions:

\[F \rightarrow X \in \mathbb{R}^{(H'W') \times C}\]

Add positional encoding:

\[X = X + PE\]
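The flatten-and-add step can be sketched in a few lines of NumPy. This is a minimal illustration, not DETR's actual code: the helper names (`sine_positions`, `flatten_with_positions`) are made up for this post, and the positional encoding is a plain 1D sinusoid over the flattened index, whereas DETR uses a 2D variant that encodes row and column separately.

```python
import numpy as np

def sine_positions(length, channels):
    """1D sinusoidal encoding of shape (length, channels); channels must be even."""
    pos = np.arange(length)[:, None]                                  # (length, 1)
    freq = 1.0 / (10000 ** (np.arange(0, channels, 2) / channels))    # (channels/2,)
    angles = pos * freq                                               # (length, channels/2)
    enc = np.zeros((length, channels))
    enc[:, 0::2] = np.sin(angles)   # even channels
    enc[:, 1::2] = np.cos(angles)   # odd channels
    return enc

def flatten_with_positions(feature_map):
    """Turn an (H', W', C) feature map into an (H'*W', C) token matrix X + PE."""
    h, w, c = feature_map.shape
    x = feature_map.reshape(h * w, c)   # row-major: token index = row * w + col
    return x + sine_positions(h * w, c)

# Random stand-in for a backbone output with H'=4, W'=6, C=8.
F = np.random.default_rng(0).normal(size=(4, 6, 8))
X = flatten_with_positions(F)
print(X.shape)  # (24, 8)
```

Each of the 24 rows of `X` is one spatial location of the feature map, now treated as a token.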

3.2 Transformer Encoder

Self-attention:

\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where:

\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Encoder output:

\[Z \in \mathbb{R}^{(H'W') \times C}\]

This encodes global context.
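To make the attention formula concrete, here is a single-head NumPy sketch. It is deliberately stripped down: the weights are random instead of learned, and the multi-head split, residual connections, layer norm, and feed-forward sublayer of a real encoder layer are all left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights

def self_attention(x, w_q, w_k, w_v):
    """Self-attention: queries, keys and values are all projections of x."""
    return attention(x @ w_q, x @ w_k, x @ w_v)

rng = np.random.default_rng(0)
tokens, channels = 24, 8                      # H'W' = 24, C = 8 (toy sizes)
X = rng.normal(size=(tokens, channels))
W_Q, W_K, W_V = (rng.normal(size=(channels, channels)) for _ in range(3))

Z, A = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, A.shape)  # (24, 8) (24, 24)
```

The weight matrix `A` has one row per token and one column per token, which is why every location can draw on every other location in a single layer.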


3.3 Object Queries (Decoder Input)

Instead of sequence tokens, DETR uses N learnable object queries:

\[Q_{obj} \in \mathbb{R}^{N \times C}\]

Typically:

\[N = 100\]

These queries ask:

“Is there an object corresponding to me?”


3.4 Transformer Decoder

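Each decoder layer applies self-attention among the object queries, then cross-attention from the queries to the encoder output: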

\[\text{Attention}(Q_{obj}, Z, Z)\]

Output embeddings:

\[E \in \mathbb{R}^{N \times C}\]

Each embedding corresponds to one object prediction.
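A rough sketch of the cross-attention step, under simplifying assumptions: the queries and encoder output are random placeholders (in DETR the queries are learned parameters), there are no learned projections, and the decoder's self-attention and feed-forward sublayers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory):
    """Attention(Q_obj, Z, Z): each query attends over all encoder tokens."""
    d_k = queries.shape[-1]
    weights = softmax(queries @ memory.T / np.sqrt(d_k))   # (N, H'W')
    return weights @ memory, weights                        # (N, C), (N, H'W')

rng = np.random.default_rng(0)
N, tokens, C = 100, 24, 8            # N object queries, H'W' = 24 tokens, C = 8
Q_obj = rng.normal(size=(N, C))      # placeholder for the learned object queries
Z = rng.normal(size=(tokens, C))     # placeholder for the encoder output

E, W = cross_attention(Q_obj, Z)
print(E.shape, W.shape)  # (100, 8) (100, 24)
```

The output always has N rows, regardless of how many objects the image contains. That fixed size is what makes the set-prediction formulation possible.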


🎯 4. Prediction Heads

Each output embedding $e_i$ passes through feed-forward network (FFN) heads:

Classification (over the object classes plus a “no object” class):

\[\hat{p}_i = \text{softmax}(W_c e_i)\]

Bounding box regression:

\[\hat{b}_i = \sigma(W_b e_i)\]

Bounding box format (center coordinates, width, height):

\[(x, y, w, h)\]

All four values are normalized to [0, 1] relative to the image size.
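The two heads can be sketched as single linear layers, matching the formulas above. Several things here are assumptions for illustration: the weights are random, the last class index stands for "no object", $(x, y)$ is read as the box center, and `to_pixel_corners` is a hypothetical helper for converting back to pixel coordinates.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(e, w_c, w_b):
    """Apply both heads to embeddings e of shape (N, C)."""
    probs = softmax(e @ w_c.T)    # (N, K+1): K classes plus "no object"
    boxes = sigmoid(e @ w_b.T)    # (N, 4): normalized (x, y, w, h)
    return probs, boxes

def to_pixel_corners(boxes, img_w, img_h):
    """Normalized center-format boxes -> pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = boxes.T
    return np.stack([(cx - w / 2) * img_w, (cy - h / 2) * img_h,
                     (cx + w / 2) * img_w, (cy + h / 2) * img_h], axis=1)

rng = np.random.default_rng(0)
N, C, K = 5, 8, 3
E = rng.normal(size=(N, C))
W_c = rng.normal(size=(K + 1, C))
W_b = rng.normal(size=(4, C))

probs, boxes = predict(E, W_c, W_b)
print(probs.shape, boxes.shape)  # (5, 4) (5, 4)
print(to_pixel_corners(np.array([[0.5, 0.5, 0.2, 0.4]]), img_w=100, img_h=200))
```

The sigmoid is what keeps every box coordinate inside [0, 1], so the same head works for any image resolution.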


🔥 5. Bipartite Matching

Let:

Predictions: $\{\hat{y}_i\}_{i=1}^N$
Ground truth: $\{y_j\}_{j=1}^M$

Where:

\[M \le N\]

Optimal assignment:

\[\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i=1}^{M} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})\]

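This is solved exactly with the Hungarian algorithm. $\hat{\sigma}(i)$ is the index of the prediction assigned to ground-truth object $i$.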
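To see what the assignment does, here is a toy sketch that finds the optimum by brute force over all one-to-one assignments. This is not the Hungarian algorithm itself: it returns the same answer but is only viable for tiny inputs. In practice SciPy's `scipy.optimize.linear_sum_assignment` is the usual choice. The cost values below are made up.

```python
from itertools import permutations

def match(cost):
    """Brute-force optimal one-to-one assignment.

    cost[i][j] is the cost of assigning ground truth i to prediction j
    (M rows, N columns, M <= N). Returns sigma, where sigma[i] is the
    prediction index matched to ground truth i.
    """
    m, n = len(cost), len(cost[0])
    return min(
        permutations(range(n), m),   # every way to pick m distinct predictions
        key=lambda s: sum(cost[i][s[i]] for i in range(m)),
    )

# 2 ground-truth objects, 3 predictions (made-up costs).
cost = [
    [0.1, 0.2, 0.9],
    [0.1, 0.8, 0.9],
]
sigma = match(cost)
unmatched = sorted(set(range(3)) - set(sigma))
print(sigma, unmatched)  # (1, 0) [2]
```

A greedy row-by-row choice would give prediction 0 to the first object and leave the second with prediction 1, for a total cost of 0.9. The optimal matching costs 0.3, which is why the assignment has to be solved globally.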


🧮 6. Matching Cost

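\[\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = -\hat{p}_{\sigma(i)}(c_i) + \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)})\]

Here $c_i$ and $b_i$ are the class and box of ground-truth object $i$. The class term uses the predicted probability itself rather than its log, which keeps it on a scale comparable to the box term.

Box loss:

\[\mathcal{L}_{box} = \lambda_{box} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{GIoU}\]

\[\mathcal{L}_{L1} = \|b_i - \hat{b}_{\sigma(i)}\|_1\]

\[\mathcal{L}_{GIoU} = 1 - GIoU(b_i, \hat{b}_{\sigma(i)})\]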
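The two box terms are easy to compute by hand. The sketch below is plain Python and assumes boxes in normalized center format $(x, y, w, h)$ with positive width and height. The function names are my own.

```python
def to_corners(box):
    """(cx, cy, w, h) -> (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def l1_loss(b, b_hat):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(b, b_hat))

def giou(b, b_hat):
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered by the union."""
    ax0, ay0, ax1, ay1 = to_corners(b)
    bx0, by0, bx1, by1 = to_corners(b_hat)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def giou_loss(b, b_hat):
    return 1.0 - giou(b, b_hat)

gt = (0.25, 0.5, 0.2, 0.2)
pred = (0.75, 0.5, 0.2, 0.2)       # same size, shifted right, no overlap
print(l1_loss(gt, pred))            # 0.5
print(giou(gt, pred) < 0)           # True: disjoint boxes get a negative GIoU
print(giou_loss(gt, gt))
```

Plain IoU is zero for any pair of disjoint boxes, so it cannot tell near misses from far ones. GIoU keeps decreasing as the boxes move apart.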

📦 7. Final Loss

\[\mathcal{L} = \sum_{i=1}^{N} \Big[ \mathcal{L}_{cls} + \lambda_{box} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{GIoU} \Big]\]

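Predictions left unmatched by $\hat{\sigma}$ are assigned the “no object” class as their target, so they contribute only the classification term. The box terms apply to matched predictions alone.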
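Putting the pieces together for one image, a simplified sketch might look like the following. It makes several assumptions: the matching `sigma` is already known, the last class index means "no object", and only the L1 box term is included (the GIoU term would be added in the same place). `lam_box=5.0` follows the value I recall from the DETR paper, so treat it as an assumption. All numbers are made up.

```python
import math

def set_loss(probs, boxes, gt_labels, gt_boxes, sigma, lam_box=5.0):
    """Classification loss for every slot, plus L1 box loss for matched slots.

    probs[j]  : class probabilities of prediction j (last index = "no object")
    boxes[j]  : predicted box of prediction j
    sigma[i]  : index of the prediction matched to ground truth i
    """
    no_object = len(probs[0]) - 1
    assigned = {pred: gt for gt, pred in enumerate(sigma)}   # prediction -> ground truth
    total = 0.0
    for j in range(len(probs)):
        gt = assigned.get(j)
        label = no_object if gt is None else gt_labels[gt]
        total += -math.log(probs[j][label])                  # L_cls for all N slots
        if gt is not None:                                   # box term: matched slots only
            l1 = sum(abs(a - b) for a, b in zip(gt_boxes[gt], boxes[j]))
            total += lam_box * l1
    return total

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
boxes = [[0.5, 0.5, 0.2, 0.2], [0.3, 0.3, 0.1, 0.1], [0.9, 0.9, 0.1, 0.1]]
gt_labels = [1, 0]
gt_boxes = [[0.3, 0.3, 0.1, 0.1], [0.5, 0.5, 0.2, 0.3]]
sigma = (1, 0)   # ground truth 0 -> prediction 1, ground truth 1 -> prediction 0

loss = set_loss(probs, boxes, gt_labels, gt_boxes, sigma)
print(round(loss, 4))
```

Prediction 2 is unmatched here, so its box is ignored and it is only rewarded for putting high probability on "no object".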


⚠ 8. Limitations

  • Slow convergence (on the order of 500 training epochs)
  • Weak on small objects
  • No multi-scale features in the original design

🏁 9. Conclusion

DETR:

  • Removes anchors
  • Removes NMS
  • Uses global set loss
  • End-to-end transformer detection

This post is licensed under CC BY 4.0 by the author.