Detection Transformer (DETR)

Posted Feb 23, 2026



🟢 1. Motivation

Traditional detectors:

R-CNN family → Region proposals (anchor-based since Faster R-CNN), NMS required
YOLO / SSD → Dense prediction + NMS

❗ Problems:

Hand-designed anchors
IoU thresholds
Non-Max Suppression
Heuristic post-processing

DETR removes the anchors and the NMS step, and with them the hand-tuned heuristics around both.

🧠 2. Core Idea of DETR

DETR formulates detection as a set prediction problem.

Instead of predicting many boxes and filtering with NMS, it directly predicts a fixed-size set of objects using:

Transformer encoder-decoder
Bipartite matching loss
No anchors
No NMS

🏗 3. Architecture Overview

3.1 Backbone

A CNN backbone extracts a feature map:

\[F \in \mathbb{R}^{H' \times W' \times C}\]

Flatten spatial dimensions:

\[F \rightarrow X \in \mathbb{R}^{(H'W') \times C}\]

Add positional encoding:

\[X = X + PE\]
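The flatten-and-add step above can be sketched in plain Python. This is a minimal illustration, not the post's implementation: the helper names `flatten_feature_map` and `sinusoidal_pe` are hypothetical, and the 1D sine/cosine encoding over the flattened token index is an assumption made for brevity, since the post does not say how $PE$ is built.

```python
import math

def flatten_feature_map(F):
    """Turn an H' x W' x C nested list into an (H'*W') x C list of token vectors."""
    return [list(cell) for row in F for cell in row]

def sinusoidal_pe(num_tokens, C):
    """Assumed 1D sine/cosine encoding: one C-dimensional vector per token index."""
    pe = []
    for pos in range(num_tokens):
        vec = []
        for k in range(C):
            # Channel pairs (2m, 2m+1) share one frequency.
            freq = 1.0 / (10000 ** (2 * (k // 2) / C))
            vec.append(math.sin(pos * freq) if k % 2 == 0 else math.cos(pos * freq))
        pe.append(vec)
    return pe

# Toy feature map with H' = 2, W' = 3, C = 4 (values are arbitrary).
H, W, C = 2, 3, 4
F = [[[0.1 * (r + c)] * C for c in range(W)] for r in range(H)]

X = flatten_feature_map(F)              # shape (H'W') x C = 6 x 4
PE = sinusoidal_pe(len(X), C)           # same shape as X
X = [[x + p for x, p in zip(xv, pv)] for xv, pv in zip(X, PE)]

print(len(X), len(X[0]))
```

After flattening, the spatial layout survives only through the added encoding, which is why the positional term is needed before attention is applied.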

3.2 Transformer Encoder

Self-attention:

\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where:

\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Encoder output:

\[Z \in \mathbb{R}^{(H'W') \times C}\]

Every row of $Z$ can attend to every spatial position, so it encodes global image context.
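As a rough sketch of the self-attention formula above, the following single-head version works on nested lists. The identity matrices standing in for $W_Q$, $W_K$, $W_V$ are placeholders for learned weights, and the helper names are hypothetical; a real encoder would also use multiple heads, residual connections, and normalization.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)                                   # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Three tokens with C = 2; identity projections stand in for learned W_Q, W_K, W_V.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = W_K = W_V = [[1.0, 0.0], [0.0, 1.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
Z, weights = attention(Q, K, V)

print(len(Z), len(Z[0]))    # same shape as X
```

Each row of `weights` sums to 1, so every output row is a weighted average over all tokens.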

3.3 Object Queries (Decoder Input)

Instead of sequence tokens, DETR uses N learnable object queries:

\[Q_{obj} \in \mathbb{R}^{N \times C}\]

Typically:

\[N = 100\]

These queries ask:

“Is there an object corresponding to me?”

3.4 Transformer Decoder

Cross-attention between the object queries and the encoder output:

\[\text{Attention}(Q_{obj}, Z, Z)\]

Output embeddings:

\[E \in \mathbb{R}^{N \times C}\]

Each embedding corresponds to one object prediction.
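To make the shapes concrete, here is a small sketch of one cross-attention step in which the queries attend over the encoder output used as both keys and values. It is a simplification under stated assumptions: the function name `cross_attention` is hypothetical, the learned projections are omitted, and the query values are made up rather than learned.

```python
import math

def cross_attention(queries, memory):
    """Each query attends over all memory rows (keys = values = memory)."""
    d_k = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, z)) / math.sqrt(d_k) for z in memory]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the memory rows, one output vector per query.
        outputs.append([sum(w * z[c] for w, z in zip(weights, memory)) for c in range(d_k)])
    return outputs

Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]     # 4 encoder tokens, C = 2
Q_obj = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]           # N = 3 made-up queries

E = cross_attention(Q_obj, Z)
print(len(E), len(E[0]))    # N x C, independent of the number of encoder tokens
```

The number of output embeddings is set by the number of queries, not by the image size, which is what gives DETR a fixed-size prediction set.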

🎯 4. Prediction Heads

Each embedding $e_i$ (row $i$ of $E$) passes through a feed-forward network (FFN):

Classification (over the object classes plus a “no object” class):

\[\hat{p}_i = \text{softmax}(W_c e_i)\]

Bounding box regression:

\[\hat{b}_i = \sigma(W_b e_i)\]

Bounding box format:

\[(c_x, c_y, w, h)\]

Box center, width, and height, all normalized to [0,1] relative to the image size.
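The two heads can be sketched as follows. This is an illustration only: the weight matrices `W_c` and `W_b` hold made-up numbers, the helper names are hypothetical, the last class index is assumed to be “no object”, and $(x, y)$ is treated as the box center when converting to pixel corners.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear(W, e):
    """Matrix-vector product W e."""
    return [sum(w * x for w, x in zip(row, e)) for row in W]

def predict(e, W_c, W_b):
    p_hat = softmax(linear(W_c, e))                  # class probabilities
    b_hat = [sigmoid(v) for v in linear(W_b, e)]     # normalized box in [0, 1]
    return p_hat, b_hat

def to_pixel_corners(box, img_w, img_h):
    """Convert a normalized center-format box to (x0, y0, x1, y1) in pixels."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)

# Made-up weights: 2 object classes + "no object" (index 2), embedding size C = 2.
W_c = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
W_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

e = [1.0, -1.0]
p_hat, b_hat = predict(e, W_c, W_b)
print(p_hat.index(max(p_hat)), to_pixel_corners(b_hat, 640, 480))
```

The sigmoid keeps every box coordinate inside the unit square, so the same head works for any image resolution.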

🔥 5. Bipartite Matching

Let:

Predictions: $\{\hat{y}_j\}_{j=1}^N$, with $\hat{y}_j = (\hat{p}_j, \hat{b}_j)$
Ground truth: $\{y_i\}_{i=1}^M$, with $y_i = (c_i, b_i)$ (class label and box)

Where:

\[M \le N\]

Optimal assignment:

\[\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i=1}^{M} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})\]

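This is solved exactly with the Hungarian algorithm; $\hat{\sigma}(i)$ is the index of the prediction assigned to ground-truth object $i$.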
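To show what the optimal assignment means, the sketch below finds it by brute force over all ways of giving each ground-truth object a distinct prediction. This is not the Hungarian algorithm: exhaustive search is swapped in because it is short and obviously correct for tiny inputs, and in practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would take its place. The cost values and the function name `match` are made up for illustration.

```python
from itertools import permutations

def match(cost):
    """cost[i][j] = cost of assigning ground truth i to prediction j (M rows, N columns, M <= N).

    Returns (sigma, total) where sigma[i] is the prediction index matched to ground truth i.
    """
    M, N = len(cost), len(cost[0])
    best_total, best_sigma = float("inf"), None
    for sigma in permutations(range(N), M):      # every one-to-one assignment
        total = sum(cost[i][sigma[i]] for i in range(M))
        if total < best_total:
            best_total, best_sigma = total, sigma
    return best_sigma, best_total

# M = 2 ground-truth objects, N = 4 predictions.
cost = [
    [0.9, 0.1, 0.8, 0.7],
    [0.2, 0.3, 0.9, 0.6],
]
sigma, total = match(cost)
unmatched = sorted(set(range(len(cost[0]))) - set(sigma))
print(sigma, unmatched)
```

Because the assignment is one-to-one, two predictions can never claim the same object, which is what makes NMS unnecessary.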

🧮 6. Matching Cost

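\[\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = - \hat{p}_{\sigma(i)}(c_i) + \lambda_{box} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{GIoU}\]

The class term uses the predicted probability directly rather than its logarithm, which keeps it on a scale comparable to the box terms.

Box terms:

\[\mathcal{L}_{L1} = \|b_i - \hat{b}_{\sigma(i)}\|_1\]

\[\mathcal{L}_{GIoU} = 1 - GIoU(b_i, \hat{b}_{\sigma(i)})\]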
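The two box terms can be sketched for a single ground-truth and predicted pair. The sketch assumes boxes in normalized center format with positive width and height; the helper names are hypothetical and the example boxes are made up.

```python
def to_corners(b):
    """Center format (cx, cy, w, h) -> corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = b
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def l1_loss(b, b_hat):
    return sum(abs(u - v) for u, v in zip(b, b_hat))

def giou(b, b_hat):
    ax0, ay0, ax1, ay1 = to_corners(b)
    bx0, by0, bx1, by1 = to_corners(b_hat)
    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def giou_loss(b, b_hat):
    return 1.0 - giou(b, b_hat)

b = (0.5, 0.5, 0.4, 0.4)
shifted = (0.7, 0.5, 0.4, 0.4)     # overlaps half of b
far = (0.1, 0.1, 0.1, 0.1)         # no overlap with b

print(l1_loss(b, shifted), giou_loss(b, shifted), giou_loss(b, far))
```

Unlike plain IoU, GIoU goes negative for disjoint boxes, so the loss still varies with the distance between boxes that do not overlap at all.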

📦 7. Final Loss

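\[\mathcal{L} = \sum_{i=1}^{N} \Big[ \mathcal{L}_{cls} + \mathbb{1}_{\{c_i \neq \varnothing\}} \big( \lambda_{box} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{GIoU} \big) \Big]\]

Here the ground-truth set is padded with “no object” ($\varnothing$) entries up to size $N$, and $\mathcal{L}_{cls} = -\log \hat{p}_{\hat{\sigma}(i)}(c_i)$ is the cross-entropy under the optimal assignment $\hat{\sigma}$.

Unmatched predictions are trained to predict “no object” and contribute no box loss.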
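A minimal sketch of how the final loss is assembled once the matching is known. Several things here are assumptions for illustration: the function name `set_loss` is hypothetical, the box loss is passed in as a callable with a plain unweighted L1 as the default stand-in for the weighted L1 and GIoU combination, and the class probabilities are made up.

```python
import math

def l1(b, b_hat):
    return sum(abs(u - v) for u, v in zip(b, b_hat))

def set_loss(preds, targets, sigma, no_object, box_loss=l1):
    """preds: list of (p_hat, b_hat); targets: list of (c, b).

    sigma[i] is the index of the prediction matched to target i.
    """
    matched = {j: i for i, j in enumerate(sigma)}     # prediction index -> target index
    total = 0.0
    for j, (p_hat, b_hat) in enumerate(preds):
        if j in matched:
            c, b = targets[matched[j]]
            total += -math.log(p_hat[c]) + box_loss(b, b_hat)
        else:
            # Unmatched prediction: only the "no object" class term applies.
            total += -math.log(p_hat[no_object])
    return total

# Classes: 0 and 1 are object classes, 2 is "no object".
preds = [
    ([0.1, 0.1, 0.8], (0.2, 0.2, 0.1, 0.1)),
    ([0.7, 0.2, 0.1], (0.5, 0.5, 0.2, 0.2)),
    ([0.05, 0.05, 0.9], (0.9, 0.9, 0.1, 0.1)),
]
targets = [(0, (0.5, 0.5, 0.2, 0.2))]
sigma = (1,)                                          # target 0 is matched to prediction 1

loss = set_loss(preds, targets, sigma, no_object=2)
print(loss)
```

In this toy case the matched box is exact, so the whole loss comes from the three class terms.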

⚠ 8. Limitations

Slow convergence (the original training schedules run for 300 to 500 epochs)
Weak on small objects
No multi-scale features in the original design

🏁 9. Conclusion

DETR:

Removes anchors
Removes NMS
Uses global set loss
End-to-end transformer detection


This post is licensed under CC BY 4.0 by the author.