Transformer-based Semantic Segmentation
SETR · Segmenter · DPT: Complete Mathematical & Architectural Notes
1️⃣ SETR (SEgmentation TRansformer)
Paper: https://arxiv.org/abs/2012.15840
🔷 Core Idea
SETR was one of the first works to apply a pure ViT encoder to semantic segmentation.
Instead of a CNN encoder plus decoder, SETR uses:
\[\textbf{Image} \rightarrow \textbf{Patch Embedding} \rightarrow \textbf{Transformer Encoder} \rightarrow \textbf{Light Decoder}\]
🔷 Patch Embedding
Input:
\[X \in \mathbb{R}^{H \times W \times C}\]
Split into patches of size:
\[P \times P\]
Number of patches:
\[N = \frac{HW}{P^2}\]
Flatten each patch:
\[x_i \in \mathbb{R}^{P^2C}\]
Linear projection:
\[z_i = W_E x_i\]
Where:
\[W_E \in \mathbb{R}^{D \times (P^2C)}\]
Add positional encoding:
\[z_i^{(0)} = z_i + p_i\]
🔷 Transformer Encoder
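The encoder operates on the token sequence z^{(0)} produced by the patch-embedding steps above. A minimal NumPy sketch of those steps follows; the function name `patch_embed`, the random weights, and the toy sizes are illustrative assumptions rather than anything taken from the SETR code.

```python
import numpy as np

def patch_embed(image, patch, w_e, pos):
    """Split an (H, W, C) image into P x P patches and project each to D dims.

    w_e has shape (D, P*P*C) and pos has shape (N, D), as in the notes.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of P"
    gh, gw = h // patch, w // patch
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P*P*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(gh * gw, patch * patch * c)
    z = x @ w_e.T          # z_i = W_E x_i for every patch at once
    return z + pos         # z_i^(0) = z_i + p_i

# Toy sizes (assumed for illustration): 32x32 RGB image, P = 8, D = 16.
rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 16
N = (H * W) // (P * P)
image = rng.normal(size=(H, W, C))
w_e = rng.normal(size=(D, P * P * C))
pos = rng.normal(size=(N, D))
tokens = patch_embed(image, P, w_e, pos)
print(tokens.shape)  # (16, 16)
```

Patches are ordered row by row, so token i corresponds to grid cell (i // (W/P), i % (W/P)); the decoders later rely on this ordering when reshaping tokens back to 2D.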
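For layer ℓ:
Self-attention:
\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
With:
\[Q = ZW_Q,\quad K = ZW_K,\quad V = ZW_V\]
Here Z is the token matrix entering layer ℓ and d_k is the key dimension per head (d_k = D for a single head).
Stack L layers:
\[Z_L \in \mathbb{R}^{N \times D}\]
Each output token carries global context, because attention mixes all N tokens in every layer.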
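As a rough illustration of the attention formula above, here is a single-head NumPy sketch. With one head the key dimension equals D, so the scale is sqrt(D). Real ViT layers also use multiple heads, residual connections, layer normalisation and an MLP, all omitted here; the names and toy sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, w_q, w_k, w_v):
    """Single-head attention over N tokens of width D (no residual, norm or MLP)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (N, N): every token scores every other token
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
N, D = 16, 8                                   # toy sizes, assumed for illustration
z = rng.normal(size=(N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
out, weights = self_attention(z, w_q, w_k, w_v)
print(out.shape, weights.shape)                # (16, 8) (16, 16)
```

The (N, N) weight matrix is what gives each token a global receptive field, and it is also the term whose size grows quadratically with the number of patches.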
2️⃣ SETR Decoders
🟢 Naive Decoder
- 1×1 Conv projecting token features to class logits
- Bilinear upsampling to the input resolution
- Pixel-wise cross-entropy loss
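The three steps of the naive decoder can be sketched in NumPy as below. This is a simplified illustration, not the SETR implementation: the 1×1 convolution is written as one matrix product over tokens, the bilinear resize uses the align-corners convention for brevity, and all function names and sizes are assumptions.

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Bilinear resize of an (h, w, c) map, align-corners convention."""
    h, w, _ = x.shape
    ys, xs = np.linspace(0, h - 1, out_h), np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None, None], (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def naive_decode(tokens, w_cls, grid_hw, image_hw):
    """Tokens (N, D) -> per-pixel class logits (H, W, K)."""
    gh, gw = grid_hw
    logits = tokens @ w_cls                    # a 1x1 conv is the same linear map at every position
    return bilinear_upsample(logits.reshape(gh, gw, -1), *image_hw)

def pixel_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels; labels is an (H, W) integer map."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    rows, cols = np.indices(labels.shape)
    return -log_probs[rows, cols, labels].mean()

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))              # 4x4 token grid, D = 8 (toy sizes)
w_cls = rng.normal(size=(8, 3))                # K = 3 classes
labels = rng.integers(0, 3, size=(32, 32))
logits = naive_decode(tokens, w_cls, (4, 4), (32, 32))
print(logits.shape, pixel_cross_entropy(logits, labels))
```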
🔵 Progressive Upsampling (PUP)
Alternate convolution and 2× upsampling:
Conv → Upsample → Conv → Upsample
Resolution is restored gradually instead of in one large upsampling step.
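A shape-level sketch of this progressive scheme, assuming a 16×16 patch size so that four 2× stages take the token grid back to the input resolution. Two stand-ins are used and should not be read as the SETR design: a 1×1 channel mixing replaces each convolution, and nearest-neighbour repetition replaces bilinear upsampling. Channel widths and names are also assumptions.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (h, w, c) map (stand-in for bilinear)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def pup_decode(feat, stage_weights, w_cls):
    """Alternate channel mixing and 2x upsampling, then classify every pixel."""
    shapes = [feat.shape]
    for w in stage_weights:
        feat = np.maximum(feat @ w, 0.0)       # 1x1 "conv" + ReLU (stand-in for a real conv)
        feat = upsample2x(feat)                # double the spatial resolution
        shapes.append(feat.shape)
    return feat @ w_cls, shapes                # final 1x1 classifier -> (H, W, K)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 16))             # token grid reshaped to (H/16, W/16, D), toy sizes
widths = [16, 8, 8, 8, 8]                      # channel width before and after each stage
stage_weights = [rng.normal(size=(a, b)) for a, b in zip(widths[:-1], widths[1:])]
w_cls = rng.normal(size=(8, 3))                # K = 3 classes
out, shapes = pup_decode(feat, stage_weights, w_cls)
print(shapes)     # [(4, 4, 16), (8, 8, 8), (16, 16, 8), (32, 32, 8), (64, 64, 8)]
print(out.shape)  # (64, 64, 3)
```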
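🔴 Multi-Level Aggregation (MLA)
Use features from multiple encoder layers (here from a 24-layer encoder):
\[Z^{(6)}, Z^{(12)}, Z^{(18)}, Z^{(24)}\]
Since every layer outputs the same N × D shape, they can be combined element-wise:
\[Z_{agg} = \sum_i Z^{(i)}\]
Then upsample.
Best performance: MLA > PUP > Naive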
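The MLA summation above can be sketched as follows, assuming the per-layer token outputs are kept in a dict keyed by layer index (a hypothetical layout chosen for this example). The MLA head described in the SETR paper also applies convolutions to each stream, which this sketch leaves out.

```python
import numpy as np

def mla_aggregate(layer_tokens, layers, grid_hw):
    """Sum same-shaped token matrices from several layers, then restore the 2D grid."""
    gh, gw = grid_hw
    z_agg = sum(layer_tokens[i] for i in layers)   # element-wise sum, still (N, D)
    return z_agg.reshape(gh, gw, -1)               # (H/P, W/P, D), ready for upsampling

rng = np.random.default_rng(0)
N, D = 16, 8                                       # toy sizes: 4x4 grid of 8-dim tokens
layer_tokens = {i: rng.normal(size=(N, D)) for i in range(1, 25)}  # fake outputs of 24 layers
feat = mla_aggregate(layer_tokens, layers=(6, 12, 18, 24), grid_hw=(4, 4))
print(feat.shape)  # (4, 4, 8)
```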
3️⃣ Segmenter
Paper: https://arxiv.org/abs/2105.05633
🔷 Encoder
Standard ViT:
\[Z_L \in \mathbb{R}^{N \times D}\]
🔷 Mask Transformer Decoder
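Introduce K learnable class embeddings:
\[C \in \mathbb{R}^{K \times D}\]
Append them to the patch tokens:
\[Z' = [Z_L ; C] \in \mathbb{R}^{(N+K) \times D}\]
Run a second Transformer (the decoder) over Z'. Its output splits back into updated patch tokens and class embeddings:
\[\hat{Z} \in \mathbb{R}^{N \times D},\quad \hat{C} \in \mathbb{R}^{K \times D}\]
Compute mask logits via dot-product:
\[M = \hat{Z} \hat{C}^T\]
Where:
\[M \in \mathbb{R}^{N \times K}\]
Reshape to 2D and upsample.
Apply softmax over classes.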
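A toy version of this decoding path, to make the shapes concrete. The decoder here is a single residual attention layer standing in for the full mask transformer, the dot product is taken on the tokens after that joint pass, and the upsampling step is left out; all names, weights and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_decoder_layer(x, w_q, w_k, w_v):
    """One residual single-head attention layer, standing in for the mask transformer."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return x + softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v

def segmenter_mask_logits(patch_tokens, class_emb, weights, grid_hw):
    n = patch_tokens.shape[0]
    joint = np.concatenate([patch_tokens, class_emb], axis=0)   # [Z_L ; C], shape (N + K, D)
    joint = toy_decoder_layer(joint, *weights)                  # patches and classes attend to each other
    z_hat, c_hat = joint[:n], joint[n:]
    masks = z_hat @ c_hat.T                                     # (N, K) mask logits
    gh, gw = grid_hw
    return masks.reshape(gh, gw, -1)                            # one K-vector per grid cell

rng = np.random.default_rng(0)
N, D, K = 16, 8, 3                                              # toy sizes: 4x4 grid, 3 classes
patch_tokens = rng.normal(size=(N, D))
class_emb = rng.normal(size=(K, D))
weights = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3)]
logits = segmenter_mask_logits(patch_tokens, class_emb, weights, grid_hw=(4, 4))
# Bilinear upsampling to the image size would go here (omitted in this sketch).
probs = softmax(logits, axis=-1)                                # softmax over classes
print(probs.shape, probs.argmax(axis=-1))
```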
Impact of Patch Size
Smaller patch size:
- Better spatial precision
- Higher memory
- Higher compute
Trade-off:
Patch Size   Precision   Compute
----------   ---------   --------
32×32        Low         Fast
16×16        Medium      Balanced
8×8          High        Heavy
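To put numbers on this trade-off, the short script below counts tokens and attention-matrix entries for the three patch sizes. The 512×512 input size and the helper name `attention_cost` are assumptions made for the example.

```python
def attention_cost(height, width, patch):
    """Return (number of tokens N, entries in one N x N attention matrix)."""
    n = (height // patch) * (width // patch)
    return n, n * n

for patch in (32, 16, 8):
    n, entries = attention_cost(512, 512, patch)
    print(f"P={patch:>2}  N={n:>5}  N^2={entries:>10,}")
# P=32  N=  256  N^2=    65,536
# P=16  N= 1024  N^2= 1,048,576
# P= 8  N= 4096  N^2=16,777,216
```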
4️⃣ DPT (Dense Prediction Transformer)
Paper: https://arxiv.org/abs/2103.13413
🔷 Core Difference
Instead of decoding from single-scale tokens, DPT reassembles features at multiple resolutions.
🔷 Reassemble Block
Transform token sequence:
\[Z \in \mathbb{R}^{N \times D}\]
Back to spatial grid:
\[Z \rightarrow F \in \mathbb{R}^{H' \times W' \times D}\]
where H' = H/P and W' = W/P, so that N = H'W'.
Project via convolution and resample to different scales.
The resulting multi-scale features are fused in a way similar to FPN.
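A minimal sketch of the reassemble idea: reshape tokens to a grid, project the channels, then resize to a target scale. Nearest-neighbour repetition and average pooling are used here as simple stand-ins for learned up- and down-sampling convolutions, and the same tokens feed every scale only to keep the example short. The function name, scales and sizes are assumptions.

```python
import numpy as np

def reassemble(tokens, grid_hw, w_proj, scale):
    """Tokens (N, D) -> spatial map with w_proj.shape[1] channels, resized by `scale`."""
    gh, gw = grid_hw
    feat = tokens.reshape(gh, gw, -1) @ w_proj          # back to a grid, then 1x1 projection
    if scale >= 1:                                      # upsample: nearest-neighbour repeat
        s = int(scale)
        return feat.repeat(s, axis=0).repeat(s, axis=1)
    s = int(round(1 / scale))                           # downsample: s x s average pooling
    c = feat.shape[-1]
    return feat.reshape(gh // s, s, gw // s, s, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))                      # 8x8 token grid, D = 16 (toy sizes)
w_proj = rng.normal(size=(16, 4))
pyramid = [reassemble(tokens, (8, 8), w_proj, s) for s in (4, 2, 1, 0.5)]
print([f.shape for f in pyramid])
# [(32, 32, 4), (16, 16, 4), (8, 8, 4), (4, 4, 4)]
```

A fusion stage would then combine these maps from coarse to fine, which is where the comparison with FPN comes from.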
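🔷 CLS Token Handling
Options for the CLS (readout) token when reassembling:
- Ignore: drop it and keep only the patch tokens
- Add: add it to every patch token
- Project: concatenate it to every patch token, then project back to D via an MLP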
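The three options can be written out as below. This is an illustrative sketch that assumes the CLS token is stored first in the sequence; the projection uses one linear layer with ReLU for brevity, which is a simplification, and the function names are made up for this example.

```python
import numpy as np

def read_ignore(tokens):
    """Drop the CLS token; tokens[0] is assumed to be CLS."""
    return tokens[1:]

def read_add(tokens):
    """Add the CLS token to every patch token."""
    return tokens[1:] + tokens[:1]

def read_project(tokens, w, b):
    """Concatenate CLS to every patch token, then map 2D -> D with a small MLP."""
    patches = tokens[1:]
    cls = np.broadcast_to(tokens[:1], patches.shape)
    h = np.concatenate([patches, cls], axis=-1) @ w + b
    return np.maximum(h, 0.0)   # ReLU used for brevity; the choice of nonlinearity is an assumption

rng = np.random.default_rng(0)
N, D = 16, 8
tokens = rng.normal(size=(N + 1, D))                    # CLS token first, then N patch tokens
w, b = rng.normal(size=(2 * D, D)), np.zeros(D)
for out in (read_ignore(tokens), read_add(tokens), read_project(tokens, w, b)):
    print(out.shape)                                    # (16, 8) each time
```

All three return exactly N tokens, so the result can be reshaped to the H' × W' grid regardless of which option is chosen.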
5️⃣ Applications
- Semantic segmentation
- Depth estimation
- Other dense prediction tasks
6️⃣ Comparison
Model       Encoder   Decoder Type         Multi-scale   Global Context
---------   -------   ------------------   -----------   --------------
SETR        ViT       Naive / PUP / MLA    No            Yes
Segmenter   ViT       Mask Transformer     No            Yes
DPT         ViT       Multi-scale fusion   Yes           Yes
7️⃣ Mathematical Insight
CNN segmentation:
\[\text{Local receptive field}\]
Transformer segmentation:
\[\text{Global receptive field}\]
Self-attention complexity per layer, in the number of tokens:
\[\mathcal{O}(N^2)\]
Where:
\[N = \frac{HW}{P^2}\]
Thus smaller patches increase N, and the attention cost grows quadratically with it.
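To make the dependence on patch size explicit, substitute N into the complexity (a short worked step using only the symbols defined above):
\[N^2 = \left(\frac{HW}{P^2}\right)^2 = \frac{H^2 W^2}{P^4}\]
Halving the patch size, P → P/2, therefore gives:
\[N \rightarrow 4N, \qquad N^2 \rightarrow 16N^2\]
So each attention layer becomes roughly 16 times more expensive. This ignores the factor of D and the per-token MLP, whose cost grows only linearly in N.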
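🔥 Key Takeaways
- SETR showed that a pure ViT encoder works for segmentation.
- Segmenter introduced learnable class embeddings that are decoded into masks.
- DPT addressed the single-scale limitation by reassembling features at multiple resolutions.
- Patch size trades spatial resolution against compute.
- Self-attention gives every token global context without stacking convolutions.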