Transformer-based Semantic Segmentation

SETR · Segmenter · DPT: Complete Mathematical & Architectural Notes


1️⃣ SETR (SEgmentation TRansformer)

📄 Paper: https://arxiv.org/abs/2012.15840

🔷 Core Idea

One of the first works to apply a pure ViT encoder to semantic segmentation.

Instead of a CNN encoder + decoder, SETR uses:

\[\textbf{Image} \rightarrow \textbf{Patch Embedding} \rightarrow \textbf{Transformer Encoder} \rightarrow \textbf{Light Decoder}\]

🔷 Patch Embedding

Input:

\[X \in \mathbb{R}^{H \times W \times C}\]

Split into patches of size:

\[P \times P\]

Number of patches:

\[N = \frac{HW}{P^2}\]

Flatten each patch:

\[x_i \in \mathbb{R}^{P^2C}\]

Linear projection:

\[z_i = W_E x_i\]

Where:

\[W_E \in \mathbb{R}^{D \times (P^2C)}\]

Add positional encoding:

\[z_i^{(0)} = z_i + p_i\]
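The patch-embedding steps above can be sketched in NumPy. This is a minimal illustration, not SETR's actual implementation: the function name `patch_embed`, the toy sizes, and the random stand-ins for the learned weights `W_E` and positions `pos` are all assumptions made for the example.

```python
import numpy as np

def patch_embed(image, patch_size, W_E, pos):
    """Split an (H, W, C) image into P x P patches and project each to D dims."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    x = patches.reshape(-1, P * P * C)   # N flattened patches x_i
    z = x @ W_E.T                        # z_i = W_E x_i, shape (N, D)
    return z + pos                       # z_i^(0) = z_i + p_i

rng = np.random.default_rng(0)
H, W, C, P, D = 64, 64, 3, 16, 32        # toy sizes, chosen only for illustration
N = (H * W) // (P * P)                   # N = HW / P^2 = 16
image = rng.standard_normal((H, W, C))
W_E = rng.standard_normal((D, P * P * C)) * 0.02   # random stand-in for learned weights
pos = rng.standard_normal((N, D)) * 0.02           # random stand-in for learned p_i
tokens = patch_embed(image, P, W_E, pos)
print(tokens.shape)  # (16, 32)
```

In practice the same operation is often written as a convolution with kernel size and stride both equal to P, which is mathematically equivalent to flattening each patch and applying one shared linear layer.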

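🔷 Transformer Encoder

For layer ℓ, the token matrix Z passes through self-attention and then an MLP, each with a residual connection and layer normalization.

Self-attention:

\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{D}}\right)V\]

With:

\[Q = ZW_Q,\quad K = ZW_K,\quad V = ZW_V\]

The scaling by √D is written for a single head of width D; with h heads, each head works in D/h dimensions and scales by √(D/h).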

Stack L layers:

\[Z_L \in \mathbb{R}^{N \times D}\]

Each token now carries globally contextualized information, because every token attends to every other token in each layer.
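A single-head version of the attention formula can be written in a few lines of NumPy. This is a sketch under simplifying assumptions: one head, no layer norm, no MLP, no residual connections, and random matrices standing in for the learned `W_Q`, `W_K`, `W_V`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention over N tokens."""
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N): every token scored against every token
    A = softmax(scores, axis=-1)              # each row is a distribution over all N tokens
    return A @ V, A

rng = np.random.default_rng(0)
N, D = 16, 32                                  # toy sizes
Z = rng.standard_normal((N, D))
W_Q, W_K, W_V = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out, A = self_attention(Z, W_Q, W_K, W_V)
print(out.shape, A.shape)  # (16, 32) (16, 16)
```

The (N, N) attention matrix is what gives each output token access to the whole image, and it is also the source of the quadratic cost in N.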


2️⃣ SETR Decoders

🟢 Naive Decoder

  • 1×1 Conv
  • Bilinear upsampling
  • Pixel-wise cross-entropy
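The three steps of the naive decoder can be sketched as follows. Two simplifications are assumed here and are not part of SETR: nearest-neighbour upsampling stands in for bilinear interpolation to keep the code short, and a single linear map stands in for the 1×1 convolution (the two are equivalent per location). The names `naive_decoder`, `pixel_cross_entropy`, and `W_cls` are hypothetical.

```python
import numpy as np

def naive_decoder(Z, grid_hw, W_cls, scale):
    """Tokens (N, D) -> per-pixel class logits (h*scale, w*scale, K)."""
    h, w = grid_hw
    F = Z.reshape(h, w, -1)        # tokens back to an (h, w, D) grid, row-major
    logits = F @ W_cls             # 1x1 conv == the same linear map at every location
    # Nearest-neighbour upsampling, used here as a stand-in for bilinear
    return logits.repeat(scale, axis=0).repeat(scale, axis=1)

def pixel_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels; labels holds integer class ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    H, W = labels.shape
    picked = log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    return -picked.mean()

rng = np.random.default_rng(0)
h, w, D, K, P = 4, 4, 32, 5, 16                 # toy sizes: 4x4 tokens, 5 classes
Z = rng.standard_normal((h * w, D))
W_cls = rng.standard_normal((D, K)) * 0.1       # random stand-in for learned weights
logits = naive_decoder(Z, (h, w), W_cls, scale=P)
labels = rng.integers(0, K, size=(h * P, w * P))
loss = pixel_cross_entropy(logits, labels)
print(logits.shape)  # (64, 64, 5)
```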

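🔵 Progressive Upsampling (PUP)

Alternating:

Conv → Upsample → Conv → Upsample

Resolution is restored gradually: each upsampling step is limited to 2×, so four steps take the H/16 × W/16 token grid back to full resolution.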
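The alternating pattern can be traced with a small shape-level sketch. It assumes a 1/16-resolution token grid (P = 16), four 2× stages, a plain linear map in place of each conv block, and nearest-neighbour upsampling in place of bilinear; the channel widths in `dims` are arbitrary illustration values.

```python
import numpy as np

def pup_decoder(F, weights):
    """Alternate channel mixing and 2x upsampling; return output and per-stage shapes."""
    shapes = []
    for i, W in enumerate(weights):
        F = F @ W                                   # 1x1 conv (stand-in for the conv block)
        if i < len(weights) - 1:
            F = np.maximum(F, 0.0)                  # ReLU on hidden stages only
        F = F.repeat(2, axis=0).repeat(2, axis=1)   # 2x nearest-neighbour upsample
        shapes.append(F.shape)
    return F, shapes

rng = np.random.default_rng(0)
h, w, D, K = 4, 4, 32, 5                 # 4x4 token grid, i.e. a 64x64 image with P = 16
F0 = rng.standard_normal((h, w, D))
dims = [D, 16, 16, 16, K]                # hypothetical channel widths per stage
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
logits, shapes = pup_decoder(F0, weights)
print(shapes)  # [(8, 8, 16), (16, 16, 16), (32, 32, 16), (64, 64, 5)]
```

Compared with a single 16× jump, the small steps give the intermediate convolutions a chance to refine the features at each resolution.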

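🔴 Multi-Level Aggregation (MLA)

Use features from multiple encoder layers (shown for a 24-layer ViT):

\[Z^{(6)}, Z^{(12)}, Z^{(18)}, Z^{(24)}\]

Since every layer outputs tokens of the same size (N × D), they can be fused by element-wise addition without any resizing:

\[Z_{agg} = \sum_{i \in \{6, 12, 18, 24\}} Z^{(i)}\]

Then upsample.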
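The aggregation step reduces to a sum over same-shaped arrays. The sketch below covers only that summation, the reshape, and a 4× nearest-neighbour upsample; the random `layer_outputs` are stand-ins for real encoder activations, and any per-level convolutions a full decoder would apply are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, h, w = 16, 32, 4, 4                        # toy sizes: 4x4 token grid

# Hypothetical token outputs taken from layers 6, 12, 18 and 24
layer_outputs = {l: rng.standard_normal((N, D)) for l in (6, 12, 18, 24)}

# All levels share the shape (N, D), so they can be added directly
Z_agg = sum(layer_outputs.values())

F = Z_agg.reshape(h, w, D)                       # tokens back to a spatial grid
up = F.repeat(4, axis=0).repeat(4, axis=1)       # 4x nearest-neighbour upsample
print(Z_agg.shape, up.shape)  # (16, 32) (16, 16, 32)
```

A CNN feature pyramid would need resizing before such a sum, because its levels have different resolutions; a plain ViT has no such mismatch.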

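Best performance (ADE20K, as reported in the paper): MLA > PUP > Naive. The ranking is dataset-dependent; on Cityscapes the paper reports PUP as the strongest variant.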


3️⃣ Segmenter

📄 Paper: https://arxiv.org/abs/2105.05633

🔷 Encoder

Standard ViT:

\[Z_L \in \mathbb{R}^{N \times D}\]

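🔷 Mask Transformer Decoder

Introduce K learnable class embeddings, one per class (K here is the number of classes, not the key matrix):

\[C \in \mathbb{R}^{K \times D}\]

Append to patch tokens:

\[Z' = [Z_L ; C] \in \mathbb{R}^{(N+K) \times D}\]

Run a second Transformer (the mask decoder) on the joint sequence, so that patch tokens and class embeddings attend to each other:

\[[\hat{Z} ; \hat{C}] = \text{Decoder}(Z')\]

Compute mask logits via dot-product between the decoder outputs:

\[M = \hat{Z} \hat{C}^T\]

Where:

\[M \in \mathbb{R}^{N \times K}\]

Reshape to 2D (H/P × W/P × K) and upsample to H × W.

Apply softmax over classes.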
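The mask computation itself is a single matrix product followed by a reshape, upsample, and softmax. The sketch below takes patch tokens and class embeddings as given (the Transformer that produces them is omitted), uses random stand-ins for both, and swaps in nearest-neighbour upsampling for brevity; `segmenter_masks` is a hypothetical name.

```python
import numpy as np

def segmenter_masks(patch_tokens, class_emb, grid_hw, scale):
    """Dot-product masks: (N, D) x (K, D) -> per-pixel class probabilities."""
    M = patch_tokens @ class_emb.T                 # (N, K): one score per patch and class
    h, w = grid_hw
    M2d = M.reshape(h, w, -1)                      # reshape to a 2D grid of class scores
    M_up = M2d.repeat(scale, axis=0).repeat(scale, axis=1)   # stand-in for bilinear
    e = np.exp(M_up - M_up.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # softmax over the class axis

rng = np.random.default_rng(0)
h, w, D, K, P = 4, 4, 32, 5, 16                    # toy sizes
patch_tokens = rng.standard_normal((h * w, D))     # stand-in for decoder patch outputs
class_emb = rng.standard_normal((K, D))            # stand-in for decoder class outputs
probs = segmenter_masks(patch_tokens, class_emb, (h, w), scale=P)
pred = probs.argmax(axis=-1)                       # predicted class id per pixel
print(probs.shape, pred.shape)  # (64, 64, 5) (64, 64)
```

Because the class embeddings are ordinary vectors in the same space as the patch tokens, changing the number of classes only changes the number of rows in `class_emb`.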


🔎 Impact of Patch Size

Smaller patch size:

  • Better spatial precision
  • Higher memory
  • Higher compute

Trade-off:

| Patch Size | Precision | Compute  |
|------------|-----------|----------|
| 32×32      | Low       | Fast     |
| 16×16      | Medium    | Balanced |
| 8×8        | High      | Heavy    |


4️⃣ DPT (Dense Prediction Transformer)

📄 Paper: https://arxiv.org/abs/2103.13413

🔷 Core Difference

Instead of decoding from single-scale tokens, DPT reassembles tokens from several encoder layers into feature maps at multiple resolutions.


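🔷 Reassemble Block

Transform token sequence:

\[Z \in \mathbb{R}^{N \times D}\]

Back to spatial grid (with H' = H/P and W' = W/P, so that N = H'W'):

\[Z \rightarrow F \in \mathbb{R}^{H' \times W' \times D}\]

Project with a 1×1 convolution, then resample to the target scale: a strided convolution to downsample, a transposed convolution to upsample.

The resulting feature maps are fused progressively from coarse to fine, similar to FPN.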
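A shape-level sketch of the reassemble step is shown below. It is not DPT's implementation: a linear map stands in for the 1×1 convolution, `np.repeat` stands in for a learned upsampling layer, and strided slicing stands in for a strided convolution. The scale factors (4, 2, 1, 0.5 relative to the token grid) and the name `reassemble` are illustrative assumptions.

```python
import numpy as np

def reassemble(Z, grid_hw, W_proj, factor):
    """Tokens (N, D) -> feature map at `factor` times the token-grid resolution."""
    h, w = grid_hw
    F = Z.reshape(h, w, -1) @ W_proj                   # spatial grid + 1x1 projection
    if factor >= 1:
        f = int(factor)
        return F.repeat(f, axis=0).repeat(f, axis=1)   # upsample (stand-in for a learned layer)
    step = int(round(1 / factor))
    return F[::step, ::step]                           # downsample (stand-in for strided conv)

rng = np.random.default_rng(0)
h, w, D, D_out = 8, 8, 32, 16                          # toy sizes: 8x8 token grid
W_proj = rng.standard_normal((D, D_out)) * 0.1         # random stand-in for learned weights

# One hypothetical token matrix per tapped encoder layer, each sent to its own scale
pyramid = [
    reassemble(rng.standard_normal((h * w, D)), (h, w), W_proj, factor)
    for factor in (4, 2, 1, 0.5)
]
print([f.shape for f in pyramid])
# [(32, 32, 16), (16, 16, 16), (8, 8, 16), (4, 4, 16)]
```

The result has the same layout as a CNN feature pyramid, which is what allows a convolutional fusion decoder to be attached to a ViT backbone.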


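🔷 CLS Token Handling

The CLS (readout) token has no spatial position, so it must be handled before the tokens are reshaped into a grid. Options:

  • Ignore: drop it
  • Add: add it to every patch token
  • Project: concatenate it to every patch token, then project back to D dimensions via an MLP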
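The three options can be written as small functions over a token matrix whose first row is the CLS token. The "project" variant is interpreted here as concatenating CLS to each patch token before a linear layer with GELU; that reading, the function names, and the random weights are assumptions made for this sketch.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def read_ignore(tokens):
    """Drop the CLS token: (N+1, D) -> (N, D)."""
    return tokens[1:]

def read_add(tokens):
    """Add the CLS token to every patch token."""
    return tokens[1:] + tokens[0]

def read_project(tokens, W, b):
    """Concatenate CLS to every patch token, then project 2D -> D."""
    cls = np.broadcast_to(tokens[0], tokens[1:].shape)
    cat = np.concatenate([tokens[1:], cls], axis=-1)   # (N, 2D)
    return gelu(cat @ W + b)                            # (N, D)

rng = np.random.default_rng(0)
N, D = 16, 32                                           # toy sizes
tokens = rng.standard_normal((N + 1, D))                # row 0 is the CLS token
W = rng.standard_normal((2 * D, D)) * 0.1               # random stand-in for learned weights
b = np.zeros(D)

for out in (read_ignore(tokens), read_add(tokens), read_project(tokens, W, b)):
    print(out.shape)  # (16, 32) each time
```

All three return exactly N tokens, so whichever option is chosen, the result can be reshaped into the H' × W' grid.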

5️⃣ Applications

✔ Semantic segmentation
✔ Depth estimation
✔ Dense prediction tasks


6️⃣ Comparison

| Model     | Encoder | Decoder Type       | Multi-scale | Global Context |
|-----------|---------|--------------------|-------------|----------------|
| SETR      | ViT     | Naive / PUP / MLA  | ❌          | ✅             |
| Segmenter | ViT     | Mask Transformer   | ❌          | ✅             |
| DPT       | ViT     | Multi-scale fusion | ✅          | ✅             |


7️⃣ Mathematical Insight

CNN segmentation:

\[\text{Local receptive field}\]

Transformer segmentation:

\[\text{Global receptive field}\]

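Self-attention complexity per layer:

\[\mathcal{O}(N^2 D)\]

Where:

\[N = \frac{HW}{P^2}\]

Thus smaller patches → rapid growth in cost: N scales as 1/P², so the N² term scales as 1/P⁴. Halving P multiplies the attention cost by roughly 16.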
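To put numbers on this, the snippet below counts tokens and pairwise attention scores for one image at the three patch sizes discussed earlier. The 512×512 input size is an arbitrary choice for illustration, and the count ignores the factor D, the number of heads, and the number of layers.

```python
def attention_cost(H, W, P):
    """Return (N, N^2): token count and number of pairwise attention scores."""
    N = (H * W) // (P * P)
    return N, N * N

for P in (32, 16, 8):
    N, pairs = attention_cost(512, 512, P)
    print(P, N, pairs)
# 32 256 65536
# 16 1024 1048576
# 8 4096 16777216
```

Going from 16×16 to 8×8 patches quadruples the token count and makes the attention matrix 16 times larger, per head and per layer.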


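🔥 Key Takeaways

  1. SETR showed that a pure ViT encoder works for segmentation.
  2. Segmenter introduced learnable class embeddings that are decoded into masks.
  3. DPT addressed the single-scale limitation by reassembling tokens at multiple resolutions.
  4. Patch size trades spatial resolution against compute.
  5. Self-attention provides global context in every layer, without relying on stacked convolutions.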
This post is licensed under CC BY 4.0 by the author.