Post

Vision Transformer

Vision Transformer

Vision Transformer (ViT)

Patch Embedding

Given image:

\[X \in \mathbb{R}^{H \times W \times 3}\]

Patch size:

\[P \times P\]

Number of patches:

\[N = \frac{HW}{P^2}\]

Flatten each patch:

\[x_i \in \mathbb{R}^{P^2 \cdot 3}\]

Linear projection:

\[z_i = E x_i\]

where

\[E \in \mathbb{R}^{D \times (P^2 \cdot 3)}\]

Add class token and position encoding:

\[Z_0 = [z_{cls}, z_1, ..., z_N] + P_{pos}\]

Self-Attention

Given

\[Z \in \mathbb{R}^{(N+1) \times D}\]

Compute

\[Q = ZW_Q\] \[K = ZW_K\] \[V = ZW_V\]

Attention:

\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^T}{\sqrt{D}} \right) V\]

Transformer block:

\[Z' = Z + \mathrm{MHA}(\mathrm{LN}(Z))\] \[Z'' = Z' + \mathrm{MLP}(\mathrm{LN}(Z'))\]

DeiT

Add distillation token:

\[[z_{cls}, z_{dist}, z_1, ..., z_N]\]

Two heads:

  • Classification head
  • Distillation head

Soft-Label Distillation

Teacher output:

\[p_T = \mathrm{softmax} \left( \frac{z_T}{T} \right)\]

Student output:

\[p_S = \mathrm{softmax} \left( \frac{z_S}{T} \right)\]

Distillation loss:

\[\mathcal{L}_{distill} = \mathrm{KL}(p_T || p_S)\]

Full loss:

\[\mathcal{L} = (1-\alpha)\mathcal{L}_{CE} + \alpha T^2 \mathcal{L}_{distill}\]

Hard-Label Distillation

Teacher label:

\[y_T = \arg\max p_T\]

Distillation loss:

\[\mathcal{L}_{distill} = \mathrm{CE}(y_T, p_S)\]

Full loss:

\[\mathcal{L} = (1-\alpha)\mathcal{L}_{CE} + \alpha \mathcal{L}_{distill}\]

Swin Transformer

Hierarchical Structure

Before merging:

\[H \times W \times C\]

After patch merging:

\[\frac{H}{2} \times \frac{W}{2} \times 2C\]

Window Attention

Window size:

\[M \times M\]

Complexity:

\[O\left(\frac{HW}{M^2} \cdot M^2\right) = O(HW)\]

Relative Position Bias

Attention score:

\[A_{ij} = \frac{q_i k_j^T}{\sqrt{d}} + B_{(i-j)}\]

Convolutional Vision Transformer (CvT)

Convolutional embedding:

\[z = \mathrm{Conv}(X)\]

Convolutional Q, K, V:

\[Q = \mathrm{Conv}(Z)\] \[K = \mathrm{Conv}(Z)\] \[V = \mathrm{Conv}(Z)\]
This post is licensed under CC BY 4.0 by the author.