Video Transformers

This note covers three important families of transformer-based video models:

1) ViViT (Video Vision Transformer) — Model 1 to Model 4
2) TimeSFormer (Time-Space Transformer)
3) Multiscale Vision Transformers (MViT / MViTv2 style ideas)

Throughout, I will avoid fragile inline LaTeX and use block equations only.


0) Problem Setup and Notation

A video clip is a tensor:

\[X \in \mathbb{R}^{T \times H \times W \times C}\]

Commonly:

  • C = 3 for RGB
  • T = number of frames
  • H, W = spatial resolution

Patch size:

\[P \times P\]

Spatial patches per frame:

\[N = \frac{H W}{P^2}\]

Tokens per clip (if we tokenize each frame into patches):

\[L = T \cdot N\]

Transformer hidden dimension:

\[D\]

We represent tokens as a sequence:

\[Z \in \mathbb{R}^{L \times D}\]

Standard attention (single head form):

\[Q = Z W_Q\]

\[K = Z W_K\]

\[V = Z W_V\]

\[\mathrm{Attn}(Z) = \mathrm{softmax} \left( \frac{Q K^T}{\sqrt{D}} \right) V\]

Compute and memory are dominated by the L x L score matrix, which grows as:

\[\mathcal{O}(L^2)\]

So how we define L and how we factor attention matters a lot.
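The attention formula above can be sketched in NumPy as follows. This is a minimal illustration, assuming random weights and toy sizes; the function name `single_head_attention` is my own and not taken from any library.

```python
import numpy as np

def single_head_attention(Z, W_Q, W_K, W_V):
    """Single-head attention over a token sequence Z of shape (L, D)."""
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                          # (L, L): the quadratic term
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
L, D = 8, 4  # toy sizes, chosen only for illustration
Z = rng.standard_normal((L, D))
W_Q, W_K, W_V = (rng.standard_normal((D, D)) for _ in range(3))
out, weights = single_head_attention(Z, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (8, 4) (8, 8)
```

The `weights` array is the L x L matrix whose size drives the quadratic cost; everything else in the function is linear in L.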


1) ViViT (Video Vision Transformer): Models 1–4

ViViT is a design space for adapting ViT to video.

The core challenge:

  • Video tokens explode: L = T*N can be huge
  • Full attention over L tokens is expensive
  • Need to model both appearance and motion / temporal dynamics

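Below, “Model 1–4” are best understood as different factorization and aggregation strategies. The grouping is this note's own: the original ViViT paper names its four models spatio-temporal attention, factorised encoder, factorised self-attention, and factorised dot-product attention, and treats tubelet embedding as a tokenization option rather than a separate model.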


1.1 ViViT Tokenization (shared start)

Per-frame patch embedding:

Each frame at time t is:

\[X_t \in \mathbb{R}^{H \times W \times C}\]

Split into N patches, flatten each patch:

\[x_{t,i} \in \mathbb{R}^{P^2 \cdot C}\]

Linear embedding:

\[z_{t,i} = E x_{t,i}\]

Collect all tokens:

\[Z = \{ z_{t,i} \}\]

Video positional encoding usually has two components:

  • Spatial position within a frame
  • Temporal position across frames

A common additive form:

\[z_{t,i} \leftarrow z_{t,i} + p^{space}_i + p^{time}_t\]
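As a rough sketch of this tokenization step, the NumPy code below patchifies a clip, applies a linear embedding, and adds the two positional terms. The helper name `patchify`, the random stand-ins for learned parameters, and the toy sizes are all assumptions for illustration. The code uses the row-vector convention (x E rather than E x), which is equivalent up to a transpose.

```python
import numpy as np

def patchify(X, P):
    """Split a clip X of shape (T, H, W, C) into flattened P x P patches.

    Returns an array of shape (T, N, P*P*C) with N = (H/P) * (W/P).
    """
    T, H, W, C = X.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    X = X.reshape(T, H // P, P, W // P, P, C)
    X = X.transpose(0, 1, 3, 2, 4, 5)           # (T, H/P, W/P, P, P, C)
    return X.reshape(T, (H // P) * (W // P), P * P * C)

rng = np.random.default_rng(0)
T, H, W, C, P, D = 4, 32, 32, 3, 16, 8          # toy sizes for illustration
X = rng.standard_normal((T, H, W, C))

patches = patchify(X, P)                         # (T, N, P*P*C)
N = patches.shape[1]
E = rng.standard_normal((P * P * C, D)) * 0.02   # stand-in for a learned embedding
p_space = rng.standard_normal((N, D)) * 0.02     # one vector per spatial index i
p_time = rng.standard_normal((T, D)) * 0.02      # one vector per frame index t

# z[t, i] = x[t, i] E + p_space[i] + p_time[t], via broadcasting
z = patches @ E + p_space[None, :, :] + p_time[:, None, :]
Z = z.reshape(T * N, D)                          # sequence of length L = T * N
print(patches.shape, Z.shape)  # (4, 4, 768) (16, 8)
```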

1.2 ViViT Model 1: Spatio-Temporal Joint Attention (Full Attention)

Idea

Treat all tokens across space and time as one long sequence and apply standard transformer blocks.

Sequence length:

\[L = T \cdot N\]

Attention complexity:

\[\mathcal{O}\left( (T N)^2 \right)\]

Why it works

  • Directly models arbitrary interactions: any patch can attend to any other patch in any frame
  • Rich temporal-spatial reasoning capacity

Why it is expensive

If H=W=224, P=16:

\[N = \frac{224 \cdot 224}{16^2} = 196\]

If T=32:

\[L = 32 \cdot 196 = 6272\]

Then attention matrix is:

\[6272 \times 6272\]

That is about 39 million attention scores per head, per layer, which is expensive in both memory and compute.
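The arithmetic above is easy to reproduce. The helper below is a small sketch (the name `joint_attention_size` and the float32 assumption of 4 bytes per score are mine) that counts tokens and the size of a single score matrix.

```python
def joint_attention_size(T, H, W, P, bytes_per_entry=4):
    """Token count and score-matrix size for full space-time attention."""
    N = (H // P) * (W // P)      # spatial patches per frame
    L = T * N                    # tokens per clip
    entries = L * L              # one L x L score matrix (per head, per layer)
    return N, L, entries, entries * bytes_per_entry

N, L, entries, nbytes = joint_attention_size(T=32, H=224, W=224, P=16)
print(N, L, entries, round(nbytes / 1e6, 1))  # 196 6272 39337984 157.4
```

The last number is megabytes for one score matrix in float32, before multiplying by the number of heads and layers or accounting for activations kept for the backward pass.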

When Model 1 makes sense

  • Short clips
  • Smaller resolution
  • Large compute budget
  • You want maximum flexibility and can afford it

1.3 ViViT Model 2: Factorized Space-Time Attention (Two-Stage Attention)

Idea

Replace joint attention with a factorization:

1) Spatial attention within each frame (independent across t)
2) Temporal attention across frames (for corresponding spatial tokens or for aggregated per-frame tokens)

This reduces cost by avoiding a full (TN) x (TN) matrix.

Step A: Spatial attention per frame

For each t, run attention over N tokens:

\[Z_t \in \mathbb{R}^{N \times D}\]

Cost per frame:

\[\mathcal{O}(N^2)\]

For T frames:

\[\mathcal{O}(T \cdot N^2)\]

Step B: Temporal attention

There are two common flavors.

Flavor B1: Token-wise temporal attention

For each spatial index i, attend across time:

\[Z_{:,i} \in \mathbb{R}^{T \times D}\]

Cost per spatial location:

\[\mathcal{O}(T^2)\]

For all N spatial positions:

\[\mathcal{O}(N \cdot T^2)\]

Total Model 2 cost:

\[\mathcal{O}(T N^2 + N T^2)\]

Compare to Model 1:

\[\mathcal{O}(T^2 N^2)\]

Model 2 is cheaper by a factor of TN / (T + N), so the gap widens as either T or N grows.
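To put numbers on the comparison, here is a small sketch that counts pairwise attention scores for both designs. The function names are my own, and the counts ignore the hidden dimension, heads, and layers, as the formulas above do.

```python
def joint_cost(T, N):
    """Pairwise scores for full space-time attention: (T*N)^2."""
    return (T * N) ** 2

def factorized_cost(T, N):
    """Pairwise scores for spatial-then-temporal attention: T*N^2 + N*T^2."""
    return T * N**2 + N * T**2

T, N = 32, 196
ratio = joint_cost(T, N) / factorized_cost(T, N)
print(joint_cost(T, N), factorized_cost(T, N), round(ratio, 1))
# 39337984 1430016 27.5
```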

Flavor B2: Temporal attention on per-frame pooled tokens

Instead of per-location temporal attention, summarize each frame with a token and attend over T summary tokens.

Per-frame summary token (class token or pooling):

\[s_t = \mathrm{Pool}(Z_t)\]

Sequence:

\[S \in \mathbb{R}^{T \times D}\]

Temporal attention cost:

\[\mathcal{O}(T^2)\]

Total is then mostly spatial:

\[\mathcal{O}(T N^2 + T^2)\]
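A minimal NumPy sketch of Step A followed by Flavor B2 is shown below. It assumes single-head attention, mean pooling as the choice of Pool, and random weights; a real model would add residual connections, LayerNorm, MLPs, and multiple layers.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Z, W_Q, W_K, W_V):
    """Single-head attention over the second-to-last axis of Z (..., L, D)."""
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, N, D = 4, 6, 8                                 # toy sizes for illustration
Z = rng.standard_normal((T, N, D))
spatial_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(3)]
temporal_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(3)]

# Step A: spatial attention within each frame; T acts as a batch axis.
Z_spatial = attention(Z, *spatial_w)              # (T, N, D), T score matrices of N x N

# Step B2: pool each frame to one summary token, then attend over T tokens.
S = Z_spatial.mean(axis=1)                        # (T, D); mean pooling is one choice of Pool
S_out = attention(S, *temporal_w)                 # (T, D), one score matrix of T x T
clip_repr = S_out.mean(axis=0)                    # (D,) clip-level representation
print(Z_spatial.shape, S_out.shape, clip_repr.shape)  # (4, 6, 8) (4, 8) (8,)
```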

Pros

  • Much cheaper than Model 1
  • Still captures temporal dynamics

Cons

  • Factorization imposes structural constraints
  • Some interactions (like patch i at time t attending to patch j at time t’) are only indirectly modeled

1.4 ViViT Model 3: Factorized Encoder with Temporal Tokenization / Pooling (Aggressive Temporal Compression)

Idea

Reduce temporal length early.

Common strategies:

1) Temporal pooling or striding on frame embeddings
2) Tubelet embedding (3D patch) to reduce tokens
3) Use temporal attention only on a small set of tokens

Tubelet embedding (common in video ViT designs)

Instead of 2D patches, use a 3D “tubelet” spanning \tau consecutive frames:

Tubelet size:

\[\tau \times P \times P\]

Now temporal axis is downsampled by factor \tau:

New temporal length:

\[T' = \frac{T}{\tau}\]

Tokens per clip:

\[L' = T' \cdot N\]

Attention cost becomes:

\[\mathcal{O}\left( (T' N)^2 \right)\]

which is smaller than the Model 1 cost by a factor of \tau^2.
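The tubelet split can be sketched with a reshape and a transpose. This assumes non-overlapping tubelets and dimensions that divide evenly; the function name `tubelets` is hypothetical, and the linear embedding that would follow is omitted.

```python
import numpy as np

def tubelets(X, tau, P):
    """Split X of shape (T, H, W, C) into non-overlapping tau x P x P tubelets.

    Returns shape (T/tau, N, tau*P*P*C) with N = (H/P) * (W/P).
    """
    T, H, W, C = X.shape
    assert T % tau == 0 and H % P == 0 and W % P == 0
    X = X.reshape(T // tau, tau, H // P, P, W // P, P, C)
    X = X.transpose(0, 2, 4, 1, 3, 5, 6)    # (T', H/P, W/P, tau, P, P, C)
    return X.reshape(T // tau, (H // P) * (W // P), tau * P * P * C)

X = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens = tubelets(X, tau=2, P=16)
T_prime, N, dim = tokens.shape
print(T_prime, N, T_prime * N, dim)  # 16 196 3136 1536
```

With \tau = 2 the sequence length drops from 6272 to 3136, so the joint attention matrix has 4 times fewer entries, while each token now carries twice as many raw pixel values.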

Pros

  • Significant compute reduction
  • Strong for longer clips

Cons

  • If \tau is large, you may lose fine motion details
  • Temporal aliasing can occur if content changes quickly between frames

1.5 ViViT Model 4: Factorized Attention + Token Pooling / Hierarchical Video Transformer

Idea

Build a hierarchical representation across layers (like CNN pyramids) so the model processes:

  • High resolution, many tokens early
  • Lower resolution, fewer tokens later

This is aligned with the “multiscale” philosophy (also used in MViT).

Common mechanisms:

1) Patch merging or pooling in space
2) Temporal pooling / merging across frames
3) Attention operating on progressively shorter sequences

Patch merging (concept)

Merge 2x2 spatial neighboring tokens into one:

Each spatial dimension shrinks by a factor of 2, while the channel dimension usually grows (or is changed by a projection).

If the N spatial tokens form an (H/P) by (W/P) grid, then after merging:

\[N' = \frac{N}{4}\]

Similarly, temporal merging could reduce T:

\[T' = \frac{T}{2}\]

Total tokens reduce:

\[L' = T' \cdot N'\]

Why this matters

Attention cost scales with L^2, so token reduction compounds: the two reductions above shrink L by a factor of 8 and the attention cost by a factor of 64.
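One way to realize the merging steps is sketched below in NumPy. Concatenating the channels of each 2x2 block and averaging adjacent frame pairs are assumptions chosen for simplicity; other merging rules (strided convolution, max pooling) are equally plausible, and the function names are hypothetical.

```python
import numpy as np

def merge_2x2(Z, grid_h, grid_w):
    """Merge each 2x2 block of spatial tokens by concatenating their channels.

    Z has shape (T, grid_h * grid_w, D); the result has shape
    (T, grid_h * grid_w / 4, 4 * D). A learned projection (omitted here)
    could then map 4*D down to a smaller width.
    """
    T, N, D = Z.shape
    assert N == grid_h * grid_w and grid_h % 2 == 0 and grid_w % 2 == 0
    Z = Z.reshape(T, grid_h // 2, 2, grid_w // 2, 2, D)
    Z = Z.transpose(0, 1, 3, 2, 4, 5)        # (T, h/2, w/2, 2, 2, D)
    return Z.reshape(T, (grid_h // 2) * (grid_w // 2), 4 * D)

def merge_time(Z):
    """Halve T by averaging adjacent frame pairs (one simple choice)."""
    T, N, D = Z.shape
    assert T % 2 == 0
    return Z.reshape(T // 2, 2, N, D).mean(axis=1)

Z = np.zeros((32, 196, 96))                   # T=32, N=14*14, D=96 (D is arbitrary)
Z2 = merge_time(merge_2x2(Z, 14, 14))
L, L2 = 32 * 196, Z2.shape[0] * Z2.shape[1]
print(Z2.shape, L // L2, (L * L) // (L2 * L2))  # (16, 49, 384) 8 64
```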

Pros

  • Scales to longer clips and higher resolution
  • Learns coarse-to-fine temporal-spatial representations

Cons

  • Requires careful design of merging rules
  • Some fine details may be lost at late stages

2) TimeSFormer (Time-Space Transformer)

TimeSFormer is closely related to ViViT-style factorization and is usually explained with one simple idea:

  • Separate attention into temporal attention and spatial attention
  • Alternate or compose them in each layer

2.1 TimeSFormer Factorized Attention

Let tokens be indexed by (t, i) where:

  • t: time index
  • i: spatial patch index

Temporal attention (per spatial position)

For each i, attend over t:

\[Z_{:,i} \in \mathbb{R}^{T \times D}\]

Temporal attention output:

\[\hat{Z}_{:,i} = \mathrm{Attn}(Z_{:,i})\]

Spatial attention (per time step)

For each t, attend over i:

\[Z_{t,:} \in \mathbb{R}^{N \times D}\]

Spatial attention output:

\[\hat{Z}_{t,:} = \mathrm{Attn}(Z_{t,:})\]

Layer composition

A common pattern is:

1) Temporal attention
2) Spatial attention
3) MLP

Each with residual connections and LayerNorm.
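The layer composition above can be sketched as a single block in NumPy. Several details here are my assumptions rather than anything stated in this note: single-head attention, a pre-norm arrangement, LayerNorm without learned parameters, and ReLU in the MLP. The function name `divided_block` is hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Z, W):
    """Single-head attention over the second-to-last axis; W = (W_Q, W_K, W_V)."""
    Q, K, V = (Z @ w for w in W)
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def layer_norm(Z, eps=1e-5):
    """LayerNorm without learned scale and shift, for brevity."""
    mu = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def divided_block(Z, W_time, W_space, W1, W2):
    """Temporal attention, then spatial attention, then MLP; Z is (T, N, D)."""
    # Temporal: move N to the batch axis so each location attends over T.
    Zt = np.swapaxes(layer_norm(Z), 0, 1)             # (N, T, D)
    Z = Z + np.swapaxes(attention(Zt, W_time), 0, 1)
    # Spatial: T is already the batch axis, so each frame attends over N.
    Z = Z + attention(layer_norm(Z), W_space)
    # Position-wise MLP with ReLU (an assumption made for brevity).
    return Z + np.maximum(layer_norm(Z) @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
T, N, D = 4, 6, 8                                     # toy sizes for illustration
init = lambda *shape: rng.standard_normal(shape) * 0.1
W_time = [init(D, D) for _ in range(3)]
W_space = [init(D, D) for _ in range(3)]
out = divided_block(init(T, N, D), W_time, W_space, init(D, 4 * D), init(4 * D, D))
print(out.shape)  # (4, 6, 8)
```

The only difference between the temporal and spatial steps is which axis is treated as the batch, which is why the factorized design needs no new attention primitive.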


2.2 Complexity Comparison (Intuition)

Full joint attention:

\[\mathcal{O}(T^2 N^2)\]

Factorized temporal + spatial:

Temporal part:

\[\mathcal{O}(N T^2)\]

Spatial part:

\[\mathcal{O}(T N^2)\]

Total:

\[\mathcal{O}(N T^2 + T N^2)\]

This is the same big picture as ViViT Model 2, explained in a time-first manner.


2.3 Practical Strengths and Weaknesses

Strengths:

  • Much more scalable than full attention
  • Natural separation of motion modeling (temporal) and appearance (spatial)
  • Strong performance on action recognition benchmarks

Weaknesses:

  • Still can be heavy if T and N are both large
  • Temporal attention per location can overfit or be noisy if motion is subtle
  • Camera motion can complicate temporal attention patterns (background changes dominate)

3) Multiscale Vision Transformers (MViT-style)

Key idea:

  • Use hierarchical, multiscale token representations
  • Reduce tokens progressively (space and time)
  • Use attention with pooling / striding so the model scales like CNN pyramids

This is one of the most important directions for scalable video transformers.


3.1 Why Multiscale for Video?

Video has two scaling dimensions:

  • Spatial resolution H x W
  • Temporal resolution T

If you keep all tokens everywhere, attention explodes.

Multiscale strategy:

  • Early layers: high resolution, small receptive field
  • Later layers: lower resolution, larger receptive field (long-range context)

This mimics CNN inductive bias while staying transformer-based.


3.2 Pooling Attention (Concept)

Instead of attending with Q,K,V all at full length, we can downsample K and V:

Let original tokens length be L.

We compute:

\[Q \in \mathbb{R}^{L \times D}\]

but pool K and V to length L_p:

\[K_p \in \mathbb{R}^{L_p \times D}\]

\[V_p \in \mathbb{R}^{L_p \times D}\]

Now attention uses:

\[\mathrm{softmax} \left( \frac{Q K_p^T}{\sqrt{D}} \right) V_p\]

This reduces cost from:

\[\mathcal{O}(L^2)\]

to:

\[\mathcal{O}(L \cdot L_p)\]

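If L_p is much smaller than L, the savings are large: pooling with stride s gives L_p = L / s and an s-fold reduction in attention cost.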

Pooling can be applied:

  • spatial pooling
  • temporal pooling
  • both (tubelet pooling)
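Here is a minimal sketch of pooling attention in NumPy. For simplicity it average-pools K and V along the flattened sequence with a fixed stride; this is an assumption, since multiscale models typically pool over the 3D (time, height, width) token grid, often with learned or max pooling. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def avg_pool_tokens(X, stride):
    """Average-pool a token sequence X of shape (L, D) along L."""
    L, D = X.shape
    assert L % stride == 0, "L must be divisible by the stride"
    return X.reshape(L // stride, stride, D).mean(axis=1)

def pooled_attention(Z, W_Q, W_K, W_V, stride):
    """Attention with full-length queries and pooled keys and values."""
    Q = Z @ W_Q                                 # (L, D): queries stay at full length
    K_p = avg_pool_tokens(Z @ W_K, stride)      # (L_p, D)
    V_p = avg_pool_tokens(Z @ W_V, stride)      # (L_p, D)
    scores = Q @ K_p.T / np.sqrt(Q.shape[-1])   # (L, L_p) instead of (L, L)
    return softmax(scores) @ V_p, scores.shape

rng = np.random.default_rng(0)
L, D, stride = 16, 4, 4                         # toy sizes for illustration
Z = rng.standard_normal((L, D))
W_Q, W_K, W_V = (rng.standard_normal((D, D)) for _ in range(3))
out, score_shape = pooled_attention(Z, W_Q, W_K, W_V, stride)
print(out.shape, score_shape)  # (16, 4) (16, 4)
```

The output keeps all L tokens because only K and V are pooled; pooling Q as well is what shortens the sequence between stages.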

3.3 Hierarchical Stages (Video Pyramid)

A typical multiscale transformer uses stages like:

Stage 1:

  • high spatial tokens
  • high temporal tokens

Stage 2:

  • reduce spatial and/or temporal

Stage 3:

  • reduce more

Stage 4:

  • very coarse tokens, global reasoning

At each stage you can increase channel dimension D while reducing L.

This resembles ResNet-style stage scaling.
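To see how quickly the token count falls across stages, the sketch below tabulates a hypothetical four-stage schedule. The strides, the channel multiplier, and the starting sizes are illustrative assumptions, not values taken from any published configuration, and the `pairs` column counts unpooled L x L attention.

```python
def stage_schedule(T, N, D, num_stages=4, t_stride=1, n_stride=4, d_mult=2):
    """Token count and channel width per stage under a fixed per-stage reduction."""
    stages = []
    for s in range(1, num_stages + 1):
        L = T * N
        stages.append({"stage": s, "T": T, "N": N, "L": L, "D": D, "pairs": L * L})
        # Shrink the token grid and widen the channels for the next stage.
        T, N, D = max(T // t_stride, 1), max(N // n_stride, 1), D * d_mult
    return stages

schedule = stage_schedule(T=8, N=56 * 56, D=96)
for row in schedule:
    print(row)
```

In this example the sequence length goes 25088, 6272, 1568, 392 while the width goes 96, 192, 384, 768, so most of the attention cost sits in the first stage. That is where pooled keys and values help the most.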


3.4 Inductive Bias vs Flexibility

Multiscale introduces inductive bias:

  • locality (early)
  • hierarchy (progressive abstraction)

This is often beneficial for data efficiency and generalization, especially on moderate-scale datasets.

The tradeoff:

  • you restrict the model relative to full joint attention
  • but you gain scalability and often real-world robustness

4) Practical Design Choices Cheat Sheet

4.1 When to prefer which

  • Full spatio-temporal (ViViT Model 1):
    • best when T and N are small or compute is huge
    • most flexible interactions
  • Factorized attention (TimeSFormer / ViViT Model 2):
    • strong baseline
    • scales much better than full attention
    • good when you want a clean, interpretable design
  • Tubelet / temporal compression (ViViT Model 3):
    • best when clip length is long
    • watch out for motion detail loss
  • Hierarchical / multiscale (ViViT Model 4 / MViT):
    • best for scalability (high-res, long clips)
    • tends to generalize well
    • more engineering complexity

5) Summary Table

| Family | Core Mechanism | Token Length Control | Main Benefit | Main Risk |
|---|---|---|---|---|
| ViViT Model 1 | Joint space-time attention | None | Max flexibility | Very expensive |
| ViViT Model 2 | Factorized space then time | Attention factorization | Much cheaper | Restricted interactions |
| ViViT Model 3 | Tubelets / temporal compression | Reduce T early | Scales to long clips | Motion detail loss |
| ViViT Model 4 | Hierarchical pooling/merging | Reduce T and N over stages | Best scalability | Design complexity |
| TimeSFormer | Separate temporal + spatial attention | Factorization | Clean + scalable | Still heavy for big T, N |
| MViT | Multiscale + pooled attention | Hierarchy + pooling | CNN-like scaling | Needs careful pooling design |

This post is licensed under CC BY 4.0 by the author.