Video Transformers

Posted Feb 22, 2026

This note covers three important families of transformer-based video models:

1) ViViT (Video Vision Transformer): Model 1 to Model 4
2) TimeSformer (Time-Space Transformer)
3) Multiscale Vision Transformers (MViT / MViTv2 style ideas)

Throughout, I will avoid fragile inline LaTeX and use block equations only.

0) Problem Setup and Notation

A video clip is a tensor:

\[X \in \mathbb{R}^{T \times H \times W \times C}\]

Commonly:

C = 3 for RGB
T = number of frames
H, W = spatial resolution

Patch size:

\[P \times P\]

Spatial patches per frame:

\[N = \frac{H W}{P^2}\]

Tokens per clip (if we tokenize each frame into patches):

\[L = T \cdot N\]

Transformer hidden dimension:

\[D\]

We represent tokens as a sequence:

\[Z \in \mathbb{R}^{L \times D}\]

Standard attention (single head form):

\[Q = Z W_Q\]

\[K = Z W_K\]

\[V = Z W_V\]

\[\mathrm{Attn}(Z) = \mathrm{softmax} \left( \frac{Q K^T}{\sqrt{D}} \right) V\]

Compute and memory cost are dominated by the L x L score matrix, which grows quadratically in the number of tokens:

\[\mathcal{O}(L^2)\]

So how we define L and how we factor attention matters a lot.
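As a minimal sketch of the single-head formula above, here is a NumPy version. The toy sizes, the random projection matrices, and the helper names `softmax` and `single_head_attention` are assumptions made for illustration; this is not any particular library's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Z, W_q, W_k, W_v):
    """Z: (L, D) tokens; W_*: (D, D) projections. Returns (output, weights)."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    D = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(D))  # (L, L): the quadratic term
    return weights @ V, weights

rng = np.random.default_rng(0)
L, D = 8, 4  # toy sizes, chosen only for illustration
Z = rng.normal(size=(L, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
out, weights = single_head_attention(Z, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (8, 4) (8, 8)
```

The `weights` array is the L x L object whose size drives everything that follows.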

1) ViViT (Video Vision Transformer): Models 1–4

ViViT is a design space for adapting ViT to video.

The core challenge:

Video tokens explode: L = T*N can be huge
Full attention over L tokens is expensive
Need to model both appearance and motion / temporal dynamics

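Below, “Model 1–4” are best understood as different factorization and aggregation strategies. The numbering is this note's own grouping and does not match the original ViViT paper one-to-one: there, Model 2 is the factorised encoder (closest to Flavor B2 below), Model 3 is factorised self-attention (closest to Flavor B1), Model 4 is factorised dot-product attention, and tubelet embedding is a tokenization option rather than a separate model.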

1.1 ViViT Tokenization (shared start)

Per-frame patch embedding:

Each frame at time t is:

\[X_t \in \mathbb{R}^{H \times W \times C}\]

Split into N patches, flatten each patch:

\[x_{t,i} \in \mathbb{R}^{P^2 \cdot C}\]

Linear embedding:

\[z_{t,i} = E x_{t,i}\]

Collect all tokens:

\[Z = \{ z_{t,i} \}\]

Video positional encoding usually has two components:

Spatial position within a frame
Temporal position across frames

A common additive form:

\[z_{t,i} \leftarrow z_{t,i} + p^{space}_i + p^{time}_t\]
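The tokenization steps above can be sketched in NumPy as follows. The sizes are toy values, and `E`, `p_space`, and `p_time` are random stand-ins for parameters that would be learned in a real model. The code uses the row-vector convention `x @ E`, so `E` has shape (P²C, D).

```python
import numpy as np

def patchify(X, P):
    """X: (T, H, W, C) clip -> (T, N, P*P*C) flattened patches, N = H*W / P**2."""
    T, H, W, C = X.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    X = X.reshape(T, H // P, P, W // P, P, C)
    X = X.transpose(0, 1, 3, 2, 4, 5)          # (T, H/P, W/P, P, P, C)
    return X.reshape(T, (H // P) * (W // P), P * P * C)

rng = np.random.default_rng(0)
T, H, W, C, P, D = 4, 32, 32, 3, 16, 8         # toy sizes
X = rng.normal(size=(T, H, W, C))
patches = patchify(X, P)                       # (T, N, P^2 C)
N = patches.shape[1]

E = rng.normal(size=(P * P * C, D)) * 0.02     # stand-in for the learned embedding
p_space = rng.normal(size=(1, N, D)) * 0.02    # one vector per spatial index i
p_time = rng.normal(size=(T, 1, D)) * 0.02     # one vector per frame index t

Z = patches @ E + p_space + p_time             # broadcasts over t and i
print(Z.shape)  # (4, 4, 8)
```

Keeping `Z` in (T, N, D) layout rather than flattening it to (T·N, D) makes the factorized variants below a matter of choosing which axis attention runs over.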

1.2 ViViT Model 1: Spatio-Temporal Joint Attention (Full Attention)

Idea

Treat all tokens across space and time as one long sequence and apply standard transformer blocks.

Sequence length:

\[L = T \cdot N\]

Attention complexity:

\[\mathcal{O}\left( (T N)^2 \right)\]

Why it works

Directly models arbitrary interactions: any patch can attend to any other patch in any frame
Rich temporal-spatial reasoning capacity

Why it is expensive

If H=W=224, P=16:

\[N = \frac{224 \cdot 224}{16^2} = 196\]

If T=32:

\[L = 32 \cdot 196 = 6272\]

Then the attention matrix is:

\[6272 \times 6272\]

That is about 39.3 million scores per head per layer, which is expensive in both memory and compute.
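To put a number on it, here is a small calculation of the size of one attention matrix. The function name and the float32 assumption (4 bytes per entry) are mine; real memory use also depends on the number of heads and layers, the batch size, and whether the implementation materializes the matrix at all.

```python
def attention_matrix_stats(T, H, W, P, bytes_per_element=4):
    """Token count and size of one L x L attention matrix (one head, one layer)."""
    N = (H // P) * (W // P)
    L = T * N
    entries = L * L
    return N, L, entries, entries * bytes_per_element

N, L, entries, nbytes = attention_matrix_stats(T=32, H=224, W=224, P=16)
print(N, L, entries)                 # 196 6272 39337984
print(f"{nbytes / 2**20:.1f} MiB")   # 150.1 MiB in float32
```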

When Model 1 makes sense

Short clips
Smaller resolution
Large compute budget
You want maximum flexibility and can afford it

1.3 ViViT Model 2: Factorized Space-Time Attention (Two-Stage Attention)

Idea

Replace joint attention with a factorization:

1) Spatial attention within each frame (independent across t)
2) Temporal attention across frames (for corresponding spatial tokens or for aggregated per-frame tokens)

This reduces cost by avoiding a full (TN) x (TN) matrix.

Step A: Spatial attention per frame

For each t, run attention over N tokens:

\[Z_t \in \mathbb{R}^{N \times D}\]

Cost per frame:

\[\mathcal{O}(N^2)\]

For T frames:

\[\mathcal{O}(T \cdot N^2)\]

Step B: Temporal attention

There are two common flavors.

Flavor B1: Token-wise temporal attention

For each spatial index i, attend across time:

\[Z_{:,i} \in \mathbb{R}^{T \times D}\]

Cost per spatial location:

\[\mathcal{O}(T^2)\]

For all N spatial positions:

\[\mathcal{O}(N \cdot T^2)\]

Total Model 2 cost:

\[\mathcal{O}(T N^2 + N T^2)\]

Compare to Model 1:

\[\mathcal{O}(T^2 N^2)\]

Model 2 is dramatically cheaper when T and N are not tiny.
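As a worked example, take the numbers from Model 1 (T = 32, N = 196) and count only attention-score entries per layer:

\[T N^2 + N T^2 = 32 \cdot 196^2 + 196 \cdot 32^2 = 1{,}229{,}312 + 200{,}704 = 1{,}430{,}016\]

\[T^2 N^2 = 6272^2 = 39{,}337{,}984\]

\[\frac{T^2 N^2}{T N^2 + N T^2} = \frac{T N}{T + N} = \frac{6272}{228} \approx 27.5\]

So for this clip size the factorized form computes roughly 27 times fewer attention scores than joint attention.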

Flavor B2: Temporal attention on per-frame pooled tokens

Instead of per-location temporal attention, summarize each frame with a token and attend over T summary tokens.

Per-frame summary token (class token or pooling):

\[s_t = \mathrm{Pool}(Z_t)\]

Sequence:

\[S \in \mathbb{R}^{T \times D}\]

Temporal attention cost:

\[\mathcal{O}(T^2)\]

Total is then mostly spatial:

\[\mathcal{O}(T N^2 + T^2)\]
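A minimal sketch of Flavor B2, assuming mean pooling as the `Pool` operation and, for brevity, attention without learned projections. Both are simplifications chosen for illustration, and the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z):
    """Projection-free self-attention over the second-to-last axis of Z (..., L, D)."""
    D = Z.shape[-1]
    scores = Z @ np.swapaxes(Z, -1, -2) / np.sqrt(D)
    return softmax(scores) @ Z

rng = np.random.default_rng(0)
T, N, D = 6, 16, 8                  # toy sizes
Z = rng.normal(size=(T, N, D))

Z_spatial = self_attention(Z)       # T independent N x N attentions: O(T N^2)
S = Z_spatial.mean(axis=1)          # s_t = Pool(Z_t), here mean pooling -> (T, D)
S_temporal = self_attention(S)      # one T x T attention: O(T^2)
clip_feature = S_temporal.mean(axis=0)
print(S.shape, S_temporal.shape, clip_feature.shape)  # (6, 8) (6, 8) (8,)
```

Because the frame axis of `Z` acts as a batch axis in the first call, no token ever sees another frame until the pooled summaries are mixed in the second call.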

Pros

Much cheaper than Model 1
Still captures temporal dynamics

Cons

Factorization imposes structural constraints
Some interactions (like patch i at time t attending to patch j at time t’) are only indirectly modeled

1.4 ViViT Model 3: Factorized Encoder with Temporal Tokenization / Pooling (Aggressive Temporal Compression)

Idea

Reduce temporal length early.

Common strategies:

1) Temporal pooling or striding on frame embeddings
2) Tubelet embedding (3D patch) to reduce tokens
3) Use temporal attention only on a small set of tokens

Tubelet embedding (common in video ViT designs)

Instead of 2D patches, use a 3D “tubelet” spanning τ frames.

Tubelet size:

\[\tau \times P \times P\]

Now the temporal axis is downsampled by a factor of τ.

New temporal length:

\[T' = \frac{T}{\tau}\]

Tokens per clip:

\[L' = T' \cdot N\]

Attention cost becomes:

\[\mathcal{O}\left( (T' N)^2 \right)\]

which is smaller than the Model 1 cost by a factor of τ² (for example, 4 times smaller for τ = 2).
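Tubelet tokenization can be sketched as a reshape, assuming non-overlapping tubelets and dimensions that divide evenly. The function name is hypothetical, and a real model would follow this with a learned linear projection to D.

```python
import numpy as np

def tubelet_tokens(X, tau, P):
    """X: (T, H, W, C) -> (T/tau, N, tau*P*P*C) flattened tubelets."""
    T, H, W, C = X.shape
    assert T % tau == 0 and H % P == 0 and W % P == 0
    X = X.reshape(T // tau, tau, H // P, P, W // P, P, C)
    X = X.transpose(0, 2, 4, 1, 3, 5, 6)       # (T', H/P, W/P, tau, P, P, C)
    return X.reshape(T // tau, (H // P) * (W // P), tau * P * P * C)

X = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens = tubelet_tokens(X, tau=2, P=16)
T_prime, N, dim = tokens.shape
print(T_prime, N, dim, T_prime * N)  # 16 196 1536 3136
```

With τ = 2 the sequence length drops from 6272 to 3136, so the joint attention matrix has 4 times fewer entries.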

Pros

Significant compute reduction
Strong for longer clips

Cons

If τ is large, you may lose fine motion details
Temporal aliasing can occur if content changes quickly between frames

1.5 ViViT Model 4: Factorized Attention + Token Pooling / Hierarchical Video Transformer

Idea

Build a hierarchical representation across layers (like CNN pyramids) so the model processes:

High resolution, many tokens early
Lower resolution, fewer tokens later

This is aligned with the “multiscale” philosophy (also used in MViT).

Common mechanisms:

1) Patch merging or pooling in space
2) Temporal pooling / merging across frames
3) Attention operating on progressively shorter sequences

Patch merging (concept)

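Merge each 2x2 block of neighboring spatial tokens into one.

The token grid shrinks by a factor of 2 along each spatial axis, and the channel dimension typically grows (or a projection maps the merged features to the next stage's width).

If the N spatial tokens form an (H/P) x (W/P) grid, then after merging: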

\[N' = \frac{N}{4}\]

Similarly, temporal merging could reduce T:

\[T' = \frac{T}{2}\]

Total tokens reduce:

\[L' = T' \cdot N'\]
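A minimal sketch of the merging step, assuming mean pooling over each 2x2 spatial neighborhood and over pairs of frames. Practical designs often concatenate the merged tokens and apply a learned projection instead; the function name and arguments here are hypothetical.

```python
import numpy as np

def merge_tokens(Z, grid_h, grid_w, merge_time=False):
    """Z: (T, N, D) with N = grid_h * grid_w. Mean-pools each 2x2 spatial
    neighborhood, and optionally each pair of consecutive frames."""
    T, N, D = Z.shape
    assert N == grid_h * grid_w and grid_h % 2 == 0 and grid_w % 2 == 0
    Z = Z.reshape(T, grid_h // 2, 2, grid_w // 2, 2, D).mean(axis=(2, 4))
    if merge_time:
        assert T % 2 == 0
        Z = Z.reshape(T // 2, 2, grid_h // 2, grid_w // 2, D).mean(axis=1)
    return Z.reshape(Z.shape[0], -1, D)

Z = np.ones((32, 196, 64))                     # T=32, 14x14 grid, D=64
Z_merged = merge_tokens(Z, grid_h=14, grid_w=14, merge_time=True)
print(Z_merged.shape)  # (16, 49, 64)
```

Here the token count goes from 32 · 196 = 6272 to 16 · 49 = 784, a factor of 8.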

Why this matters

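Attention cost scales with L^2, so token reduction is extremely powerful: with N' = N/4 and T' = T/2 as above, L shrinks by a factor of 8 and the attention cost by a factor of 64.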

Pros

Scales to longer clips and higher resolution
Learns coarse-to-fine temporal-spatial representations

Cons

Requires careful design of merging rules
Some fine details may be lost at late stages

2) TimeSformer (Time-Space Transformer)

TimeSformer is closely related to ViViT-style factorization and is usually explained with one simple idea:

Separate attention into temporal attention and spatial attention
Alternate or compose them in each layer

2.1 TimeSformer Factorized Attention

Let tokens be indexed by (t, i) where:

t: time index
i: spatial patch index

Temporal attention (per spatial position)

For each i, attend over t:

\[Z_{:,i} \in \mathbb{R}^{T \times D}\]

Temporal attention output:

\[\hat{Z}_{:,i} = \mathrm{Attn}(Z_{:,i})\]

Spatial attention (per time step)

For each t, attend over i:

\[Z_{t,:} \in \mathbb{R}^{N \times D}\]

Spatial attention output:

\[\hat{Z}_{t,:} = \mathrm{Attn}(Z_{t,:})\]

Layer composition

A common pattern is:

1) Temporal attention
2) Spatial attention
3) MLP

Each sub-layer has a residual connection and LayerNorm.
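The layer composition above can be sketched as one pre-norm block in NumPy. To keep it short, the attention here has no learned projections or multiple heads, LayerNorm has no learned scale and shift, and the MLP uses ReLU. All of these are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # LayerNorm over the feature axis, without learned scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(Z):
    # Projection-free attention over the second-to-last axis; leading axes are batch.
    scores = Z @ np.swapaxes(Z, -1, -2) / np.sqrt(Z.shape[-1])
    return softmax(scores) @ Z

def divided_space_time_block(Z, W1, W2):
    """Z: (T, N, D); W1: (D, D_hidden); W2: (D_hidden, D)."""
    # 1) Temporal attention: put N on the batch axis so attention runs over T.
    Zt = np.swapaxes(layer_norm(Z), 0, 1)            # (N, T, D)
    Z = Z + np.swapaxes(self_attention(Zt), 0, 1)    # back to (T, N, D)
    # 2) Spatial attention: T is the batch axis, attention runs over N.
    Z = Z + self_attention(layer_norm(Z))
    # 3) Position-wise MLP (ReLU here for brevity).
    return Z + np.maximum(layer_norm(Z) @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
T, N, D = 4, 9, 8                                    # toy sizes
Z = rng.normal(size=(T, N, D))
W1 = rng.normal(size=(D, 4 * D)) * 0.1
W2 = rng.normal(size=(4 * D, D)) * 0.1
out = divided_space_time_block(Z, W1, W2)
print(out.shape)  # (4, 9, 8)
```

The only difference between the temporal and spatial steps is which axis is treated as the batch, which is why the two costs are N · T² and T · N².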

2.2 Complexity Comparison (Intuition)

Full joint attention:

\[\mathcal{O}(T^2 N^2)\]

Factorized temporal + spatial:

Temporal part:

\[\mathcal{O}(N T^2)\]

Spatial part:

\[\mathcal{O}(T N^2)\]

Total:

\[\mathcal{O}(N T^2 + T N^2)\]

This is the same big picture as ViViT Model 2 (Flavor B1), explained in a time-first manner.

2.3 Practical Strengths and Weaknesses

Strengths:

Much more scalable than full attention
Natural separation of motion modeling (temporal) and appearance (spatial)
Strong performance on action recognition benchmarks

Weaknesses:

Can still be heavy if T and N are both large
Temporal attention per location can overfit or be noisy if motion is subtle
Camera motion can complicate temporal attention patterns (background changes dominate)

3) Multiscale Vision Transformers (MViT-style)

Key idea:

Use hierarchical, multiscale token representations
Reduce tokens progressively (space and time)
Use attention with pooling / striding so the model scales like CNN pyramids

This is one of the most important directions for scalable video transformers.

3.1 Why Multiscale for Video?

Video has two scaling dimensions:

Spatial resolution H x W
Temporal resolution T

If you keep all tokens everywhere, attention explodes.

Multiscale strategy:

Early layers: high resolution, small receptive field
Later layers: lower resolution, larger receptive field (long-range context)

This mimics CNN inductive bias while staying transformer-based.

3.2 Pooling Attention (Concept)

Instead of attending with Q, K, and V all at full length, we can downsample K and V.

Let the original token sequence have length L.

We compute:

\[Q \in \mathbb{R}^{L \times D}\]

but pool K and V to length L_p:

\[K_p \in \mathbb{R}^{L_p \times D}\]

\[V_p \in \mathbb{R}^{L_p \times D}\]

Now attention uses:

\[\mathrm{softmax} \left( \frac{Q K_p^T}{\sqrt{D}} \right) V_p\]

This reduces cost from:

\[\mathcal{O}(L^2)\]

to:

\[\mathcal{O}(L \cdot L_p)\]

If L_p << L, you save a lot: pooling K and V by a factor of 4, for example, shrinks the score matrix by the same factor of 4.

Pooling can be applied:

spatial pooling
temporal pooling
both (tubelet pooling)
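A minimal sketch of pooling attention, assuming mean pooling over non-overlapping windows of a flat token sequence. MViT-style models pool over the spatiotemporal grid with learned, strided operators, so treat this as an illustration of the shapes only; the function name and `stride` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(Q, K, V, stride):
    """Q, K, V: (L, D). Mean-pools K and V over windows of `stride` tokens,
    so the score matrix is (L, L/stride) instead of (L, L)."""
    L, D = K.shape
    assert L % stride == 0
    K_p = K.reshape(L // stride, stride, D).mean(axis=1)   # (L_p, D)
    V_p = V.reshape(L // stride, stride, D).mean(axis=1)   # (L_p, D)
    weights = softmax(Q @ K_p.T / np.sqrt(D))              # (L, L_p)
    return weights @ V_p, weights

rng = np.random.default_rng(0)
L, D = 64, 8                                               # toy sizes
Q, K, V = (rng.normal(size=(L, D)) for _ in range(3))
out, weights = pooled_attention(Q, K, V, stride=4)
print(out.shape, weights.shape)  # (64, 8) (64, 16)
```

The output still has one row per query, so the sequence length is preserved while the score matrix shrinks from L x L to L x L_p.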

3.3 Hierarchical Stages (Video Pyramid)

A typical multiscale transformer uses stages like:

Stage 1:

high spatial tokens
high temporal tokens

Stage 2:

reduce spatial and/or temporal

Stage 3:

reduce more

Stage 4:

very coarse tokens, global reasoning

At each stage you can increase channel dimension D while reducing L.

This resembles ResNet-style stage scaling.
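To see how the token count and width trade off across stages, here is a small calculator. The schedule (4x fewer spatial tokens and 2x wider channels per stage, temporal length unchanged) and the starting values are hypothetical choices for illustration, not the configuration of any specific published model.

```python
def stage_schedule(T, N, D, num_stages=4, space_factor=4, time_factor=1, width_factor=2):
    """Return [(stage, T, N, L, D, L*L)] for a hypothetical pyramid schedule."""
    rows = []
    for stage in range(1, num_stages + 1):
        L = T * N
        rows.append((stage, T, N, L, D, L * L))
        # Shrink the token grid and widen the channels for the next stage.
        T = max(T // time_factor, 1)
        N = max(N // space_factor, 1)
        D = D * width_factor
    return rows

for stage, T, N, L, D, scores in stage_schedule(T=8, N=3136, D=96):
    print(f"stage {stage}: T={T} N={N} L={L} D={D} full-attention scores={scores}")
```

Each 4x reduction in tokens cuts the full-attention score count by 16x, which is why most of the cost sits in the early stages.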

3.4 Inductive Bias vs Flexibility

Multiscale introduces inductive bias:

locality (early)
hierarchy (progressive abstraction)

This is often beneficial for data efficiency and generalization, especially on moderate-scale datasets.

The tradeoff:

you restrict the model relative to full joint attention
but you gain scalability and often real-world robustness

4) Practical Design Choices Cheat Sheet

4.1 When to prefer which

Full spatio-temporal (ViViT Model 1):
- best when T and N are small or compute is huge
- most flexible interactions
Factorized attention (TimeSformer / ViViT Model 2):
- strong baseline
- scales much better than full attention
- good when you want a clean, interpretable design
Tubelet / temporal compression (ViViT Model 3):
- best when clip length is long
- watch out for motion detail loss
Hierarchical / multiscale (ViViT Model 4 / MViT):
- best for scalability (high-res, long clips)
- tends to generalize well
- more engineering complexity

5) Summary Table

Family	Core Mechanism	Token Length Control	Main Benefit	Main Risk
ViViT Model 1	Joint space-time attention	None	Max flexibility	Very expensive
ViViT Model 2	Factorized space then time	Attention factorization	Much cheaper	Restricted interactions
ViViT Model 3	Tubelets / temporal compression	Reduce T early	Scales to long clips	Motion detail loss
ViViT Model 4	Hierarchical pooling/merging	Reduce T and N over stages	Best scalability	Design complexity
TimeSformer	Separate temporal + spatial attention	Factorization	Clean + scalable	Still heavy for big T,N
MViT	Multiscale + pooled attention	Hierarchy + pooling	CNN-like scaling	Needs careful pooling design

This post is licensed under CC BY 4.0 by the author.
