Convolution

Posted Feb 10, 2026 Updated Apr 14, 2026

🧠 From Fully-Connected Layers to Convolutional Neural Networks (CNNs)

1️⃣ Fully-Connected (FC) Layer

Definition

A fully-connected layer connects every input to every output, so each output can depend on all inputs.

Let:

Input vector: $x \in \mathbb{R}^d$
Weight matrix: $W \in \mathbb{R}^{c \times d}$
Bias: $b \in \mathbb{R}^c$
Output (score): $s \in \mathbb{R}^c$

Then,

$s = Wx + b$

📌 Example:

CIFAR-10 image: $32 \times 32 \times 3 = 3072$ values, so $d = 3072$
Classes: $c = 10$
Parameters: $3072 \times 10 + 10 = 30{,}730$ (weights plus biases)
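As a rough illustration of $s = Wx + b$ and the parameter count above, here is a minimal sketch in plain Python. The helper names `fc_forward` and `fc_param_count` are made up for this example, and the tiny weight matrix is arbitrary.

```python
def fc_forward(W, x, b):
    """Compute s = W x + b for a weight matrix W with c rows and d columns."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def fc_param_count(d, c):
    """c * d weights plus one bias per output."""
    return c * d + c


# Tiny example: d = 3 inputs, c = 2 outputs (values chosen arbitrarily).
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
x = [2.0, 4.0, 6.0]
b = [0.0, 1.0]

scores = fc_forward(W, x, b)                     # [-4.0, 7.0]
cifar_params = fc_param_count(32 * 32 * 3, 10)   # 30730
```

Every output row touches all $d$ inputs, which is why the count scales as $c \times d$.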

⚠️ Problems:

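Too many parameters: the count grows as $c \times d$, so it explodes for larger images or wider layers
No spatial structure: the image is flattened, so neighboring pixels are treated like any other pair of inputs
Overfitting risk: many free weights relative to the amount of training data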

2️⃣ Spatial Locality

Key Idea

Nearby pixels are more related than distant ones.

Instead of connecting everything:

Look at local patches
Reuse the same filter across space

This leads to convolution.

🧠 Human intuition:

Eyes, nose, edges → local patterns

3️⃣ Convolutional Layer

Single-Channel (Grayscale)

Input: $32 \times 32 \times 1$
Filter: $3 \times 3$
Operation: slide filter and compute dot product

At each location:

$y_{ij} = \sum_{u=1}^{3}\sum_{v=1}^{3} x_{i+u, j+v} \cdot w_{uv} + b$
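The sliding dot product can be sketched in a few lines of plain Python. This is a simplified illustration, assuming no padding, stride 1, a square filter, and zero-based indices (so the window for output $(i, j)$ starts at input $(i, j)$). The function name `conv2d_single` and the edge filter are hypothetical choices for the example.

```python
def conv2d_single(x, w, b=0.0):
    """Slide a square filter w over a 2D image x (no padding, stride 1)."""
    H, W = len(x), len(x[0])
    F = len(w)  # assumes an F x F filter
    out = []
    for i in range(H - F + 1):
        row = []
        for j in range(W - F + 1):
            acc = b
            # Dot product between the filter and the F x F window at (i, j).
            for u in range(F):
                for v in range(F):
                    acc += x[i + u][j + v] * w[u][v]
            row.append(acc)
        out.append(row)
    return out


# 3x5 image: flat on the left, a vertical step edge on the right.
image = [[0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1]]
# A simple vertical-edge filter (illustrative values).
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

feature_map = conv2d_single(image, edge_filter)  # [[0.0, 3.0, 3.0]]
```

The flat region produces 0 and the windows covering the edge produce 3, which is the sense in which a filter "detects" a local pattern.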

Multi-Channel (RGB)

Input: $32 \times 32 \times 3$
Filter: $3 \times 3 \times 3$

Each output pixel:

$y_{ij} = \sum_{c=1}^{3} \sum_{u,v} x_{i+u, j+v, c} \cdot w_{u,v,c} + b$

➡️ Parameters per filter: $3 \times 3 \times 3 + 1 = 28$
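For the multi-channel case, a single output value sums over the window and over all channels. The sketch below assumes a channels-last layout (`x[row][col][channel]`) and zero-based indices. `conv_at` and `params_per_filter` are illustrative names, not part of any library.

```python
def conv_at(x, w, b, i, j):
    """One output value: sum over the F x F window and all C channels, plus bias.

    x is indexed x[row][col][channel]; w is indexed w[u][v][channel].
    """
    F = len(w)
    C = len(w[0][0])
    return b + sum(x[i + u][j + v][c] * w[u][v][c]
                   for u in range(F) for v in range(F) for c in range(C))


def params_per_filter(F, C):
    """F * F * C weights plus one bias."""
    return F * F * C + 1


# 3x3 RGB patch where every pixel is (1, 2, 3), and a filter of all ones.
patch = [[[1, 2, 3] for _ in range(3)] for _ in range(3)]
ones_filter = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

value = conv_at(patch, ones_filter, b=0.5, i=0, j=0)  # 9 * (1 + 2 + 3) + 0.5 = 54.5
n_params = params_per_filter(3, 3)                    # 28
```

Note that the three channels collapse into one number: one filter always yields one output channel, regardless of the input depth.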

4️⃣ Multiple Filters = Multiple Feature Maps

If we use $K$ filters:

Output depth = $K$
Output volume: $W' \times H' \times K$

📌 Example:

Input: $32 \times 32 \times 3$
Filters: $4$ filters of $5 \times 5 \times 3$
Stride: $1$, Padding: $0$

Output size:

$W' = H' = 32 - 5 + 1 = 28$

➡️ Output: $28 \times 28 \times 4$

5️⃣ Output Size Formula (Very Important ⭐)

For convolution:

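$W' = \frac{W - F + 2P}{S} + 1$
$H' = \frac{H - F + 2P}{S} + 1$

Where:

$W$, $H$: input width and height
$F$: filter size (square $F \times F$ filter)
$S$: stride
$P$: padding (added on each side)

When $S$ does not divide $W - F + 2P$ evenly, implementations typically round the quotient down.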
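The formula is easy to turn into a small helper. The sketch below is one possible version: the name `conv_output_size` is hypothetical, and the use of floor division for non-exact cases is an assumption about rounding convention, not something the formula itself dictates.

```python
def conv_output_size(W, F, P=0, S=1):
    """Output width (or height) for input size W, filter F, padding P, stride S."""
    span = W - F + 2 * P
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    # Floor division: a final partial window is dropped (assumed convention).
    return span // S + 1


no_padding = conv_output_size(32, 5)          # 28
with_padding = conv_output_size(32, 5, P=2)   # 32
strided = conv_output_size(7, 3, S=2)         # 3
```

The same helper applies to height by passing $H$ instead of $W$.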

6️⃣ Padding

Why Padding?

Preserve spatial size
Allow deeper networks

Common choice (for stride $1$ and odd $F$):

$P = \frac{F - 1}{2}$

📌 Examples:

$F=3 \Rightarrow P=1$
$F=5 \Rightarrow P=2$

With this padding and stride $1$, the output size equals the input size.
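To make zero padding concrete, here is a small sketch that pads a 2D grid with $P$ zeros on every side and checks that $P = (F - 1)/2$ preserves the size at stride 1. The names `zero_pad` and `same_padding` are invented for this example.

```python
def same_padding(F):
    """Padding that keeps the size unchanged at stride 1 (odd F only)."""
    if F % 2 == 0:
        raise ValueError("(F - 1) / 2 is only an integer for odd F")
    return (F - 1) // 2


def zero_pad(x, P):
    """Surround a 2D grid with P rows/columns of zeros on every side."""
    width = len(x[0]) + 2 * P
    border = [[0] * width for _ in range(P)]
    middle = [[0] * P + list(row) + [0] * P for row in x]
    return border + middle + [[0] * width for _ in range(P)]


padded = zero_pad([[1, 2],
                   [3, 4]], 1)
# padded == [[0, 0, 0, 0],
#            [0, 1, 2, 0],
#            [0, 3, 4, 0],
#            [0, 0, 0, 0]]

# Size check for a 32-wide input with a 5-wide filter at stride 1.
P = same_padding(5)                 # 2
out_width = 32 + 2 * P - 5 + 1      # 32
```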

7️⃣ Padding Example

Given:

Input: $32 \times 32 \times 3$
Filters: $10$ of size $5 \times 5 \times 3$
Stride: $1$
Padding: $2$

Output:

$32 \times 32 \times 10$

Parameters:

$10 \times (5 \times 5 \times 3 + 1) = 760$

💥 Compared to an FC layer mapping the same input to the same output volume: $(32 \times 32 \times 10)(32 \times 32 \times 3 + 1) = 31{,}467{,}520$

🔥 Massive reduction!
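The two counts can be reproduced with a short calculation. This is only arithmetic on the numbers above, and the helper names `conv_params` and `fc_params` are hypothetical.

```python
def conv_params(K, F, C):
    """K filters, each with F * F * C weights and one bias."""
    return K * (F * F * C + 1)


def fc_params(n_in, n_out):
    """Each of the n_out outputs has n_in weights and one bias."""
    return n_out * (n_in + 1)


conv = conv_params(K=10, F=5, C=3)                     # 760
fc = fc_params(n_in=32 * 32 * 3, n_out=32 * 32 * 10)   # 31467520
ratio = fc // conv                                     # 41404
```

So for this layer, the convolution uses roughly 41,000 times fewer parameters than the fully-connected equivalent.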

8️⃣ 1×1 Convolution

What is it?

Filter size: $1 \times 1 \times C$

Each filter computes:

$y = \sum_{c=1}^{C} x_c w_c + b$

➡️ Mixes channels, not space

📌 Example:

Input: $32 \times 32 \times 3$
Filters: $6$ of $1 \times 1 \times 3$

Output: $32 \times 32 \times 6$

Parameters: $6 \times (3 + 1) = 24$
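A 1×1 convolution is just the same small linear map applied independently at every pixel. The sketch below assumes a channels-last layout and uses made-up filter values; `conv1x1` is an illustrative name.

```python
def conv1x1(x, filters, biases):
    """Apply K filters of shape 1 x 1 x C to x[row][col][channel]."""
    return [[[sum(xc * wc for xc, wc in zip(pixel, w)) + b
              for w, b in zip(filters, biases)]
             for pixel in row]
            for row in x]


# 2x2 RGB image (C = 3) and K = 2 filters: channel sum, and red minus blue.
x = [[[3, 6, 9], [0, 0, 0]],
     [[1, 1, 1], [9, 3, 0]]]
filters = [[1, 1, 1],
           [1, 0, -1]]
biases = [0, 0]

y = conv1x1(x, filters, biases)
# y == [[[18, -6], [0, 0]],
#       [[3, 0], [12, 9]]]
```

The spatial size stays $2 \times 2$ while the depth changes from 3 to 2, which is the channel-reduction use case.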

✨ Used in:

Bottlenecks
Channel reduction
ResNet, Inception

9️⃣ Nested Convolutional Layers

CNNs learn hierarchical features:

| Level | Learns |
|-------|--------|
| Low | Edges, colors |
| Mid | Corners, textures |
| High | Objects, faces |

🎯 Each level builds on the previous one.

🔟 Stride

Stride controls how many pixels the filter moves between consecutive positions.

📌 Example:

Input: $7 \times 7$
Filter: $3 \times 3$
Stride: $2$

Output size:

$\frac{7 - 3}{2} + 1 = 3$

➡️ Output: $3 \times 3$

🚀 Larger stride → smaller output → less computation
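One way to see where the output size comes from is to list the positions the filter actually visits along one axis. A minimal sketch, assuming no padding and zero-based indices (`window_starts` is a hypothetical helper):

```python
def window_starts(W, F, S):
    """Zero-based start index of every full window along one axis (no padding)."""
    return list(range(0, W - F + 1, S))


starts_stride_2 = window_starts(7, 3, 2)   # [0, 2, 4]
starts_stride_1 = window_starts(7, 3, 1)   # [0, 1, 2, 3, 4]
```

With stride 2 the filter lands on 3 positions per axis, giving the $3 \times 3$ output. With stride 1 it lands on 5.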

1️⃣1️⃣ Pooling Layer

Purpose

Downsampling
Reduce computation
Improve robustness

No learnable parameters ❌

Max Pooling

Filter: $2 \times 2$, Stride: $2$

Selects:

$\max \{x_{ij}\}$ over each $2 \times 2$ window

Preserves strongest activation 💪

Average Pooling

Computes:

$\frac{1}{4}\sum x_{ij}$

Smoother, but strong local activations get averaged out
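Both pooling variants can be sketched with one function that differs only in how each window is reduced. This is a simplified single-channel version with no padding. The name `pool2d` and its `mode` argument are assumptions made for the example.

```python
def pool2d(x, F=2, S=2, mode="max"):
    """Pool a 2D grid with an F x F window and stride S (no padding)."""
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H - F + 1, S):
        row = []
        for j in range(0, W - F + 1, S):
            window = [x[i + u][j + v] for u in range(F) for v in range(F)]
            if mode == "max":
                row.append(max(window))
            else:  # any other mode is treated as average pooling
                row.append(sum(window) / len(window))
        out.append(row)
    return out


x = [[1, 3, 2, 0],
     [5, 7, 1, 1],
     [0, 2, 4, 8],
     [2, 0, 6, 2]]

max_pooled = pool2d(x, mode="max")   # [[7, 2], [2, 8]]
avg_pooled = pool2d(x, mode="avg")   # [[4.0, 1.0], [1.0, 5.0]]
```

For a multi-channel input, the same operation would be applied to each channel separately, which is why the depth does not change.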

1️⃣2️⃣ Pooling Output Size

Same formula as convolution (without padding):

$W' = \frac{W - F}{S} + 1$
$H' = \frac{H - F}{S} + 1$

Depth remains unchanged.

1️⃣3️⃣ Convolution vs Fully-Connected

| Aspect | FC | Conv |
|--------|----|------|
| Connectivity | Global | Local |
| Parameters | Huge | Small |
| Spatial info | ❌ | ✅ |
| Weight sharing | ❌ | ✅ |

🧠 Conv is a special case of FC: most weights are forced to zero (local connectivity), and the remaining ones are shared across positions.
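The relationship can be checked on a tiny 1D example: writing the convolution as a matrix gives an FC weight matrix where each row holds the same filter values, shifted, with zeros everywhere else. This is a sketch under simplifying assumptions (1D input, no padding, stride 1, no bias), and all function names are hypothetical.

```python
def conv1d(x, w):
    """1D sliding dot product (no padding, stride 1)."""
    F = len(w)
    return [sum(x[i + u] * w[u] for u in range(F))
            for i in range(len(x) - F + 1)]


def conv_as_matrix(w, n):
    """FC weight matrix equivalent to conv1d with filter w on inputs of length n."""
    F = len(w)
    return [[w[j - i] if 0 <= j - i < F else 0 for j in range(n)]
            for i in range(n - F + 1)]


def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]

M = conv_as_matrix(w, len(x))
# M == [[1, 0, -1, 0,  0],
#       [0, 1,  0, -1, 0],
#       [0, 0,  1, 0, -1]]

via_conv = conv1d(x, w)    # [-2, -2, -2]
via_fc = matvec(M, x)      # [-2, -2, -2]
```

The matrix has 15 entries but only 3 distinct learnable values, which is the weight sharing and sparsity from the comparison above.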

🎯 Final Summary

Convolutional Layer Hyperparameters

Number of filters: $K$
Filter size: $F$
Stride: $S$
Padding: $P$

Output: $W' \times H' \times K$

Parameters: $K(F^2C + 1)$, where $C$ is the number of input channels

Pooling Layer Hyperparameters

Filter size: $F$
Stride: $S$

Parameters: $0$ ❌

✅ Takeaways

CNNs exploit spatial locality
Weight sharing drastically reduces parameters
Deep stacking → hierarchical features
Pooling + stride control resolution

🚀 CNNs scale to large images efficiently

This post is licensed under CC BY 4.0 by the author.