Convolution

🧠 From Fully-Connected Layers to Convolutional Neural Networks (CNNs)


1️⃣ Fully-Connected (FC) Layer

Definition

A fully-connected layer assumes every input influences every output.

Let:

  • Input vector: $x \in \mathbb{R}^d$
  • Weight matrix: $W \in \mathbb{R}^{c \times d}$
  • Bias: $b \in \mathbb{R}^c$
  • Output (score): $s \in \mathbb{R}^c$

Then,

$s = Wx + b$

📌 Example:

  • CIFAR-10 image: $32 \times 32 \times 3 = 3072$
  • Classes: $c = 10$
  • Parameters: $3072 \times 10 + 10 = 30,730$
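The score computation and the parameter count above can be checked with a minimal NumPy sketch. The random input and the `0.01` weight scale are placeholders for illustration, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

d, c = 32 * 32 * 3, 10                  # flattened CIFAR-10 image, 10 classes
x = rng.standard_normal(d)              # stand-in for a flattened image (assumption)
W = rng.standard_normal((c, d)) * 0.01  # weight matrix, one row per class
b = np.zeros(c)                         # bias, one entry per class

s = W @ x + b                           # s = Wx + b, one score per class
num_params = W.size + b.size

print(s.shape)      # (10,)
print(num_params)   # 30730
```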

⚠️ Problems:

  • Too many parameters: the count grows as (input size) × (output size)
  • No spatial structure: flattening discards which pixels are neighbors
  • Overfitting risk

2️⃣ Spatial Locality

Key Idea

Nearby pixels are more related than distant ones.

Instead of connecting everything:

  • Look at local patches
  • Reuse the same filter across space

This leads to convolution.

🧠 Human intuition:

  • Eyes, nose, edges → local patterns

3️⃣ Convolutional Layer

Single-Channel (Grayscale)

  • Input: $32 \times 32 \times 1$
  • Filter: $3 \times 3$
  • Operation: slide the filter over the image and compute a dot product at each position

At each location:

$y_{ij} = \sum_{u=1}^{3}\sum_{v=1}^{3} x_{i+u, j+v} \cdot w_{uv} + b$
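As a rough sketch of the formula above (0-based indices, stride 1, no padding), a naive loop version might look like this. The function name `conv2d_single` and the averaging filter are illustrative choices, not part of the original post.

```python
import numpy as np

def conv2d_single(x, w, b=0.0):
    """Slide a square filter w over a 2-D image x (stride 1, no padding)."""
    H, W = x.shape
    F = w.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + F, j:j + F]        # local F x F neighborhood
            out[i, j] = np.sum(patch * w) + b  # dot product plus bias
    return out

x = np.arange(32 * 32, dtype=float).reshape(32, 32)  # toy grayscale image
w = np.ones((3, 3)) / 9.0                            # hypothetical averaging filter
y = conv2d_single(x, w)

print(y.shape)  # (30, 30)
```

Real libraries vectorize this heavily; the double loop is only meant to mirror the sum in the equation.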


Multi-Channel (RGB)

  • Input: $32 \times 32 \times 3$
  • Filter: $3 \times 3 \times 3$

Each output pixel:

$y_{ij} = \sum_{c=1}^{3} \sum_{u,v} x_{i+u, j+v, c} \cdot w_{u,v,c} + b$

➡️ Parameters per filter: $3 \times 3 \times 3 + 1 = 28$


4️⃣ Multiple Filters = Multiple Feature Maps

If we use $K$ filters:

  • Output depth = $K$
  • Output volume: $W' \times H' \times K$

📌 Example:

  • Input: $32 \times 32 \times 3$
  • Filters: $4$ filters of $5 \times 5 \times 3$
  • Stride: $1$, Padding: $0$

Output size:

$W' = H' = 32 - 5 + 1 = 28$

➡️ Output: $28 \times 28 \times 4$
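The example above can be reproduced with a naive multi-channel, multi-filter sketch. The function `conv2d`, the zero padding, and the `(H', W', K)` array layout are assumptions made for illustration; frameworks often use other layouts.

```python
import numpy as np

def conv2d(x, filters, biases, stride=1, pad=0):
    """x: (H, W, C); filters: (K, F, F, C); biases: (K,). Returns (H', W', K)."""
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad spatial dims only
    H, W, _ = x.shape
    K, F = filters.shape[0], filters.shape[1]
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for i in range(H_out):
        for j in range(W_out):
            r, c = i * stride, j * stride
            patch = x[r:r + F, c:c + F, :]
            # Dot product of the patch with each of the K filters at once.
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2])) + biases
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((4, 5, 5, 3))  # 4 filters of 5 x 5 x 3
biases = np.zeros(4)

out = conv2d(x, filters, biases)
print(out.shape)  # (28, 28, 4)
```

Each of the 4 filters yields one $28 \times 28$ feature map, and the maps are stacked along the last axis.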


5️⃣ Output Size Formula (Very Important ⭐)

For convolution:

$W' = \frac{W - F + 2P}{S} + 1$
$H' = \frac{H - F + 2P}{S} + 1$

Where:

  • $F$: filter size
  • $S$: stride
  • $P$: padding
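A small helper makes the formula easy to sanity-check. The name `conv_output_size` is hypothetical, and rounding down when the stride does not divide evenly is an assumption here (the post does not say what happens in that case, though common frameworks behave this way).

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Output size along one spatial dimension: (n - f + 2*pad) / stride + 1."""
    span = n - f + 2 * pad
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    # Assumption: floor division drops positions where the filter would not fit.
    return span // stride + 1

print(conv_output_size(32, 5))            # 28
print(conv_output_size(32, 5, pad=2))     # 32
print(conv_output_size(7, 3, stride=2))   # 3
```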

6️⃣ Padding

Why Padding?

  • Preserve spatial size
  • Allow deeper networks

Common choice (odd $F$, stride $1$):

$P = \frac{F - 1}{2}$

📌 Examples:

  • $F=3 \Rightarrow P=1$
  • $F=5 \Rightarrow P=2$

With this padding and stride $1$, the output has the same spatial size as the input.


7️⃣ Padding Example

Given:

  • Input: $32 \times 32 \times 3$
  • Filters: $10$ of size $5 \times 5 \times 3$
  • Stride: $1$
  • Padding: $2$

Output:

$32 \times 32 \times 10$

Parameters:

$10 \times (5 \times 5 \times 3 + 1) = 760$

💥 Compared to FC: $(32 \times 32 \times 10)(32 \times 32 \times 3 + 1) = 31,467,520$

🔥 Massive reduction!
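The two parameter counts can be verified with plain arithmetic. The helper names `conv_params` and `fc_params` are made up for this sketch.

```python
def conv_params(k, f, c_in):
    """K filters, each with F*F*C weights plus one bias."""
    return k * (f * f * c_in + 1)

def fc_params(n_in, n_out):
    """Each output unit has n_in weights plus one bias."""
    return n_out * (n_in + 1)

conv = conv_params(k=10, f=5, c_in=3)
fc = fc_params(n_in=32 * 32 * 3, n_out=32 * 32 * 10)

print(conv)  # 760
print(fc)    # 31467520
print(f"FC needs about {fc / conv:,.0f}x more parameters")
```

For this layer the fully-connected version needs roughly 41,000 times as many parameters.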


8️⃣ 1×1 Convolution

What is it?

Filter size: $1 \times 1 \times C$

Each filter computes:

$y = \sum_{c=1}^{C} x_c w_c + b$

➡️ Mixes channels, not space

📌 Example:

  • Input: $32 \times 32 \times 3$
  • Filters: $6$ of $1 \times 1 \times 3$

Output: $32 \times 32 \times 6$

Parameters: $6 \times (3 + 1) = 24$
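Because a $1 \times 1$ filter only sees one pixel at a time, the whole layer reduces to one small matrix multiply applied at every pixel. A minimal NumPy sketch, assuming an `(H, W, C)` array layout and random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))  # H x W x C input
W = rng.standard_normal((6, 3))       # 6 filters, each 1 x 1 x 3
b = np.zeros(6)

# The same (6 x 3) matrix is applied to the channel vector at every pixel.
y = x @ W.T + b

print(y.shape)           # (32, 32, 6)
print(W.size + b.size)   # 24
```

The spatial size is untouched; only the channel dimension changes from 3 to 6.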

✨ Used in:

  • Bottlenecks
  • Channel reduction
  • ResNet, Inception

9️⃣ Nested Convolutional Layers

CNNs learn hierarchical features:

| Level | Learns |
|---|---|
| Low | Edges, colors |
| Mid | Corners, textures |
| High | Objects, faces |

🎯 Each level builds on the previous one.


🔟 Stride

Stride controls how far the filter moves.

📌 Example:

  • Input: $7 \times 7$
  • Filter: $3 \times 3$
  • Stride: $2$

Output size:

$\frac{7 - 3}{2} + 1 = 3$

➡️ Output: $3 \times 3$

🚀 Larger stride → smaller output → faster
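To see where the 3 comes from, it helps to list the positions the filter actually visits. The helper `window_starts` is a hypothetical name, and indices are 0-based.

```python
def window_starts(n, f, stride):
    """0-based start index of every width-f window that fits inside length n."""
    return list(range(0, n - f + 1, stride))

starts = window_starts(7, 3, 2)
print(starts)       # [0, 2, 4]
print(len(starts))  # 3

# With stride 1 the same input gives 5 positions per dimension.
print(len(window_starts(7, 3, 1)))  # 5
```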


1️⃣1️⃣ Pooling Layer

Purpose

  • Downsampling
  • Reduce computation
  • Improve robustness

No learnable parameters ❌


Max Pooling

Filter: $2 \times 2$, Stride: $2$

Selects:

$\max \{x_{ij}\}$

Preserves strongest activation 💪


Average Pooling

Computes (for a $2 \times 2$ window):

$\frac{1}{4}\sum x_{ij}$

Smoother but less sharp
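Both pooling variants can be sketched with a reshape trick on a single feature map. The function `pool2x2` and the toy input are illustrative, and the sketch assumes even height and width.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 on an (H, W) map; H and W are assumed even."""
    H, W = x.shape
    # Split into non-overlapping 2x2 blocks: axes 1 and 3 index inside a block.
    blocks = x.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]], dtype=float)

max_out = pool2x2(x, "max")  # values: [[4, 2], [2, 8]]
avg_out = pool2x2(x, "avg")  # values: [[2.5, 1.0], [0.75, 6.5]]
print(max_out)
print(avg_out)
```

For a multi-channel input the same operation is applied to each channel independently.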


1️⃣2️⃣ Pooling Output Size

Same formula as convolution (without padding):

$W' = \frac{W - F}{S} + 1$
$H' = \frac{H - F}{S} + 1$

Depth remains unchanged.


1️⃣3️⃣ Convolution vs Fully-Connected

| Aspect | FC | Conv |
|---|---|---|
| Connectivity | Global | Local |
| Parameters | Huge | Small |
| Spatial info | ❌ | ✅ |
| Weight sharing | ❌ | ✅ |

🧠 Conv is a special case of FC: most weights are fixed to zero (local connectivity) and the remaining ones are tied across positions (weight sharing).
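A 1-D toy case makes this equivalence concrete. The input and filter values below are arbitrary, chosen only for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # 1-D input of length 5
w = np.array([1.0, 2.0, -1.0])           # hypothetical filter of length 3

# Sliding dot product (stride 1, no padding): 5 - 3 + 1 = 3 outputs.
y_conv = np.array([x[i:i + 3] @ w for i in range(3)])

# Equivalent fully-connected weight matrix: every row holds the same
# filter shifted by one position, and all other entries are zero.
M = np.zeros((3, 5))
for i in range(3):
    M[i, i:i + 3] = w

y_fc = M @ x

print(M)
print(np.allclose(y_conv, y_fc))  # True
```

The matrix `M` has 15 entries but only 3 distinct learnable values, which is where the parameter savings come from.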


🎯 Final Summary

Convolutional Layer Hyperparameters

  • Number of filters: $K$
  • Filter size: $F$
  • Stride: $S$
  • Padding: $P$

Output: $W' \times H' \times K$

Parameters: $K(F^2C + 1)$


Pooling Layer Hyperparameters

  • Filter size: $F$
  • Stride: $S$

Parameters: $0$ ❌


✅ Takeaways

  • CNNs exploit spatial locality
  • Weight sharing drastically reduces parameters
  • Deep stacking → hierarchical features
  • Pooling + stride control resolution

🚀 CNNs scale to large images efficiently

This post is licensed under CC BY 4.0 by the author.