Loop Unrolling in 5x5 Convolution: Why It Matters
Loop Unrolling
Prerequites
1
2
C++
Convolution operation
What is Loop Unrolling
In image processing, convolution is one of the most fundamental operations.
Whether it is smoothing, sharpening, edge detection, or feature extraction, many operations are built on top of convolution kernels.
Among them, a 5×5 convolution is a common example because it is still small enough to optimize aggressively while being large enough to show meaningful performance differences.
One classic low-level optimization technique here is loop unrolling.
Loop unrolling is an optimization technique where loop iterations are expanded manually.
Instead of writing:
1
2
3
4
for(int i = 0; i < 5; ++i)
{
sum += a[i] * b[i];
}
we expand it as:
1
2
3
4
5
sum += a[0] * b[0];
sum += a[1] * b[1];
sum += a[2] * b[2];
sum += a[3] * b[3];
sum += a[4] * b[4];
The goal is to reduce:
- loop counter operations
- branch instructions
- repeated index calculations
- address computations
1. How to deal with Loop unrolling on 5×5 Convolution Formula
For an image src and kernel K:
Each pixel computation performs:
- 25 multiplications
- 24 additions
Even small overhead reductions can matter when applied to millions of pixels.
Basic 5×5 Convolution Implementation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
void Convolution5x5_Basic(
const uint8_t* src,
uint8_t* dst,
int width,
int height,
int stride,
const float kernel[5][5])
{
for (int y = 2; y < height - 2; ++y)
{
for (int x = 2; x < width - 2; ++x)
{
float sum = 0.0f;
for (int ky = 0; ky < 5; ++ky)
{
for (int kx = 0; kx < 5; ++kx)
{
int ix = x + kx - 2;
int iy = y + ky - 2;
sum += src[iy * stride + ix] * kernel[ky][kx];
}
}
if (sum < 0) sum = 0;
if (sum > 255) sum = 255;
dst[y * stride + x] = (uint8_t)sum;
}
}
}
Characteristics
Pros
- simple
- readable
- flexible
Cons
- loop overhead
- repeated indexing
- additional arithmetic operations
Fully Unrolled 5×5 Convolution
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
void Convolution5x5_Unrolled(
const uint8_t* src,
uint8_t* dst,
int width,
int height,
int stride,
const float kernel[5][5])
{
for (int y = 2; y < height - 2; ++y)
{
for (int x = 2; x < width - 2; ++x)
{
const uint8_t* r0 = src + (y-2)*stride + (x-2);
const uint8_t* r1 = src + (y-1)*stride + (x-2);
const uint8_t* r2 = src + (y)*stride + (x-2);
const uint8_t* r3 = src + (y+1)*stride + (x-2);
const uint8_t* r4 = src + (y+2)*stride + (x-2);
float sum = 0;
sum += r0[0]*kernel[0][0];
sum += r0[1]*kernel[0][1];
sum += r0[2]*kernel[0][2];
sum += r0[3]*kernel[0][3];
sum += r0[4]*kernel[0][4];
sum += r1[0]*kernel[1][0];
sum += r1[1]*kernel[1][1];
sum += r1[2]*kernel[1][2];
sum += r1[3]*kernel[1][3];
sum += r1[4]*kernel[1][4];
sum += r2[0]*kernel[2][0];
sum += r2[1]*kernel[2][1];
sum += r2[2]*kernel[2][2];
sum += r2[3]*kernel[2][3];
sum += r2[4]*kernel[2][4];
sum += r3[0]*kernel[3][0];
sum += r3[1]*kernel[3][1];
sum += r3[2]*kernel[3][2];
sum += r3[3]*kernel[3][3];
sum += r3[4]*kernel[3][4];
sum += r4[0]*kernel[4][0];
sum += r4[1]*kernel[4][1];
sum += r4[2]*kernel[4][2];
sum += r4[3]*kernel[4][3];
sum += r4[4]*kernel[4][4];
if (sum < 0)
sum = 0;
if (sum > 255)
sum = 255;
dst[y * stride + x] = (uint8_t)sum;
}
}
}
2. Why Loop Unrolling Can Be Faster
The improvement does not come from reducing the number of multiplications.
Both implementations still perform:
- 25 multiplications
- 24 additions
The benefit comes from reducing surrounding overhead.
2-1. Reduced Loop Control
The nested loops remove:
- counter updates
- branch checks
2-2. Reduced Index Arithmetic
Instead of repeatedly computing:
1
iy * stride + ix
row pointers are calculated once.
2-3. Better Instruction Scheduling
Flat sequences of operations allow the compiler to:
- schedule instructions efficiently
- pipeline operations
- reduce dependency stalls
3. Performance Intuition
For a 1920×1080 image
Pixels:
\[1920 \times 1080 \approx 2.07M\]Operations per frame:
\[2.07M \times 25 \approx 51.75M\]Even tiny savings per pixel can produce measurable gains.
Trade-offs
| Aspect | Nested Loop | Unrolled |
|---|---|---|
| Readability | High | Low |
| Maintainability | High | Low |
| Flexibility | High | Low |
| Loop overhead | Present | Removed |
| Index arithmetic | More | Less |
| Code size | Small | Large |
Manual unrolling is useful when:
- kernel size is fixed
- code is performance critical
- you control the full pipeline
- profiling shows convolution is a bottleneck
Otherwise, maintainable code is usually preferable.
Loop unrolling demonstrates an important principle in high-performance computing. Sometimes the cost is not in the math itself.
The cost is in how the computation is structured. For small fixed kernels such as 5×5 convolution, removing loop overhead can provide meaningful speed improvements in performance-critical systems.