SIMD (Single Instruction Multiple Data)
Prerequisites
- C++
- Basic CPU architecture
- Arrays / memory layout
- Basic understanding of performance optimization
1. What is SIMD?
SIMD stands for Single Instruction, Multiple Data.
It is a hardware-level technique that allows a single instruction to process multiple data elements at the same time.
SIMD is not just parallelism.
It is supported by a dedicated execution path inside the CPU, through vector execution units and vector registers.
A CPU does not only have a general-purpose ALU.
Modern CPUs also include vector units specifically designed for SIMD operations. Which vector extensions are available depends on the CPU: on a production line I worked on, the old CPUs in the Windows 7 machines did not support AVX, only SSE.
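Because support differs between machines, a program can check for AVX at run time before choosing a code path. The following is a minimal sketch, assuming GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available. The function names `cpu_has_avx` and `pick_simd_path` are hypothetical, and other compilers would need a different check (for example CPUID-based).

```cpp
// Returns true when the CPU running this program reports AVX support.
// Assumption: GCC or Clang on x86; anything else takes the conservative fallback.
bool cpu_has_avx() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // unknown platform: assume AVX is not available
#endif
}

// Hypothetical dispatcher: names the code path a program would select.
const char* pick_simd_path() {
    return cpu_has_avx() ? "AVX" : "SSE";
}
```

A real program would call the AVX or SSE version of its kernel based on this result instead of returning a string.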
1-1. Scalar vs SIMD
Instead of processing elements one by one:
for (int i = 0; i < 4; i++)
    C[i] = A[i] + B[i];
SIMD processes multiple elements in one instruction:
$[A0\;A1\;A2\;A3] + [B0\;B1\;B2\;B3] = [C0\;C1\;C2\;C3]$
#include <emmintrin.h>
__m128i m128A = _mm_loadu_si128((const __m128i*)arrI32A);
__m128i m128B = _mm_loadu_si128((const __m128i*)arrI32B);
__m128i m128C = _mm_add_epi32(m128A, m128B);
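The fragment above stops after the addition and does not write the result back to memory. A minimal complete sketch, assuming three arrays of four 32-bit integers and a hypothetical function name `add4_i32`, could look like this:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Adds four 32-bit integers from a and b and writes the four sums to out.
void add4_i32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi32(va, vb);                     // four additions in one instruction
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vc);  // write all four results back
}
```

The unaligned load and store variants are used so the sketch works for any array address.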
1-2. SIMD = Dedicated Hardware Path
SIMD works because the CPU has dedicated hardware:
- Vector registers
- Vector execution units
- SIMD instruction decoding
- Packed arithmetic datapaths
Scalar path:
- ALU / FPU
- One element per instruction
SIMD path:
- Vector Unit
- Multiple elements per instruction
1-3. Register Width
SSE → 128-bit
AVX → 256-bit
AVX-512 → 512-bit
Example (32-bit floats per register):
- SSE → 4 floats
- AVX → 8 floats
- AVX-512 → 16 floats
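The element counts above follow directly from dividing the register width by the element size. A small sketch of that arithmetic, assuming 8-bit bytes and a 32-bit `float` (the helper name `lanes` is hypothetical):

```cpp
#include <cstddef>

// Number of elements of type T that fit in a vector register of the given width.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(lanes<float>(128) == 4,  "SSE: 4 floats");
static_assert(lanes<float>(256) == 8,  "AVX: 8 floats");
static_assert(lanes<float>(512) == 16, "AVX-512: 16 floats");
static_assert(lanes<double>(256) == 4, "AVX: 4 doubles");
```

The same formula explains why wider element types get fewer lanes: a 256-bit register holds 8 floats but only 4 doubles.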
1-4. Instruction Latency
Not all SIMD instructions are equal.
- add → cheap
- multiply → moderate
- divide → expensive
Optimization means:
- avoid expensive instructions
- replace division with multiplication when possible
- check instruction latency
Intel Intrinsics Guide (lists latency and throughput for many intrinsics): intel.com/content/www/us/en/docs/intrinsics-guide/index.html
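As an illustration of replacing division with multiplication, the sketch below divides an array by a constant in two ways. The function names are hypothetical. The two versions can differ in the last bit for divisors whose reciprocal is not exactly representable, which is why compilers generally do not make this change on their own without relaxed floating-point settings.

```cpp
#include <cstddef>

// Baseline: one division per element.
void scale_div(float* data, std::size_t n, float d) {
    for (std::size_t i = 0; i < n; ++i) data[i] = data[i] / d;
}

// One division up front, then one (cheaper) multiplication per element.
void scale_mul(float* data, std::size_t n, float d) {
    const float inv = 1.0f / d;
    for (std::size_t i = 0; i < n; ++i) data[i] = data[i] * inv;
}
```

Whether the rounding difference is acceptable depends on the application, so this is a trade-off and not a free optimization.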
2. How to use SIMD?
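A common pattern is a main loop that handles 8 floats per iteration with AVX, followed by a scalar tail loop for the leftover elements:
#include <immintrin.h>

// A, B, C are float arrays of length N (int)
int i = 0;
// main loop: 8 floats per iteration
for (; i <= N - 8; i += 8)
{
    __m256 a = _mm256_loadu_ps(A + i);
    __m256 b = _mm256_loadu_ps(B + i);
    __m256 c = _mm256_add_ps(a, b);
    _mm256_storeu_ps(C + i, c);
}
// tail: remaining elements, one at a time
for (; i < N; i++)
    C[i] = A[i] + B[i];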
2-1. Core Optimization Factors
- Data structure design
- Instruction latency
- Compiler behavior
- Memory alignment
- Cache locality
- Memory bandwidth
2-2. Data Layout
Bad (AoS):
struct Pixel
{
    float R;
    float G;
    float B;
};
Good (SoA):
float R[N], G[N], B[N];
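SIMD prefers contiguous data. With SoA, consecutive R values sit next to each other and can be loaded into one register directly; with AoS, R, G, and B are interleaved, so extra shuffles are needed to separate the channels.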
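To show the difference in practice, here is a small sketch that converts AoS pixels into SoA channel arrays and then processes one channel as a plain contiguous array. `PixelsSoA`, `to_soa`, and `scale_red` are hypothetical names, and `std::vector` is used in place of the fixed-size arrays above for convenience.

```cpp
#include <cstddef>
#include <vector>

struct Pixel { float R, G, B; };  // AoS element, as in the text

struct PixelsSoA {                // hypothetical SoA container
    std::vector<float> R, G, B;
};

// Copies interleaved pixels into three contiguous channel arrays.
PixelsSoA to_soa(const std::vector<Pixel>& pixels) {
    PixelsSoA out;
    out.R.reserve(pixels.size());
    out.G.reserve(pixels.size());
    out.B.reserve(pixels.size());
    for (const Pixel& p : pixels) {
        out.R.push_back(p.R);
        out.G.push_back(p.G);
        out.B.push_back(p.B);
    }
    return out;
}

// The red channel is now one contiguous float array, the shape a vector load expects.
void scale_red(PixelsSoA& px, float gain) {
    for (std::size_t i = 0; i < px.R.size(); ++i) px.R[i] *= gain;
}
```

The conversion itself costs a pass over the data, so it pays off mainly when the SoA form is kept for many subsequent operations.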
2-3. Memory Alignment
_mm256_load_ps // aligned
_mm256_loadu_ps // unaligned
float* data = (float*)_mm_malloc(sizeof(float) * N, 32);
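_mm256_load_ps requires a 32-byte-aligned address and faults otherwise; _mm256_loadu_ps accepts any address. Aligned data also avoids loads that straddle a cache line, which is where much of the alignment benefit comes from. Memory obtained from _mm_malloc must be released with _mm_free.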
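One way to keep aligned allocations safe is to pair `_mm_malloc` with `_mm_free` through a smart pointer. The sketch below is one possible arrangement, not the only one; `MmFree`, `AlignedFloats`, `make_aligned_floats`, and `is_aligned` are hypothetical names.

```cpp
#include <immintrin.h>  // _mm_malloc, _mm_free
#include <cstddef>
#include <cstdint>
#include <memory>

// Deleter so that memory from _mm_malloc is always released with _mm_free.
struct MmFree {
    void operator()(float* p) const { _mm_free(p); }
};
using AlignedFloats = std::unique_ptr<float[], MmFree>;

// Allocates n floats on a 32-byte boundary (the alignment _mm256_load_ps requires).
AlignedFloats make_aligned_floats(std::size_t n) {
    return AlignedFloats(static_cast<float*>(_mm_malloc(sizeof(float) * n, 32)));
}

// True when p is a multiple of the given alignment in bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A check such as `is_aligned(ptr, 32)` can guard a debug build before an aligned load is used on a pointer of unknown origin.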
2-4. Memory is Often the Bottleneck
SIMD improves compute throughput, but performance depends on:
- cache locality
- memory bandwidth
- access pattern
If memory is slow, SIMD gains are limited.
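A rough estimate makes this limit concrete for the element-wise add `C[i] = A[i] + B[i]` on 32-bit floats. The bandwidth value is an assumed round number for illustration, not a measurement, and the estimate ignores caches and any extra write traffic.

$$\text{bytes per element} = \underbrace{4}_{A[i]} + \underbrace{4}_{B[i]} + \underbrace{4}_{C[i]} = 12, \qquad \text{additions per element} = 1$$

$$\text{additions per second} \le \frac{W}{12\ \text{bytes}}, \qquad W = 24\ \text{GB/s (assumed)} \;\Rightarrow\; \frac{24 \times 10^{9}}{12} = 2 \times 10^{9}$$

Under these assumptions the loop cannot exceed about two billion additions per second when the data comes from main memory, no matter how wide the vector registers are. Only reducing memory traffic or reusing data already in cache raises that ceiling.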
3. What are the challenges of SIMD?
3-1. Branching Problem
SIMD is weak when the code has:
- many branches
- unpredictable logic
- irregular flow
Best case:
- same operation
- no branching
- continuous data
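Simple branches can often be turned into a compare plus a bitwise select, so every lane runs the same instructions. The sketch below applies this to a threshold operation using SSE intrinsics; the function name and parameters are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// out[i] = (in[i] > threshold) ? hi : lo, with no branch inside the vector loop.
void threshold_select(const float* in, float* out, std::size_t n,
                      float threshold, float hi, float lo) {
    const __m128 vt  = _mm_set1_ps(threshold);
    const __m128 vhi = _mm_set1_ps(hi);
    const __m128 vlo = _mm_set1_ps(lo);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v    = _mm_loadu_ps(in + i);
        __m128 mask = _mm_cmpgt_ps(v, vt);                 // all-ones in lanes where v > threshold
        __m128 r    = _mm_or_ps(_mm_and_ps(mask, vhi),     // hi where the mask is set
                                _mm_andnot_ps(mask, vlo)); // lo everywhere else
        _mm_storeu_ps(out + i, r);
    }
    for (; i < n; ++i) out[i] = (in[i] > threshold) ? hi : lo;  // scalar tail
}
```

Both outcomes are computed for every lane and the mask picks one, so this only helps when each side of the branch is cheap.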
3-2. When SIMD Works Well
- Image processing
- Signal processing
- Matrix operations
- Pixel-wise transforms
3-3. When SIMD is Hard
- Branch-heavy logic
- Random memory access
- Small data
- Data dependency
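The data-dependency case is easiest to see by comparing two loops. In the sketch below (hypothetical function names), the first loop has independent iterations, while the second carries a value from one iteration to the next and so cannot be vectorized in this straightforward form.

```cpp
#include <cstddef>

// Independent iterations: out[i] depends only on in[i],
// so several elements can be computed by one instruction.
void add_one(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] + 1.0f;
}

// Loop-carried dependency: out[i] needs out[i - 1],
// so iteration i cannot start until iteration i - 1 has finished.
void prefix_sum(const float* in, float* out, std::size_t n) {
    if (n == 0) return;
    out[0] = in[0];
    for (std::size_t i = 1; i < n; ++i) out[i] = out[i - 1] + in[i];
}
```

Dependent loops like the second one usually need a restructured algorithm before SIMD can help.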
3-4. Important Insight
SIMD does not guarantee speedup.
Real performance depends on:
- memory
- cache
- algorithm design
- compiler output
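Since a speedup is not guaranteed, it helps to time the scalar and SIMD versions on realistic data before committing to one. The helper below is a minimal sketch of such a measurement using `std::chrono`; the name `time_us` is hypothetical, and a serious benchmark would also need warm-up runs and a way to stop the compiler from removing the measured work.

```cpp
#include <chrono>

// Runs fn `repeats` times and returns the average wall-clock time per run in microseconds.
template <typename Fn>
double time_us(Fn&& fn, int repeats) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) fn();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / repeats;
}
```

Calling it once with the scalar loop and once with the SIMD loop on the same input gives a first indication of whether memory or compute is the limit.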