SIMD (Single Instruction Multiple Data)

Posted Mar 18, 2026


Prerequisites

C++
Basic CPU architecture
Arrays / memory layout
Basic understanding of performance optimization

1. What is SIMD?

SIMD stands for Single Instruction, Multiple Data.

It is a hardware-level technique that allows a single instruction to process multiple data elements at the same time.

SIMD is not just parallelism.
It is supported by a dedicated execution path inside the CPU, through vector execution units and vector registers.

A CPU does not only have a general-purpose ALU.
Modern CPUs also include vector units specifically designed for SIMD operations. This is why SIMD support depends on the CPU: on a production line I visited, the machines ran Windows 7 on old CPUs that did not support AVX, only SSE.
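Because SIMD support differs between CPUs, a program can check at run time which instruction sets are available before picking a code path. The following is a minimal sketch, assuming GCC or Clang on x86, where the `__builtin_cpu_supports` builtin exists. `widest_simd_level` is a hypothetical helper name, and other compilers such as MSVC would need a different mechanism.

```cpp
#include <string>

// Hypothetical helper: reports the widest SIMD level the running CPU supports.
// Assumes the GCC/Clang x86 builtins; it will not compile on other compilers or targets.
std::string widest_simd_level() {
    __builtin_cpu_init();  // initialize the CPU feature data used by the checks below
    if (__builtin_cpu_supports("avx512f")) return "AVX-512";
    if (__builtin_cpu_supports("avx2"))    return "AVX2";
    if (__builtin_cpu_supports("avx"))     return "AVX";
    if (__builtin_cpu_supports("sse2"))    return "SSE2";
    return "scalar";
}
```

A caller could use the returned label to dispatch to an SSE or AVX version of the same routine, falling back to scalar code when neither is reported.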

1-1. Scalar vs SIMD

Instead of processing elements one by one:

  
for (int i = 0; i < 4; i++)
    C[i] = A[i] + B[i];

SIMD processes multiple elements in one instruction:

$[A0\;A1\;A2\;A3] + [B0\;B1\;B2\;B3] = [C0\;C1\;C2\;C3]$

  
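#include <emmintrin.h>  // SSE2 integer intrinsics

// arrI32A and arrI32B are int arrays with at least 4 elements each.
__m128i m128A = _mm_loadu_si128((const __m128i*)arrI32A);  // load A0..A3
__m128i m128B = _mm_loadu_si128((const __m128i*)arrI32B);  // load B0..B3
__m128i m128C = _mm_add_epi32(m128A, m128B);               // four 32-bit adds in one instruction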

1-2. SIMD = Dedicated Hardware Path

SIMD works because the CPU has dedicated hardware:

Vector registers
Vector execution units
SIMD instruction decoding
Packed arithmetic datapaths

Scalar path:

ALU / FPU
One element per instruction

SIMD path:

Vector Unit
Multiple elements per instruction

1-3. Register Width

SSE → 128-bit
AVX → 256-bit
AVX-512 → 512-bit

Example (32-bit floats per register):

SSE → 4 floats
AVX → 8 floats
AVX-512 → 16 floats
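The lane counts above follow directly from dividing the register width by the element size. As a small sketch of that arithmetic (the `lanes` helper is a hypothetical name introduced here, not part of any intrinsics header):

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in a vector register of the given width in bits.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(lanes<float>(128) == 4,  "SSE: 4 floats");
static_assert(lanes<float>(256) == 8,  "AVX: 8 floats");
static_assert(lanes<float>(512) == 16, "AVX-512: 16 floats");
static_assert(lanes<double>(256) == 4, "AVX: 4 doubles");
static_assert(lanes<std::int8_t>(128) == 16, "128-bit register: 16 bytes");
```

The same register therefore holds more lanes when the element type is narrower, which is one reason smaller data types can give larger SIMD gains.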

1-4. Instruction Latency

Not all SIMD instructions are equal.

add → cheap
multiply → moderate
divide → expensive

Optimization means:

avoid expensive instructions
replace division with multiplication by a reciprocal when possible
check instruction latency

Reference for per-instruction latency and throughput: Intel Intrinsics Guide, intel.com/content/www/us/en/docs/intrinsics-guide/index.html
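As an illustration of replacing division with multiplication, the sketch below divides every element by the same value by computing the reciprocal once and multiplying inside the loop. It uses SSE so that it compiles on x86-64 without extra flags; `scale_by_reciprocal` is a hypothetical name. Multiplying by a reciprocal can differ from true division in the last bit, so this assumes that small rounding difference is acceptable.

```cpp
#include <xmmintrin.h>  // SSE
#include <cstddef>

// Divides each element by `divisor` using one scalar division plus vector multiplies.
void scale_by_reciprocal(float* data, std::size_t n, float divisor) {
    const float inv = 1.0f / divisor;        // the only division, hoisted out of the loop
    const __m128 vinv = _mm_set1_ps(inv);    // broadcast the reciprocal to all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vinv));
    }
    for (; i < n; ++i)                       // scalar tail for the leftover elements
        data[i] *= inv;
}
```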

2. How to use SIMD?

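The usual pattern is a vector loop that handles 8 floats per iteration, followed by a scalar tail loop for the remaining elements:

#include <immintrin.h>

int i = 0;

// main loop: 8 floats per iteration (AVX, 256-bit)
for (; i + 8 <= N; i += 8) {
    __m256 a = _mm256_loadu_ps(A + i);
    __m256 b = _mm256_loadu_ps(B + i);
    __m256 c = _mm256_add_ps(a, b);
    _mm256_storeu_ps(C + i, c);
}

// tail: the remaining N % 8 elements
for (; i < N; i++)
    C[i] = A[i] + B[i];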

2-1. Core Optimization Factors

Data structure design
Instruction latency
Compiler behavior
Memory alignment
Cache locality
Memory bandwidth

2-2. Data Layout

Bad (AoS):

  
struct Pixel 
{ 
    float R; 
    float G; 
    float B; 
};

Good (SoA):

  
float R[N], G[N], B[N];

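SIMD prefers contiguous data. With SoA, consecutive R values sit next to each other in memory and fill a vector register with a single load. With AoS, R, G, and B are interleaved, so the same load needs extra shuffles or gathers.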
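To make the SoA advantage concrete, here is a minimal sketch that combines three separate channel arrays into one output array with SSE. The function name `weighted_sum_soa` and the channel weights are assumptions chosen for illustration; SSE is used so the code compiles on x86-64 without extra flags.

```cpp
#include <xmmintrin.h>  // SSE
#include <cstddef>

// Computes y[i] = wr*r[i] + wg*g[i] + wb*b[i] over separate (SoA) channel arrays.
void weighted_sum_soa(const float* r, const float* g, const float* b,
                      float* y, std::size_t n) {
    // Illustrative channel weights, assumed for this sketch.
    const float wr = 0.299f, wg = 0.587f, wb = 0.114f;
    const __m128 vwr = _mm_set1_ps(wr);
    const __m128 vwg = _mm_set1_ps(wg);
    const __m128 vwb = _mm_set1_ps(wb);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vr = _mm_loadu_ps(r + i);  // 4 consecutive R values in one load
        __m128 vg = _mm_loadu_ps(g + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 acc = _mm_mul_ps(vr, vwr);
        acc = _mm_add_ps(acc, _mm_mul_ps(vg, vwg));
        acc = _mm_add_ps(acc, _mm_mul_ps(vb, vwb));
        _mm_storeu_ps(y + i, acc);
    }
    for (; i < n; ++i)                    // scalar tail
        y[i] = wr * r[i] + wg * g[i] + wb * b[i];
}
```

With an AoS `Pixel` array the loads of `vr`, `vg`, and `vb` would each have to pick every third float, which is why the layout matters before any intrinsic is written.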

2-3. Memory Alignment

  
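_mm256_load_ps   // aligned load: the address must be 32-byte aligned
_mm256_loadu_ps  // unaligned load: works with any address

float* data = (float*)_mm_malloc(sizeof(float) * N, 32);  // 32-byte aligned; release with _mm_free(data)

Aligned data avoids loads that straddle a cache-line boundary, which is where unaligned access costs the most. Note that _mm256_load_ps can fault if the address is not 32-byte aligned.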
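The allocate, load, free cycle can be sketched end to end as follows. To keep the example compilable on x86-64 without extra flags, it uses the 128-bit `_mm_load_ps` (which needs 16-byte alignment) on a buffer allocated with 32-byte alignment; the same pattern would apply to `_mm256_load_ps`. The names `is_aligned` and `sum_aligned` are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE, also provides _mm_malloc / _mm_free
#include <cstddef>
#include <cstdint>

// True if the pointer address is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Fills an aligned buffer with 0, 1, ..., n-1 and returns the sum, using aligned loads.
float sum_aligned(std::size_t n) {
    float* data = static_cast<float*>(_mm_malloc(sizeof(float) * n, 32));
    if (data == nullptr) return 0.0f;
    for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<float>(i);

    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));  // data + i stays 16-byte aligned

    float lanes[4];
    _mm_storeu_ps(lanes, acc);                         // spill the 4 partial sums
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) total += data[i];               // scalar tail

    _mm_free(data);                                    // memory from _mm_malloc needs _mm_free
    return total;
}
```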

2-4. Memory is Often the Bottleneck

SIMD improves compute throughput, but performance depends on:

cache locality
memory bandwidth
access pattern

If memory is slow, SIMD gains are limited.

3. What are the challenges of SIMD?

3-1. Branching Problem

SIMD is weak when the code has:

many branches
unpredictable logic
irregular flow

Best case:

same operation
no branching
continuous data
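A simple per-element condition can often be turned into the branch-free form that SIMD prefers: compute a comparison mask for all lanes, then use bitwise operations to pick between two candidates. The sketch below shows this with SSE; `select_above` is a hypothetical name and the threshold rule is only an example.

```cpp
#include <xmmintrin.h>  // SSE
#include <cstddef>

// out[i] = (a[i] > threshold) ? a[i] : fallback[i], without a branch per element.
void select_above(const float* a, const float* fallback, float* out,
                  std::size_t n, float threshold) {
    const __m128 vt = _mm_set1_ps(threshold);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vf = _mm_loadu_ps(fallback + i);
        __m128 mask = _mm_cmpgt_ps(va, vt);  // all bits set in lanes where a > threshold
        // Keep a where the mask is set, fallback where it is clear.
        __m128 picked = _mm_or_ps(_mm_and_ps(mask, va), _mm_andnot_ps(mask, vf));
        _mm_storeu_ps(out + i, picked);
    }
    for (; i < n; ++i)                       // scalar tail
        out[i] = (a[i] > threshold) ? a[i] : fallback[i];
}
```

Both candidates are computed for every lane, so this only pays off when each side of the condition is cheap; heavily divergent logic remains a poor fit.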

3-2. When SIMD Works Well

Image processing
Signal processing
Matrix operations
Pixel-wise transforms

3-3. When SIMD is Hard

Branch-heavy logic
Random memory access
Small data
Data dependency

3-4. Important Insight

SIMD does not guarantee speedup.

Real performance depends on:

memory
cache
algorithm design
compiler output


This post is licensed under CC BY 4.0 by the author.