SIMD (Single Instruction Multiple Data)

Prerequisites

  • C++
  • Basic CPU architecture
  • Arrays / memory layout
  • Basic understanding of performance optimization

1. What is SIMD?

SIMD stands for Single Instruction, Multiple Data.

It is a hardware-level technique that allows a single instruction to process multiple data elements at the same time.

SIMD is not just parallelism.
It is supported by a dedicated execution path inside the CPU, through vector execution units and vector registers.

A CPU does not only have a general-purpose ALU.
Modern CPUs also include vector units specifically designed for SIMD operations. Which SIMD extensions are available depends on the CPU: on one production line I worked on, the old CPUs in the Windows 7 machines could not run AVX code, only SSE.

1-1. Scalar vs SIMD

Instead of processing elements one by one:

for (int i = 0; i < 4; i++)
    C[i] = A[i] + B[i];

SIMD processes multiple elements in one instruction:

$[A0\;A1\;A2\;A3] + [B0\;B1\;B2\;B3] = [C0\;C1\;C2\;C3]$

#include <emmintrin.h>

__m128i m128A = _mm_loadu_si128((const __m128i*)arrI32A);
__m128i m128B = _mm_loadu_si128((const __m128i*)arrI32B);
__m128i m128C = _mm_add_epi32(m128A, m128B);
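
As a minimal, self-contained sketch of the two versions above: the function names `add4_scalar` and `add4_sse2` are illustrative, not from any library, and a store is added so the SIMD result is written back to memory.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstdint>

// Scalar reference: one addition per loop iteration.
void add4_scalar(const std::int32_t* a, const std::int32_t* b, std::int32_t* c) {
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}

// SSE2 version: four 32-bit additions in a single instruction.
void add4_sse2(const std::int32_t* a, const std::int32_t* b, std::int32_t* c) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi32(va, vb);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(c), vc);  // write 4 results back
}
```

Both functions should produce the same four sums; only the number of instructions needed differs.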

1-2. SIMD = Dedicated Hardware Path

SIMD works because the CPU has dedicated hardware:

  • Vector registers
  • Vector execution units
  • SIMD instruction decoding
  • Packed arithmetic datapaths

Scalar path:

  • ALU / FPU
  • One element per instruction

SIMD path:

  • Vector Unit
  • Multiple elements per instruction

1-3. Register Width

SSE → 128-bit
AVX → 256-bit
AVX-512 → 512-bit

Example:

  • SSE → 4 floats
  • AVX → 8 floats
  • AVX-512 → 16 floats
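
The element counts above are just the register width divided by the element size. A small sketch of that arithmetic (the helper `lanes` is a hypothetical name used only for illustration):

```cpp
#include <cstddef>

// Number of elements ("lanes") that fit in one vector register.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bytes) {
    return register_bits / (element_bytes * 8);
}

static_assert(lanes(128, sizeof(float)) == 4,  "SSE: 4 floats");
static_assert(lanes(256, sizeof(float)) == 8,  "AVX: 8 floats");
static_assert(lanes(512, sizeof(float)) == 16, "AVX-512: 16 floats");
static_assert(lanes(256, sizeof(double)) == 4, "AVX: 4 doubles");
```

The same register therefore holds more elements when the element type is smaller, for example sixteen 8-bit values in a 128-bit register.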

1-4. Instruction Latency

Not all SIMD instructions are equal.

  • add → cheap
  • multiply → moderate
  • divide → expensive

Optimization means:

  • avoid expensive instructions
  • replace division with multiplication when possible
  • check instruction latency
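
One way the "replace division with multiplication" advice might look in practice, assuming every element is divided by the same value. The function names `scale_div` and `scale_mul` are illustrative, and the 128-bit SSE intrinsics are used so the sketch runs on any x86-64 CPU.

```cpp
#include <xmmintrin.h>  // SSE float intrinsics

// Baseline: one packed division for every 4 elements.
void scale_div(float* data, int n, float divisor) {
    const __m128 d = _mm_set1_ps(divisor);
    int i = 0;
    for (; i <= n - 4; i += 4)
        _mm_storeu_ps(data + i, _mm_div_ps(_mm_loadu_ps(data + i), d));
    for (; i < n; i++)  // tail
        data[i] /= divisor;
}

// Same job, but the division happens once, outside the loop;
// the loop body only multiplies by the precomputed reciprocal.
// Note: x * (1 / d) can differ from x / d in the last bit for some inputs.
void scale_mul(float* data, int n, float divisor) {
    const float inv = 1.0f / divisor;
    const __m128 vinv = _mm_set1_ps(inv);
    int i = 0;
    for (; i <= n - 4; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), vinv));
    for (; i < n; i++)  // tail
        data[i] *= inv;
}
```

Whether the rounding difference is acceptable depends on the application, so this substitution is a judgment call rather than a free win.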

Intel Intrinsics Guide: intel.com/content/www/us/en/docs/intrinsics-guide/index.html

2. How to use SIMD?

#include <immintrin.h>

int i = 0;

// main loop: 8 floats per iteration
for (; i <= N - 8; i += 8) {
    __m256 a = _mm256_loadu_ps(A + i);
    __m256 b = _mm256_loadu_ps(B + i);
    __m256 c = _mm256_add_ps(a, b);
    _mm256_storeu_ps(C + i, c);
}

// tail: remaining elements, one at a time
for (; i < N; i++)
    C[i] = A[i] + B[i];
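
Since older CPUs may lack AVX, a complete version of this loop could check for support at run time and fall back to scalar code. The sketch below assumes GCC or Clang on x86: `__attribute__((target("avx")))` and `__builtin_cpu_supports` are compiler-specific, and the function names are illustrative.

```cpp
#include <immintrin.h>

// AVX path: 8 floats per iteration, then a scalar tail.
// target("avx") lets GCC/Clang build this one function with AVX enabled
// even when the rest of the file is compiled without -mavx.
__attribute__((target("avx")))
static void add_avx(const float* A, const float* B, float* C, int N) {
    int i = 0;
    for (; i <= N - 8; i += 8) {
        __m256 a = _mm256_loadu_ps(A + i);
        __m256 b = _mm256_loadu_ps(B + i);
        _mm256_storeu_ps(C + i, _mm256_add_ps(a, b));
    }
    for (; i < N; i++)  // tail
        C[i] = A[i] + B[i];
}

// Plain fallback for CPUs without AVX.
static void add_scalar(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

// Takes the AVX path only if the running CPU reports AVX support.
void add_arrays(const float* A, const float* B, float* C, int N) {
    if (__builtin_cpu_supports("avx"))
        add_avx(A, B, C, N);
    else
        add_scalar(A, B, C, N);
}
```

Calling `add_arrays` with an `N` that is not a multiple of 8 exercises both the vector loop and the tail.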

2-1. Core Optimization Factors

  1. Data structure design
  2. Instruction latency
  3. Compiler behavior
  4. Memory alignment
  5. Cache locality
  6. Memory bandwidth

2-2. Data Layout

Bad (AoS):
struct Pixel
{
    float R;
    float G;
    float B;
};

Good (SoA):

float R[N], G[N], B[N];

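SIMD prefers contiguous data. With SoA, one vector load fetches several consecutive R values; with AoS, R, G, and B are interleaved in memory, so the same values have to be gathered or shuffled into place first.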
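
A minimal sketch of why the SoA layout maps directly onto vector loads. The `ImageSoA` struct and `channel_sum` function are hypothetical names, and all three channels are assumed to have the same length.

```cpp
#include <xmmintrin.h>  // SSE float intrinsics
#include <cstddef>
#include <vector>

// SoA layout: each channel is its own contiguous array.
// Assumption: R, G and B all hold the same number of elements.
struct ImageSoA {
    std::vector<float> R, G, B;
};

// Per-pixel sum of the three channels: out[i] = R[i] + G[i] + B[i].
// Each load pulls 4 consecutive values of one channel, no shuffling needed.
std::vector<float> channel_sum(const ImageSoA& img) {
    const std::size_t n = img.R.size();
    std::vector<float> out(n);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 r = _mm_loadu_ps(img.R.data() + i);
        __m128 g = _mm_loadu_ps(img.G.data() + i);
        __m128 b = _mm_loadu_ps(img.B.data() + i);
        _mm_storeu_ps(out.data() + i, _mm_add_ps(_mm_add_ps(r, g), b));
    }
    for (; i < n; i++)  // tail
        out[i] = img.R[i] + img.G[i] + img.B[i];
    return out;
}
```

With the AoS `Pixel` struct, the same loop would first have to separate the interleaved R, G, B values before any packed arithmetic could be applied.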

2-3. Memory Alignment

_mm256_load_ps  // aligned
_mm256_loadu_ps // unaligned

float* data = (float*)_mm_malloc(sizeof(float) * N, 32);

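Alignment can improve performance: a 32-byte-aligned 256-bit vector never straddles a cache-line boundary. _mm256_load_ps requires a 32-byte-aligned address, and memory obtained from _mm_malloc must be released with _mm_free.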
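
A small sketch of how an aligned buffer might be managed safely in C++. The helper names `is_aligned`, `MmFree`, `AlignedFloats`, and `make_aligned_floats` are illustrative and not part of any library; only `_mm_malloc` and `_mm_free` come from the intrinsics headers.

```cpp
#include <immintrin.h>  // also provides _mm_malloc / _mm_free
#include <cstddef>
#include <cstdint>
#include <memory>

// True if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Deleter so memory from _mm_malloc is always released with _mm_free.
struct MmFree {
    void operator()(float* p) const { _mm_free(p); }
};
using AlignedFloats = std::unique_ptr<float[], MmFree>;

// Allocates n floats on a 32-byte boundary, as _mm256_load_ps requires.
// The returned pointer is null if the allocation fails.
inline AlignedFloats make_aligned_floats(std::size_t n) {
    return AlignedFloats(static_cast<float*>(_mm_malloc(sizeof(float) * n, 32)));
}
```

Wrapping the pointer in `std::unique_ptr` is a design choice to avoid mixing up `_mm_free` with `free` or `delete[]`; a plain `_mm_malloc` / `_mm_free` pair works as well.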

2-4. Memory is Often the Bottleneck

SIMD improves compute throughput, but performance depends on:

  • cache locality
  • memory bandwidth
  • access pattern

If memory is slow, SIMD gains are limited.
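
As a rough worked example (the 20 GB/s memory bandwidth is an assumed round number, not a measurement): the float loop C[i] = A[i] + B[i] reads 8 bytes and writes 4 bytes for every addition, so its arithmetic intensity $I$ and the resulting throughput ceiling $P_{\max}$ are approximately

$$
I = \frac{1\ \text{FLOP}}{12\ \text{bytes}} \approx 0.083\ \text{FLOP/byte},
\qquad
P_{\max} \approx I \times BW = 0.083 \times 20\ \text{GB/s} \approx 1.7\ \text{GFLOP/s}
$$

Once the arrays no longer fit in cache, this ceiling applies whether one add instruction handles 4, 8, or 16 floats.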

3. What are the challenges of SIMD?

3-1. Branching Problem

SIMD is weak when the code has:

  • many branches
  • unpredictable logic
  • irregular flow

Best case:

  • same operation
  • no branching
  • contiguous data
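
Simple per-element branches can sometimes be turned into the "no branching" case with a compare-and-mask pattern. The sketch below is one such case, with illustrative function names; it does not cover complex or irregular control flow.

```cpp
#include <xmmintrin.h>  // SSE float intrinsics

// Scalar version: one conditional per element.
void keep_above_scalar(const float* x, float* out, int n, float threshold) {
    for (int i = 0; i < n; i++)
        out[i] = (x[i] > threshold) ? x[i] : 0.0f;
}

// SSE version: the comparison yields an all-ones or all-zeros mask per lane,
// and a bitwise AND keeps or clears each value. The loop body has no branch.
void keep_above_sse(const float* x, float* out, int n, float threshold) {
    const __m128 t = _mm_set1_ps(threshold);
    int i = 0;
    for (; i <= n - 4; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmpgt_ps(v, t);             // lane-wise x > threshold
        _mm_storeu_ps(out + i, _mm_and_ps(v, mask));  // x where true, 0 elsewhere
    }
    for (; i < n; i++)  // tail
        out[i] = (x[i] > threshold) ? x[i] : 0.0f;
}
```

Both sides of the condition are effectively computed for every lane, so this pays off mainly when each side is cheap.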

3-2. When SIMD Works Well

  • Image processing
  • Signal processing
  • Matrix operations
  • Pixel-wise transforms

3-3. When SIMD is Hard

  • Branch-heavy logic
  • Random memory access
  • Small data
  • Data dependency
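
To illustrate the data dependency point, the sketch below contrasts a loop whose iterations are independent with one that carries a value from one iteration to the next. Both functions are illustrative scalar code.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i], so several
// elements can be computed at the same time.
std::vector<float> doubled(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = 2.0f * in[i];
    return out;
}

// Loop-carried dependency: each result needs the previous running total,
// so iteration i cannot start before iteration i - 1 has finished.
// A straightforward "4 elements at a time" rewrite does not apply here.
std::vector<float> prefix_sum(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < in.size(); i++) {
        running += in[i];
        out[i] = running;
    }
    return out;
}
```

Loops like `prefix_sum` usually need a restructured algorithm before SIMD helps, which is part of what makes them hard.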

3-4. Important Insight

SIMD does not guarantee speedup.

Real performance depends on:

  • memory
  • cache
  • algorithm design
  • compiler output
This post is licensed under CC BY 4.0 by the author.