QA - Section 1.05. Concurrency

Posted Apr 17, 2026



1. `std::thread` vs `std::async`

std::thread is a low-level abstraction that directly creates and manages a thread. You must call join() or detach() before the std::thread object is destroyed; if a still-joinable thread is destroyed, its destructor calls std::terminate().

std::async is a higher-level abstraction: it runs a callable (on a new thread or lazily, depending on the launch policy) and delivers the return value or exception through a std::future.

Example

  
#include <thread>
#include <future>
#include <iostream>

int work() { return 42; }

int main() 
{
    std::thread t([] { std::cout << "thread\n"; });
    t.join();

    auto f = std::async(std::launch::async, work);
    std::cout << f.get() << std::endl;
}

2. What happens if you don’t store the `future`?

If you do not store the returned std::future, the temporary future is destroyed immediately.

  
std::async(std::launch::async, [] {
    // work
});

Problem:

- For a future returned by std::async with std::launch::async, the destructor blocks until the task finishes
- Because the temporary is destroyed at the end of the statement, the call unintentionally becomes synchronous
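
The blocking destructor can be observed directly. The following is a minimal sketch, not a definitive test: the helper name `discarded_future_blocks` and the 50 ms sleep are assumptions made for illustration. The task sets a flag when it finishes, and the flag is checked right after the statement that discards the future.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical helper: returns true if the task had already finished by the
// time the statement that discarded the future completed.
bool discarded_future_blocks()
{
    std::atomic<bool> done{false};

    // The returned future is a temporary. Its destructor runs at the end of
    // this statement and waits for the task to finish. The cast to void only
    // silences the C++20 [[nodiscard]] warning; it does not keep the future.
    static_cast<void>(std::async(std::launch::async, [&done] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        done.store(true);
    }));

    return done.load(); // true: the statement above behaved synchronously
}
```

Storing the result (`auto f = std::async(...)`) moves the wait to the point where `f` goes out of scope or `f.get()` is called.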

3. `std::launch::async` vs `std::launch::deferred`

std::launch::async
- Runs on a separate thread immediately
std::launch::deferred
- Delays execution until .get() or .wait() is called
- Runs in the calling thread
- Not parallel

Example

  
auto f1 = std::async(std::launch::async, work);
auto f2 = std::async(std::launch::deferred, work);

f1.get(); // work() is already running on another thread; get() waits for the result
f2.get(); // work() runs now, on this thread (lazy)
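
One way to check which thread a task runs on is to compare thread ids. This is a small sketch; `runs_on_calling_thread` is a hypothetical helper name, not a standard function.

```cpp
#include <future>
#include <thread>

// Hypothetical helper: reports whether a task launched with the given policy
// ran on the thread that called this function.
bool runs_on_calling_thread(std::launch policy)
{
    const std::thread::id caller = std::this_thread::get_id();

    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });

    // For deferred, the lambda executes inside get(), on this thread.
    // For async, it executes on a separate thread.
    return f.get() == caller;
}
```

Calling it with `std::launch::deferred` should return true, and with `std::launch::async` false.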

4. `join()` vs `detach()`

join()
- Blocks until thread finishes
- Ensures safe cleanup
detach()
- Runs independently
- The std::thread object can no longer be used to wait for the thread or retrieve a result

Example

  
std::thread t([] { /* work */ });
t.join();   // wait

std::thread t2([] { /* work */ });
t2.detach(); // fire-and-forget
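
Forgetting join() is easy when a function can exit early or throw. A common remedy is a small RAII wrapper that joins in its destructor. The sketch below assumes a hypothetical class name `JoinGuard`; C++20's std::jthread provides this behavior out of the box.

```cpp
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: joins the thread in its destructor, so a
// joinable std::thread is never destroyed (which would call std::terminate).
class JoinGuard
{
public:
    explicit JoinGuard(std::thread t) : t_(std::move(t)) {}

    ~JoinGuard()
    {
        if (t_.joinable())
            t_.join();
    }

    JoinGuard(const JoinGuard&) = delete;
    JoinGuard& operator=(const JoinGuard&) = delete;

private:
    std::thread t_;
};

// Returns the value written by the worker thread.
int run_guarded()
{
    int result = 0;
    {
        JoinGuard guard(std::thread([&result] { result = 42; }));
    } // guard joins here, so the worker's write is visible below
    return result;
}
```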

5. Why do we need a Thread Pool?

Creating threads repeatedly is expensive. A thread pool reuses a fixed number of worker threads.

Benefits:

Reduces thread creation overhead
Controls concurrency level
Improves performance stability

// Instead of creating threads repeatedly,
// reuse worker threads for tasks
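
The idea can be sketched with a task queue guarded by a mutex and a condition variable. This is a deliberately minimal illustration, not a production design: the names `ThreadPool`, `submit`, and `count_tasks` are hypothetical, tasks return nothing, and exceptions thrown by tasks are not handled.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal illustrative pool: a fixed set of workers pulls tasks from a queue.
class ThreadPool
{
public:
    explicit ThreadPool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_)
            w.join(); // workers drain the queue before exiting
    }

    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void worker_loop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // stopping, and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Usage sketch: 100 small tasks executed by 4 reused worker threads.
int count_tasks()
{
    std::atomic<int> counter{0};
    {
        ThreadPool pool(4);
        for (int i = 0; i < 100; ++i)
            pool.submit([&counter] { ++counter; });
    } // the destructor waits for all queued tasks
    return counter.load();
}
```

Only four threads are ever created here, regardless of how many tasks are submitted.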

6. `std::promise`, `std::future`, `std::packaged_task`

std::future
- Retrieves result
std::promise
- Sets result manually
std::packaged_task
- Wraps a callable and produces a future

Example

  
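#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> p;
    std::future<int> f = p.get_future();

    // The worker fulfills the promise; p outlives it because of the join below
    std::thread t([&p] {
        p.set_value(42);
    });

    std::cout << f.get() << std::endl; // blocks until set_value(); prints 42
    t.join();
}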

  
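#include <future>
#include <iostream>
#include <thread>
#include <utility>

int work()
{
    return 42;
}

int main()
{
    std::packaged_task<int()> task(work);
    std::future<int> f = task.get_future(); // obtain the future before moving the task

    // packaged_task is move-only, so it is moved into the thread
    std::thread t(std::move(task));

    std::cout << f.get() << std::endl; // 42

    t.join();
}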

A std::future is used to retrieve the result of an asynchronous operation. It blocks when calling get() until the value is available.

A std::promise is used to set a value that will be retrieved by a std::future, allowing one thread to pass a result to another.

A std::packaged_task wraps a callable and automatically stores its result in a std::future when executed.

7. Atomic Memory Order and `memory_order_relaxed`, `acquire`, `release`

Atomic operations can control memory visibility and ordering between threads.

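relaxed
- No ordering guarantees
- Only atomicity
release
- Used on a store: writes made before it become visible to a thread that acquire-loads the stored value
acquire
- Used on a load: reads and writes after it cannot be reordered before it, so they see everything published by the matching release store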

Example

  
std::atomic<bool> ready = false;
int data = 0;

// producer
data = 42;
ready.store(true, std::memory_order_release);

// consumer
if (ready.load(std::memory_order_acquire))
    std::cout << data << std::endl; // prints 42: once the flag is seen, the write to data is visible

Within a single thread, operations appear in order, but across multiple threads, the visibility of those operations is not guaranteed to follow that same order. Due to caching, buffering, and reordering, another thread may observe updates in a different sequence than they were originally executed.
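
A runnable version of the release/acquire pattern might look like the following sketch. The helper name `publish_and_read` is an assumption, and the consumer spins on the flag only to keep the example short.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: publishes a plain int through a release store and
// reads it after an acquire load has observed the flag.
int publish_and_read()
{
    std::atomic<bool> ready{false};
    int data = 0;  // non-atomic payload
    int seen = -1;

    std::thread consumer([&] {
        // Spin until the flag is observed. The acquire load that reads true
        // synchronizes with the release store in the producer.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = data; // safe: ordered after the producer's write to data
    });

    std::thread producer([&] {
        data = 42;
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both operations used memory_order_relaxed instead, the read of `data` would be a data race, because nothing would order it after the producer's write.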

8. When to use `thread_local`

Use thread_local when each thread needs its own independent copy of data.

Example

  
thread_local int local_counter = 0;

void work() 
{
    local_counter++;
}

👉 No synchronization is needed, because each thread reads and writes only its own copy
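
The independence of the copies can be checked with a short sketch. The names `tls_counter`, `bump`, and `copies_are_independent` are hypothetical and chosen only for this illustration.

```cpp
#include <thread>

thread_local int tls_counter = 0; // one instance per thread

// Increments the calling thread's own copy n times and returns it.
int bump(int n)
{
    for (int i = 0; i < n; ++i)
        ++tls_counter;
    return tls_counter;
}

// Runs bump() on two worker threads and reports whether each thread
// counted only its own increments.
bool copies_are_independent()
{
    const int before = tls_counter; // this thread's copy
    int a = 0;
    int b = 0;

    std::thread t1([&a] { a = bump(1000); });
    std::thread t2([&b] { b = bump(2000); });
    t1.join();
    t2.join();

    // Each worker started from 0, and this thread's copy was not touched.
    return a == 1000 && b == 2000 && tls_counter == before;
}
```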

9. Parallel STL (`std::execution::par`)

C++17 provides parallel algorithms.

Example

  
#include <execution>
#include <vector>
#include <algorithm>

std::vector<int> v(1000000);

std::for_each(std::execution::par, v.begin(), v.end(), [](int& x) {
    x *= 2;
});

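The std::execution::par policy permits the library to run the algorithm on multiple threads; the implementation decides whether and how to parallelize. The callable must not introduce data races.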

10. OpenMP Experience

OpenMP is a directive-based parallel programming API implemented by the compiler (enabled with a flag such as -fopenmp on GCC and Clang).

Example

  
#include <omp.h>

int main()
{
    // Iterations are divided among the threads of the team
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
    {
        // parallel loop
    }
}

A simple way to parallelize loops whose iterations are independent
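
When the loop accumulates into a shared variable, the iterations are no longer independent and a plain `parallel for` would race. A reduction clause is the usual fix. The sketch below uses a hypothetical function name `sum_to`; without OpenMP enabled the pragma is ignored and the loop simply runs serially with the same result.

```cpp
// Sums 0..n-1. With OpenMP enabled, each thread keeps a private partial sum,
// and the partial sums are combined when the loop ends.
long long sum_to(int n)
{
    long long sum = 0;

    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += i;

    return sum;
}
```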

11. SIMD + Multithreading Considerations

SIMD and multithreading can be combined, but care is needed.

Issues:

Memory bandwidth bottleneck
Cache contention
False sharing

Example Idea

  
#include <vector>
#include <thread>
#include <immintrin.h> // AVX

void worker(float* data, size_t start, size_t end)
{
    size_t i = start;
    const __m256 two = _mm256_set1_ps(2.0f); // broadcast once, outside the loop

    // SIMD (8 floats at a time with AVX)
    for (; i + 8 <= end; i += 8)
    {
        __m256 v = _mm256_loadu_ps(&data[i]);
        v = _mm256_mul_ps(v, two);
        _mm256_storeu_ps(&data[i], v);
    }

    // remainder (scalar)
    for (; i < end; ++i)
    {
        data[i] *= 2.0f;
    }
}

int main()
{
    const size_t N = 1000000;
    std::vector<float> data(N, 1.0f);

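    // hardware_concurrency() may return 0 when the count cannot be determined
    unsigned hw = std::thread::hardware_concurrency();
    int num_threads = hw > 0 ? static_cast<int>(hw) : 1;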
    std::vector<std::thread> threads;

    size_t chunk = N / num_threads;

    for (int t = 0; t < num_threads; ++t)
    {
        size_t start = t * chunk;
        size_t end = (t == num_threads - 1) ? N : start + chunk;

        threads.emplace_back(worker, data.data(), start, end);
    }

    for (auto& th : threads)
        th.join();
}

Memory Bandwidth Bottleneck

When multiple threads process large amounts of data in parallel, they may saturate the available memory bandwidth. Even if the CPU cores are capable of higher throughput, the performance becomes limited by how fast data can be loaded from memory, causing threads to stall while waiting for memory access.

Cache Contention

Cache contention occurs when multiple threads compete for the same cache resources, such as cache lines or shared cache levels (e.g., L3). This leads to frequent cache evictions and reloads, reducing cache efficiency and increasing memory access latency, which degrades overall performance.

False Sharing

False sharing happens when multiple threads modify different variables that reside on the same cache line. Even though the variables are independent, the cache coherence mechanism forces unnecessary invalidation and synchronization between cores, causing significant performance slowdown.
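
A common mitigation is to align per-thread data to the cache line size so that each thread's variable sits on its own line. The sketch below assumes 64-byte cache lines (typical on x86-64, but not universal) and C++17 for over-aligned allocation in std::vector; the names `kCacheLine`, `PaddedCounter`, and `count_in_parallel` are hypothetical.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Each counter occupies its own cache line, so threads that update
// different counters do not invalidate each other's lines.
struct alignas(kCacheLine) PaddedCounter
{
    long value = 0;
};

// Every thread increments only its own counter; returns the combined total.
long count_in_parallel(unsigned num_threads, long iterations)
{
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < num_threads; ++t)
    {
        threads.emplace_back([&counters, t, iterations] {
            for (long i = 0; i < iterations; ++i)
                ++counters[t].value; // touches only this thread's line
        });
    }

    for (auto& th : threads)
        th.join();

    long total = 0;
    for (const auto& c : counters)
        total += c.value;
    return total;
}
```

With a plain `std::vector<long>` the result would be the same, but adjacent counters would share a cache line and the threads would contend for it.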

This post is licensed under CC BY 4.0 by the author.
