QA - Section 1.05. Concurrency
1. std::thread vs std::async
std::thread is a low-level abstraction that directly creates and manages a thread. You must call join() or detach() before the std::thread object is destroyed; otherwise its destructor calls std::terminate().
std::async is a higher-level abstraction that can manage thread creation and result handling automatically using std::future.
Example
#include <thread>
#include <future>
#include <iostream>
int work() { return 42; }
int main()
{
    std::thread t([] { std::cout << "thread\n"; });
    t.join();
    auto f = std::async(std::launch::async, work);
    std::cout << f.get() << std::endl;
}
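To make the difference in result handling concrete, here is a minimal sketch that returns a value both ways. The names `square`, `via_thread`, and `via_async` are hypothetical and exist only for this illustration.

```cpp
#include <future>
#include <thread>

// Hypothetical workload used only for illustration.
int square(int x) { return x * x; }

// std::thread has no return channel: the result goes through a variable
// owned by the caller, and join() must be called before reading it.
int via_thread(int x)
{
    int result = 0;
    std::thread t([&result, x] { result = square(x); });
    t.join(); // after join() the write to result is visible here
    return result;
}

// std::async returns a std::future that carries the result.
int via_async(int x)
{
    std::future<int> f = std::async(std::launch::async, square, x);
    return f.get(); // blocks until the task has finished
}
```

Calling `via_thread(6)` and `via_async(7)` yields 36 and 49 respectively; the second version needs no shared variable and no explicit join.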
2. What happens if you don’t store the future?
If you do not store the returned std::future, the temporary future is destroyed immediately.
std::async(std::launch::async, [] {
    // work
});
Problem:
- The destructor of a std::future returned by std::async with std::launch::async blocks until the async task finishes
- This can unintentionally make your code synchronous
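The blocking destructor can be observed directly. The following is a small sketch, with hypothetical function names, that checks whether the task has already finished by the time the launching statement ends.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

// The future is a temporary, so its destructor runs at the end of the
// launching statement and waits for the task to finish.
bool discarded_future_blocks()
{
    std::atomic<bool> done{false};
    // The cast to void only silences [[nodiscard]] warnings (C++20).
    (void)std::async(std::launch::async, [&done] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        done = true;
    });
    return done.load(); // true: the statement above already waited
}

// Keeping the future alive lets the caller continue while the task runs.
bool stored_future_overlaps()
{
    std::atomic<bool> done{false};
    auto f = std::async(std::launch::async, [&done] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        done = true;
    });
    bool finished_early = done.load(); // usually false: the task is still sleeping
    f.get();                           // wait explicitly before done goes out of scope
    return finished_early;
}
```

`discarded_future_blocks()` always returns true. `stored_future_overlaps()` normally returns false, although that depends on scheduling and is not guaranteed.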
3. std::launch::async vs std::launch::deferred
- std::launch::async
  - Runs on a separate thread immediately
- std::launch::deferred
  - Delays execution until .get() or .wait() is called
  - Runs in the calling thread
  - Not parallel
Example
auto f1 = std::async(std::launch::async, work);
auto f2 = std::async(std::launch::deferred, work);
f1.get(); // work already started on another thread; get() waits for its result
f2.get(); // work runs here, in the calling thread (lazy)
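One way to check which thread actually runs the task is to compare thread ids. This is a minimal sketch with hypothetical function names.

```cpp
#include <future>
#include <thread>

// With std::launch::deferred the callable runs inside get(), on the caller's thread.
bool deferred_runs_on_caller()
{
    auto f = std::async(std::launch::deferred, [] { return std::this_thread::get_id(); });
    return f.get() == std::this_thread::get_id();
}

// With std::launch::async the callable runs as if on a new thread.
bool async_runs_on_other_thread()
{
    auto f = std::async(std::launch::async, [] { return std::this_thread::get_id(); });
    return f.get() != std::this_thread::get_id();
}
```

Both functions return true. When no policy is passed, std::async uses `std::launch::async | std::launch::deferred` and the implementation chooses between the two, so it is safer to state the policy explicitly when the difference matters.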
4. join() vs detach()
- join()
  - Blocks until the thread finishes
  - Ensures safe cleanup
- detach()
  - Runs independently
  - Cannot be joined afterwards; the std::thread object no longer offers a way to wait for it or retrieve a result
Example
std::thread t([] { /* work */ });
t.join(); // wait
std::thread t2([] { /* work */ });
t2.detach(); // fire-and-forget
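Because forgetting join() is easy when a function returns early or throws, a common pattern is to join from a destructor. The following is a sketch of that idea; `JoinGuard` and `run_guarded` are hypothetical names, not standard library types.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical RAII helper: joins the owned thread in its destructor.
class JoinGuard
{
public:
    explicit JoinGuard(std::thread t) : t_(std::move(t)) {}
    ~JoinGuard()
    {
        if (t_.joinable())
            t_.join();
    }
    JoinGuard(const JoinGuard&) = delete;
    JoinGuard& operator=(const JoinGuard&) = delete;

private:
    std::thread t_;
};

// Returns the counter value after the guarded thread has finished.
int run_guarded()
{
    std::atomic<int> counter{0};
    {
        JoinGuard guard(std::thread([&counter] { counter += 1; }));
        // An early return or exception here would still join the thread.
    } // guard's destructor joins here
    return counter.load();
}
```

`run_guarded()` returns 1. In C++20, std::jthread provides the same join-on-destruction behavior out of the box.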
5. Why do we need a Thread Pool?
Creating threads repeatedly is expensive. A thread pool reuses a fixed number of worker threads.
Benefits:
- Reduces thread creation overhead
- Controls concurrency level
- Improves performance stability
// Instead of creating threads repeatedly,
// reuse worker threads for tasks
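The idea can be sketched as a fixed set of workers pulling jobs from one shared queue. This is a deliberately minimal illustration, not a production design: the `ThreadPool` class, its `submit` method (restricted to callables returning int), and `sum_of_squares_on_pool` are all hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of workers sharing one task queue.
class ThreadPool
{
public:
    explicit ThreadPool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_)
            w.join();
    }

    // Queues a callable returning int and hands back a future for its result.
    std::future<int> submit(std::function<int()> fn)
    {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(fn));
        std::future<int> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // stop requested and queue drained
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job(); // run outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};

// Runs `count` small tasks on four reused workers and sums their results.
int sum_of_squares_on_pool(int count)
{
    ThreadPool pool(4);
    std::vector<std::future<int>> futures;
    for (int i = 1; i <= count; ++i)
        futures.push_back(pool.submit([i] { return i * i; }));
    int total = 0;
    for (auto& f : futures)
        total += f.get();
    return total;
}
```

`sum_of_squares_on_pool(10)` runs ten tasks on only four threads and returns 385. The number of threads stays fixed no matter how many tasks are submitted, which is what bounds the concurrency level.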
6. std::promise, std::future, std::packaged_task
- std::future
  - Retrieves result
- std::promise
  - Sets result manually
- std::packaged_task
  - Wraps a callable and produces a future
Example
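#include <future>
#include <iostream>
#include <thread>
int main()
{
    std::promise<int> p;
    std::future<int> f = p.get_future();
    // The worker fulfils the promise; get() below unblocks once the value is set.
    std::thread t([&p] {
        p.set_value(42);
    });
    std::cout << f.get() << std::endl; // 42
    t.join();
}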
#include <future>
#include <iostream>
#include <thread>
int work()
{
    return 42;
}
int main()
{
    std::packaged_task<int()> task(work);
    std::future<int> f = task.get_future();
    std::thread t(std::move(task)); // packaged_task is move-only
    std::cout << f.get() << std::endl; // 42
    t.join();
}
A std::future is used to retrieve the result of an asynchronous operation. It blocks when calling get() until the value is available.
A std::promise is used to set a value that will be retrieved by a std::future, allowing one thread to pass a result to another.
A std::packaged_task wraps a callable and automatically stores its result in a std::future when executed.
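A promise can pass an error to the other thread as well as a value. The sketch below, with a hypothetical function name and error message, stores an exception in the promise and lets get() rethrow it on the receiving side.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// Returns the message of the exception that crossed from the worker thread.
std::string error_through_future()
{
    std::promise<int> p;
    std::future<int> f = p.get_future();

    std::thread t([&p] {
        try
        {
            throw std::runtime_error("worker failed");
        }
        catch (...)
        {
            p.set_exception(std::current_exception()); // store it instead of a value
        }
    });

    std::string message;
    try
    {
        f.get(); // rethrows the stored exception in this thread
    }
    catch (const std::runtime_error& e)
    {
        message = e.what();
    }
    t.join();
    return message;
}
```

std::packaged_task and std::async do this capture automatically when the wrapped callable throws; with a bare std::promise it has to be done by hand as shown.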
7. Atomic Memory Order: memory_order_relaxed, acquire, release
Atomic operations can control memory visibility and ordering between threads.
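- relaxed
  - No ordering guarantees
  - Only atomicity
- release
  - Writes made before the release store become visible to a thread that observes that store with an acquire load
- acquire
  - Reads after the acquire load see the writes published by the matching release store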
Example
#include <atomic>
std::atomic<bool> ready{false};
int data = 0;
// producer thread
data = 42;
ready.store(true, std::memory_order_release);
// consumer thread
if (ready.load(std::memory_order_acquire))
    std::cout << data << std::endl; // prints 42 if the flag was seen as true
Within a single thread, operations appear in order, but across multiple threads, the visibility of those operations is not guaranteed to follow that same order. Due to caching, buffering, and reordering, another thread may observe updates in a different sequence than they were originally executed.
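The release/acquire pairing can be turned into a complete program as follows. This is a sketch with hypothetical function names; the second function shows a case where relaxed ordering is sufficient because only the final count matters.

```cpp
#include <atomic>
#include <thread>

// Runs a producer and a consumer and returns what the consumer read.
int publish_and_read()
{
    std::atomic<bool> ready{false};
    int data = 0; // plain, non-atomic payload
    int seen = -1;

    std::thread consumer([&] {
        // Spin until the flag is set; the acquire load pairs with the release store.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = data; // safe: ordered after the producer's write to data
    });

    std::thread producer([&] {
        data = 42;                                    // written before the flag
        ready.store(true, std::memory_order_release); // publishes data
    });

    producer.join();
    consumer.join();
    return seen;
}

// Relaxed is enough for a counter: increments are atomic, ordering is irrelevant.
int count_relaxed()
{
    std::atomic<int> hits{0};
    auto bump = [&hits] {
        for (int i = 0; i < 1000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    return hits.load();
}
```

`publish_and_read()` returns 42 and `count_relaxed()` returns 2000. If the flag in the first function used relaxed ordering instead, reading `data` would be a data race and the result would no longer be guaranteed.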
8. When to use thread_local
Use thread_local when each thread needs its own independent copy of data.
Example
thread_local int local_counter = 0;
void work()
{
    local_counter++;
}
No synchronization is needed, because each thread reads and writes only its own copy of local_counter.
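The per-thread copies can be verified with two threads that count independently. The names `tl_counter`, `bump`, and `counters_are_independent` in this sketch are hypothetical.

```cpp
#include <thread>

thread_local int tl_counter = 0; // one instance per thread, each starting at 0

// Increments the calling thread's own copy n times and returns it.
int bump(int n)
{
    for (int i = 0; i < n; ++i)
        ++tl_counter;
    return tl_counter;
}

// Returns true if two threads counted without affecting each other.
bool counters_are_independent()
{
    int a = 0, b = 0;
    std::thread t1([&a] { a = bump(3); });
    std::thread t2([&b] { b = bump(5); });
    t1.join();
    t2.join();
    // Each worker saw a fresh counter, so the results are exactly 3 and 5
    // regardless of how the two threads were scheduled.
    return a == 3 && b == 5;
}
```

With a plain global `int` the same code would have a data race and the two results would depend on interleaving.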
9. Parallel STL (std::execution::par)
C++17 provides parallel algorithms.
Example
#include <execution>
#include <vector>
#include <algorithm>
std::vector<int> v(1000000);
std::for_each(std::execution::par, v.begin(), v.end(), [](int& x) {
    x *= 2;
});
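With std::execution::par the library is allowed, but not required, to run the iterations in parallel. The callable must not introduce data races between elements. With GCC's libstdc++ the parallel policies are backed by Intel TBB, so the program typically has to link against it (-ltbb).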
10. OpenMP Experience
OpenMP is a compiler-based parallelism API.
Example
#include <omp.h>
#pragma omp parallel for
for (int i = 0; i < 1000; i++)
{
    // parallel loop
}
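This is a simple way to parallelize loops. The pragma only takes effect when OpenMP is enabled at compile time (for example -fopenmp with GCC or Clang); otherwise it is ignored and the loop runs serially.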
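When loop iterations accumulate into a shared variable, a reduction clause is the usual way to keep the loop correct. The following is a sketch with a hypothetical function name.

```cpp
#include <cstddef>
#include <vector>

// Sums a vector with an OpenMP reduction. Each thread accumulates a private
// partial sum, and OpenMP combines the partial sums at the end of the loop.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
long long parallel_sum(const std::vector<int>& v)
{
    long long sum = 0;
    const int n = static_cast<int>(v.size());
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
    {
        sum += v[static_cast<std::size_t>(i)];
    }
    return sum;
}
```

For a vector of 1000 elements all equal to 2, `parallel_sum` returns 2000 whether or not OpenMP is enabled. Without the reduction clause, the concurrent updates to `sum` would be a data race.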
11. SIMD + Multithreading Considerations
SIMD and multithreading can be combined, but care is needed.
Issues:
- Memory bandwidth bottleneck
- Cache contention
- False sharing
Example Idea
#include <vector>
#include <thread>
#include <immintrin.h> // AVX; compile with -mavx on GCC/Clang
void worker(float* data, size_t start, size_t end)
{
    size_t i = start;
    const __m256 two = _mm256_set1_ps(2.0f);
    // SIMD (8 floats at a time with AVX)
    for (; i + 8 <= end; i += 8)
    {
        __m256 v = _mm256_loadu_ps(&data[i]);
        v = _mm256_mul_ps(v, two);
        _mm256_storeu_ps(&data[i], v);
    }
    // remainder (scalar)
    for (; i < end; ++i)
    {
        data[i] *= 2.0f;
    }
}
int main()
{
    const size_t N = 1000000;
    std::vector<float> data(N, 1.0f);
    unsigned hw = std::thread::hardware_concurrency(); // may return 0 if unknown
    int num_threads = hw > 0 ? static_cast<int>(hw) : 1; // avoid dividing by zero below
    std::vector<std::thread> threads;
    size_t chunk = N / num_threads;
    // Each thread gets a contiguous, non-overlapping range; the last one takes the rest.
    for (int t = 0; t < num_threads; ++t)
    {
        size_t start = t * chunk;
        size_t end = (t == num_threads - 1) ? N : start + chunk;
        threads.emplace_back(worker, data.data(), start, end);
    }
    for (auto& th : threads)
        th.join();
}
Memory Bandwidth Bottleneck
When multiple threads process large amounts of data in parallel, they may saturate the available memory bandwidth. Even if the CPU cores are capable of higher throughput, the performance becomes limited by how fast data can be loaded from memory, causing threads to stall while waiting for memory access.
Cache Contention
Cache contention occurs when multiple threads compete for the same cache resources, such as cache lines or shared cache levels (e.g., L3). This leads to frequent cache evictions and reloads, reducing cache efficiency and increasing memory access latency, which degrades overall performance.
False Sharing
False sharing happens when multiple threads modify different variables that reside on the same cache line. Even though the variables are independent, the cache coherence mechanism forces unnecessary invalidation and synchronization between cores, causing significant performance slowdown.
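A common mitigation for false sharing is to place each thread's data on its own cache line. The sketch below assumes 64-byte cache lines, which is typical on x86-64 but not universal; `kCacheLine`, `PaddedCounter`, and `count_without_false_sharing` are hypothetical names.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. This value is not portable.
constexpr std::size_t kCacheLine = 64;

// alignas pads each counter so that no two counters share a cache line.
struct alignas(kCacheLine) PaddedCounter
{
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Every thread increments only its own slot; returns the total.
// Note: std::vector of an over-aligned type relies on C++17 aligned allocation.
long count_without_false_sharing(int num_threads, long per_thread)
{
    std::vector<PaddedCounter> counters(static_cast<std::size_t>(num_threads));
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&counters, t, per_thread] {
            for (long i = 0; i < per_thread; ++i)
                ++counters[static_cast<std::size_t>(t)].value;
        });
    for (auto& th : threads)
        th.join();

    long total = 0;
    for (const auto& c : counters)
        total += c.value;
    return total;
}
```

`count_without_false_sharing(4, 1000)` returns 4000. The result would be the same with unpadded counters; the padding only removes the coherence traffic between cores. C++17 also defines std::hardware_destructive_interference_size in `<new>` for this purpose, though not every standard library provides it.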