QA - Section 1.03. Parallel
QA - Section 1.03. Parallel
1. How do you decide what to parallelize?
You should parallelize parts of the code that are computationally expensive and independent. Typically, large loops or operations over big datasets are good candidates.
Example (Good Candidate)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <thread>
#include <vector>
void process(std::vector<int>& v, int start, int end)
{
for (int i = start; i < end; i++)
v[i] *= 2;
}
int main() {
std::vector<int> v(1000000, 1);
std::thread t1(process, std::ref(v), 0, 500000);
std::thread t2(process, std::ref(v), 500000, 1000000);
t1.join();
t2.join();
}
2. Why can parallelization make things slower?
If the workload is too small, the overhead of threading dominates.
Example (Too Fine-Grained)
1
2
3
4
5
6
for (int i = 0; i < 1000; i++)
{
std::thread([] {
// very small work
}).detach();
}
Thread creation cost > actual work
3. How do you decide granularity?
You must balance between too small and too large tasks.
Example (Better Granularity)
1
2
3
4
int chunk = v.size() / 4;
for (int i = 0; i < 4; i++)
threads.emplace_back(process, std::ref(v), i * chunk, (i + 1) * chunk);
Each thread has enough work
4. How do you reduce workload imbalance?
Split work evenly or dynamically assign tasks.
Example (Static Balanced Split)
1
2
3
4
int mid = v.size() / 2;
std::thread t1(process, std::ref(v), 0, mid);
std::thread t2(process, std::ref(v), mid, v.size());
Equal work → better utilization
5. What is False Sharing?
Multiple threads update variables on the same cache line.
Example (False Sharing)
1
2
3
4
5
6
7
8
9
10
struct Data
{
int a;
int b;
};
Data d;
void t1() { for (int i = 0; i < 1e7; i++) d.a++; }
void t2() { for (int i = 0; i < 1e7; i++) d.b++; }
a and b may share the same cache line → performance drop
Fix (Padding)
1
2
3
4
5
struct Data
{
alignas(64) int a;
alignas(64) int b;
};
6. Why is cache locality important?
Accessing contiguous memory improves performance.
Example (Good Locality)
1
2
for (int i = 0; i < n; i++)
arr[i] += 1;
Bad Locality
1
2
for (int i = 0; i < n; i++)
arr[i * 1024] += 1;
Cache miss increases
7. Amdahl’s Law
Even if you parallelize part of the code, the serial part limits speedup.
Example
1
2
// 90% parallel, 10% serial
// Max speedup ≈ 10x even with infinite threads
8. What are parallelization overheads?
- Thread creation
- Synchronization
- Context switching
Example
1
2
std::thread t1(func);
t1.join(); // synchronization overhead
9. What is lock contention?
Multiple threads compete for the same lock.
Example
1
2
3
4
5
6
7
8
9
10
#include <mutex>
std::mutex m;
int counter = 0;
void inc()
{
std::lock_guard<std::mutex> lock(m);
counter++;
}
Many threads → waiting → performance drop
10. Why can memory allocation become a bottleneck?
Allocators are often shared and require synchronization.
Example
1
2
3
4
5
6
7
8
void work()
{
for (int i = 0; i < 1000000; i++)
{
int* p = new int(i);
delete p;
}
}
Multiple threads doing this → allocator contention
In a multithreaded environment, new and delete themselves do not directly cause threads to wait. Instead, they rely on an internal memory allocator that must be thread-safe. Because multiple threads may access the same memory pool at the same time, the allocator uses synchronization mechanisms such as locks or atomic operations. As a result, when one thread is allocating or freeing memory, other threads may have to wait, leading to contention and potential performance overhead.
Better Approach (Object Reuse)
1
2
3
4
std::vector<int> buffer(1000000);
for (int i = 0; i < buffer.size(); i++)
buffer[i] = i;