Your First Concurrent C++ Program - Parallel Sum with std::thread
Build a parallel sum function using std::thread and analyze its performance. Discover thread overhead and find the number of threads that actually delivers a speedup.
If you've heard that "parallel programming makes things faster" but never tried it yourself, this tutorial is for you. We'll build a simple concurrent program that sums numbers in a vector using multiple threads, and discover why more threads don't always mean better performance.
The problem: Sum a billion numbers
Let's start with the sequential approach everyone knows:
#include <numeric>
#include <vector>

long long sequential_sum(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}
Simple and effective. But what if we could split this work across multiple threads?
The concurrent solution
Here's the key insight: if we have 8 threads, we can divide our vector into 8 chunks and sum each chunk in parallel:
#include <numeric>
#include <thread>
#include <vector>

long long concurrent_sum(const std::vector<int>& data, int thread_count) {
    std::vector<std::thread> threads;
    std::vector<long long> partial_sums(thread_count, 0);
    size_t chunk_size = data.size() / thread_count;

    // Create threads, each summing one chunk
    for (int i = 0; i < thread_count; ++i) {
        size_t start = i * chunk_size;
        size_t end = (i == thread_count - 1) ? data.size() : start + chunk_size;
        threads.emplace_back([&data, &partial_sums, i, start, end]() {
            partial_sums[i] = std::accumulate(data.begin() + start,
                                              data.begin() + end,
                                              0LL);
        });
    }

    // Wait for all threads to finish
    for (auto& t : threads) {
        t.join();
    }

    // Combine partial results
    return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
}
Breaking down the code
1. Storage for results
std::vector<long long> partial_sums(thread_count, 0);
Each thread needs its own space to store its result. This avoids the complexity of threads competing to write to the same variable.
2. Divide the work
size_t chunk_size = data.size() / thread_count;
size_t start = i * chunk_size;
size_t end = (i == thread_count - 1) ? data.size() : start + chunk_size;
We split the vector into equal chunks. The last thread handles any remaining elements (in case the size doesn't divide evenly).
3. Launch threads
threads.emplace_back([&data, &partial_sums, i, start, end]() {
    partial_sums[i] = std::accumulate(data.begin() + start,
                                      data.begin() + end,
                                      0LL);
});
Each thread gets a lambda function that sums its assigned chunk. The lambda captures the vectors by reference (&) and the index and chunk bounds (i, start, end) by value.
4. Wait for completion
for (auto& t : threads) {
    t.join();
}
join() blocks until the thread finishes. Without this, we'd try to read partial results before threads complete.
Testing it out
Let's compare both approaches:
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const size_t SIZE = 1'000'000'000; // one billion elements (~4 GB of RAM!)
    const int THREAD_COUNT = 8;
    std::vector<int> data(SIZE, 1); // All ones for easy verification

    // Test sequential
    auto start = std::chrono::high_resolution_clock::now();
    long long seq_result = sequential_sum(data);
    auto end = std::chrono::high_resolution_clock::now();
    auto seq_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Sequential sum: " << seq_result << "\n";
    std::cout << "Time: " << seq_time.count() << " ms\n\n";

    // Test concurrent
    start = std::chrono::high_resolution_clock::now();
    long long conc_result = concurrent_sum(data, THREAD_COUNT);
    end = std::chrono::high_resolution_clock::now();
    auto conc_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Concurrent sum (" << THREAD_COUNT << " threads): "
              << conc_result << "\n";
    std::cout << "Time: " << conc_time.count() << " ms\n";
    std::cout << "Speedup: "
              << (conc_time.count() == 0 ? 0 : (double)seq_time.count() / conc_time.count())
              << "x\n";
    return 0;
}
Sample output (on a quad-core CPU):
Sequential sum: 1000000000
Time: 10679 ms
Concurrent sum (8 threads): 1000000000
Time: 3216 ms
Speedup: 3.32058x
The reality: Size matters
Here's where it gets interesting. Let's try a smaller vector:
const size_t SIZE = 100'000; // Just 100K elements
You might get this output:
Sequential sum: 100000
Time: 0 ms
Concurrent sum (8 threads): 100000
Time: 1 ms
Speedup: 0x
Wait, the concurrent version is slower? Welcome to the world of overhead.
Understanding thread overhead
Creating a thread isn't free. The operating system must:
- Allocate stack space.
- Set up thread-local storage.
- Schedule the thread on a CPU core.
- Handle context switches.
For tiny datasets, this overhead exceeds any benefit from parallelism. Think of it like this: if it takes you 13 seconds to sort 24 papers, hiring 8 people to each sort 3 papers will take longer when you factor in explaining the task to everyone.
Finding the sweet spot
Let's test different sizes:
SIZE = 100,000 (100K elements)
Sequential: 0 ms
Concurrent (8 threads): 0 ms
Still too small.
SIZE = 1,000,000 (1M elements)
Sequential: 8 ms
Concurrent (8 threads): 3 ms
Starting to break even.
SIZE = 100,000,000 (100M elements)
Sequential: 1409 ms
Concurrent (8 threads): 452 ms
Speedup: 3.1x
Now we're talking!
How many threads?
More threads = more speed, right? Let's test with 100M elements:
// 2 threads
Speedup: 1.6x
// 4 threads
Speedup: 2.4x
// 8 threads
Speedup: 3.1x
// 16 threads (on a 4-core CPU)
Speedup: 3.1x
Notice the diminishing returns? On a 4-core CPU with hyperthreading (8 logical cores), we get the best speedup around 8 threads. Beyond that, threads compete for resources, and context switching overhead kicks in.
Rule of thumb: Start with std::thread::hardware_concurrency(), which returns the number of concurrent threads your CPU supports. Note that it may return 0 if the value can't be determined, so don't use it as a thread count without checking.
int optimal_threads = std::thread::hardware_concurrency();
std::cout << "CPU supports: " << optimal_threads << " threads\n";
It's 8 on my computer.
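Because of that possible 0 return, a small guard is worth having before you feed the value into concurrent_sum. A minimal sketch (the helper name `pick_thread_count` is made up for this example):

```cpp
#include <algorithm>
#include <thread>

// hardware_concurrency() may return 0 when the hardware can't be queried,
// so clamp the result to at least 1 before using it as a thread count.
unsigned pick_thread_count() {
    unsigned hc = std::thread::hardware_concurrency();
    return std::max(1u, hc);
}
```

Then `concurrent_sum(data, static_cast<int>(pick_thread_count()))` is safe even on platforms where the query fails.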
Common pitfalls to avoid
1. Sharing data without protection
// WRONG - Race condition!
long long total = 0;
threads.emplace_back([&data, &total, start, end]() {
    total += std::accumulate(data.begin() + start,
                             data.begin() + end,
                             0LL); // Multiple threads modifying total!
});
Multiple threads writing to total simultaneously causes undefined behavior. That's why we use separate storage (partial_sums) for each thread.
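Per-thread storage isn't the only fix. Another common pattern, sketched below, is to accumulate locally and then do a single std::atomic add per thread (the function name `concurrent_sum_atomic` is invented for this example; atomically adding every element would also be correct, but far slower):

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Safe shared total: each thread sums its chunk into a local variable,
// then performs exactly one atomic add.
long long concurrent_sum_atomic(const std::vector<int>& data, int thread_count) {
    std::atomic<long long> total{0};
    std::size_t chunk_size = data.size() / thread_count;
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i) {
        std::size_t start = i * chunk_size;
        std::size_t end = (i == thread_count - 1) ? data.size()
                                                  : start + chunk_size;
        threads.emplace_back([&data, &total, start, end] {
            long long local = std::accumulate(data.begin() + start,
                                              data.begin() + end, 0LL);
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return total.load();
}
```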
2. Forgetting to join
// WRONG - Program exits before threads finish!
for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back(/* ... */);
}
return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
// Oops! No join() calls
Correct version:
for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back(/* ... */);
}
for (auto& t : threads) {
    t.join(); // Wait for completion
}
return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
3. Uneven work distribution
If the last thread gets significantly more work, it becomes the bottleneck. For simple cases like ours, this isn't a problem, but for irregular workloads, consider load balancing strategies.
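For a fairer split, one common technique is to spread the remainder over the first `size % thread_count` threads so chunk sizes differ by at most one element. A sketch (the helper name `balanced_bounds` is invented here):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Give the first (size % thread_count) threads one extra element each,
// so every chunk size differs by at most one.
std::pair<std::size_t, std::size_t> balanced_bounds(std::size_t size,
                                                    int thread_count, int i) {
    std::size_t base = size / thread_count;
    std::size_t rem = size % thread_count;
    std::size_t idx = static_cast<std::size_t>(i);
    std::size_t start = idx * base + std::min(idx, rem);
    std::size_t end = start + base + (idx < rem ? 1 : 0);
    return {start, end};
}
```

With 10 elements and 3 threads this yields [0, 4), [4, 7), [7, 10) instead of [0, 3), [3, 6), [6, 10), so no thread gets stuck with a disproportionately large tail.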
When to use concurrent sum
Good candidates:
- Large datasets (millions or billions of elements).
- Computationally expensive operations per element.
- Available CPU cores.
Poor candidates:
- Small datasets (under 100K elements).
- I/O-bound operations (reading files, network calls).
- Already optimized sequential code.
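Worth knowing before you hand-roll threads in production: C++17 added parallel algorithms that handle chunking and thread management for you. A minimal sketch, assuming a C++17 compiler with parallel-algorithm support (on GCC/libstdc++ this typically also requires linking against TBB):

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Let the standard library parallelize the sum.
long long parallel_sum(const std::vector<int>& data) {
    return std::reduce(std::execution::par, data.begin(), data.end(), 0LL);
}
```

Rolling your own std::thread version, as in this tutorial, is still the best way to understand what the library is doing under the hood.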
Exercise: Your turn to experiment
Try modifying the code:
- Test different vector sizes: 1K, 10K, 100K, 1M, 10M, 100M.
- Vary thread counts: 2, 4, 8, 16, 32.
- Change the work per element (e.g., sum of squares instead of simple sum).
- Use std::thread::hardware_concurrency() to set the thread count.
Pay attention to where the concurrent version starts beating sequential, and where adding more threads stops helping.
Takeaway
You've just written your first concurrent program! The key lessons:
- std::thread lets you run code in parallel.
- Each thread needs its own workspace to avoid conflicts.
- Always join() threads before using their results.
- Parallelism has overhead; it's only worth it for large enough problems.
- More threads ≠ faster; match your thread count to your CPU cores.
Concurrency isn't magic. It's a tool that shines when the problem size justifies the overhead. Now you know how to measure that trade-off yourself.
Experiment with different sizes and thread counts. The best way to understand concurrency is to see it in action on your own machine.