Your First Concurrent C++ Program - Parallel Sum with std::thread
Build a parallel sum function using std::thread and analyze its performance. Discover thread overhead and find the number of threads that actually delivers a speedup.
If you've heard that "parallel programming makes things faster" but never tried it yourself, this tutorial is for you. We'll build a simple concurrent program that sums numbers in a vector using multiple threads, and discover why more threads don't always mean better performance.
The problem: Sum a billion numbers
Let's start with the sequential approach everyone knows:
#include <numeric>
#include <vector>

long long sequential_sum(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}
Simple and effective. But what if we could split this work across multiple threads?
The concurrent solution
Here's the key insight: if we have 8 threads, we can divide our vector into 8 chunks and sum each chunk in parallel:
#include <numeric>
#include <thread>
#include <vector>

long long concurrent_sum(const std::vector<int>& data, int thread_count) {
    std::vector<std::thread> threads;
    std::vector<long long> partial_sums(thread_count, 0);
    size_t chunk_size = data.size() / thread_count;

    // Create threads, each summing one chunk
    for (int i = 0; i < thread_count; ++i) {
        size_t start = i * chunk_size;
        size_t end = (i == thread_count - 1) ? data.size() : start + chunk_size;
        threads.emplace_back([&data, &partial_sums, i, start, end]() {
            partial_sums[i] = std::accumulate(data.begin() + start,
                                              data.begin() + end,
                                              0LL);
        });
    }

    // Wait for all threads to finish
    for (auto& t : threads) {
        t.join();
    }

    // Combine partial results
    return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
}
Breaking down the code
1. Storage for results
std::vector<long long> partial_sums(thread_count, 0);
Each thread needs its own space to store its result. This avoids the complexity of threads competing to write to the same variable.
2. Divide the work
size_t chunk_size = data.size() / thread_count;
size_t start = i * chunk_size;
size_t end = (i == thread_count - 1) ? data.size() : start + chunk_size;
We split the vector into equal chunks. The last thread handles any remaining elements (in case the size doesn't divide evenly).
3. Launch threads
threads.emplace_back([&data, &partial_sums, i, start, end]() {
    partial_sums[i] = std::accumulate(data.begin() + start,
                                      data.begin() + end,
                                      0LL);
});
Each thread gets a lambda function that sums its assigned chunk. The lambda captures the vectors by reference (&) and the index and chunk bounds (i, start, end) by value.
4. Wait for completion
for (auto& t : threads) {
    t.join();
}
join() blocks until the thread finishes. Without this, we'd try to read partial results before threads complete.
Testing it out
Let's compare both approaches:
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const size_t SIZE = 1'000'000'000; // one billion elements (~4 GB of RAM!)
    const int THREAD_COUNT = 8;
    std::vector<int> data(SIZE, 1); // All ones for easy verification

    // Test sequential
    auto start = std::chrono::high_resolution_clock::now();
    long long seq_result = sequential_sum(data);
    auto end = std::chrono::high_resolution_clock::now();
    auto seq_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Sequential sum: " << seq_result << "\n";
    std::cout << "Time: " << seq_time.count() << " ms\n\n";

    // Test concurrent
    start = std::chrono::high_resolution_clock::now();
    long long conc_result = concurrent_sum(data, THREAD_COUNT);
    end = std::chrono::high_resolution_clock::now();
    auto conc_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Concurrent sum (" << THREAD_COUNT << " threads): "
              << conc_result << "\n";
    std::cout << "Time: " << conc_time.count() << " ms\n";
    std::cout << "Speedup: "
              << (conc_time.count() == 0 ? 0 : (double)seq_time.count() / conc_time.count())
              << "x\n";
    return 0;
}
Sample output (on a quad-core CPU):
Sequential sum: 1000000000
Time: 10679 ms
Concurrent sum (8 threads): 1000000000
Time: 3216 ms
Speedup: 3.32058x
The reality: Size matters
Here's where it gets interesting. Let's try a smaller vector:
const size_t SIZE = 100'000; // Just 100K elements
You might get this output:
Sequential sum: 100000
Time: 0 ms
Concurrent sum (8 threads): 100000
Time: 1 ms
Speedup: 0x
Wait, the concurrent version is slower? Welcome to the world of overhead.
Understanding thread overhead
Creating a thread isn't free. The operating system must:
- Allocate stack space.
- Set up thread-local storage.
- Schedule the thread on a CPU core.
- Handle context switches.
For tiny datasets, this overhead exceeds any benefit from parallelism. Think of it like this: if it takes you 13 seconds to sort 24 papers, hiring 8 people to each sort 3 papers will take longer when you factor in explaining the task to everyone.
Finding the sweet spot
Let's test different sizes:
SIZE = 100,000 (100K elements)
Sequential: 0 ms
Concurrent (8 threads): 0 ms
Still too small.
SIZE = 1,000,000 (1M elements)
Sequential: 8 ms
Concurrent (8 threads): 3 ms
Starting to break even.
SIZE = 100,000,000 (100M elements)
Sequential: 1409 ms
Concurrent (8 threads): 452 ms
Speedup: 3.1x
Now we're talking!
How many threads?
More threads = more speed, right? Let's test with 100M elements:
// 2 threads
Speedup: 1.6x
// 4 threads
Speedup: 2.4x
// 8 threads
Speedup: 3.1x
// 16 threads (on a 4-core CPU)
Speedup: 3.1x
Notice the diminishing returns? On a 4-core CPU with hyperthreading (8 logical cores), we get the best speedup around 8 threads. Beyond that, threads compete for resources, and context switching overhead kicks in.
Rule of thumb: Start with std::thread::hardware_concurrency(), which returns the number of concurrent threads your CPU supports. Note that it may return 0 if the value can't be determined, so don't use it as a thread count without checking.
int optimal_threads = std::thread::hardware_concurrency();
std::cout << "CPU supports: " << optimal_threads << " threads\n";
It's 8 on my computer.
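Because of that possible 0 return, a small guard is worth having before you feed the value into concurrent_sum. A minimal sketch (the helper name `pick_thread_count` is made up for this example):

```cpp
#include <algorithm>
#include <thread>

// hardware_concurrency() may return 0 when the hardware can't be queried,
// so clamp the result to at least 1 before using it as a thread count.
unsigned pick_thread_count() {
    unsigned hc = std::thread::hardware_concurrency();
    return std::max(1u, hc);
}
```

Then `concurrent_sum(data, static_cast<int>(pick_thread_count()))` is safe even on platforms where the query fails.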
Common pitfalls to avoid
1. Sharing data without protection
// WRONG - Race condition!
long long total = 0;
threads.emplace_back([&data, &total, start, end]() {
    total += std::accumulate(data.begin() + start,
                             data.begin() + end,
                             0LL); // Multiple threads modifying total!
});
Multiple threads writing to total simultaneously causes undefined behavior. That's why we use separate storage (partial_sums) for each thread.
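Per-thread storage isn't the only fix. Another common pattern, sketched below, is to accumulate locally and then do a single std::atomic add per thread (the function name `concurrent_sum_atomic` is invented for this example; atomically adding every element would also be correct, but far slower):

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Safe shared total: each thread sums its chunk into a local variable,
// then performs exactly one atomic add.
long long concurrent_sum_atomic(const std::vector<int>& data, int thread_count) {
    std::atomic<long long> total{0};
    std::size_t chunk_size = data.size() / thread_count;
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i) {
        std::size_t start = i * chunk_size;
        std::size_t end = (i == thread_count - 1) ? data.size()
                                                  : start + chunk_size;
        threads.emplace_back([&data, &total, start, end] {
            long long local = std::accumulate(data.begin() + start,
                                              data.begin() + end, 0LL);
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return total.load();
}
```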
2. Forgetting to join
// WRONG - Program exits before threads finish!
for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back(/* ... */);
}
return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
// Oops! No join() calls
Correct version:
for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back(/* ... */);
}
for (auto& t : threads) {
    t.join(); // Wait for completion
}
return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
3. Uneven work distribution
If the last thread gets significantly more work, it becomes the bottleneck. For simple cases like ours, this isn't a problem, but for irregular workloads, consider load balancing strategies.
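For a fairer split, one common technique is to spread the remainder over the first `size % thread_count` threads so chunk sizes differ by at most one element. A sketch (the helper name `balanced_bounds` is invented here):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Give the first (size % thread_count) threads one extra element each,
// so every chunk size differs by at most one.
std::pair<std::size_t, std::size_t> balanced_bounds(std::size_t size,
                                                    int thread_count, int i) {
    std::size_t base = size / thread_count;
    std::size_t rem = size % thread_count;
    std::size_t idx = static_cast<std::size_t>(i);
    std::size_t start = idx * base + std::min(idx, rem);
    std::size_t end = start + base + (idx < rem ? 1 : 0);
    return {start, end};
}
```

With 10 elements and 3 threads this yields [0, 4), [4, 7), [7, 10) instead of [0, 3), [3, 6), [6, 10), so no thread gets stuck with a disproportionately large tail.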
When to use concurrent sum
Good candidates:
- Large datasets (millions or billions of elements).
- Computationally expensive operations per element.
- Available CPU cores.
Poor candidates:
- Small datasets (under 100K elements).
- I/O-bound operations (reading files, network calls).
- Already optimized sequential code.
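Worth knowing before you hand-roll threads in production: C++17 added parallel algorithms that handle chunking and thread management for you. A minimal sketch, assuming a C++17 compiler with parallel-algorithm support (on GCC/libstdc++ this typically also requires linking against TBB):

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Let the standard library parallelize the sum.
long long parallel_sum(const std::vector<int>& data) {
    return std::reduce(std::execution::par, data.begin(), data.end(), 0LL);
}
```

Rolling your own std::thread version, as in this tutorial, is still the best way to understand what the library is doing under the hood.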
Exercise: Your turn to experiment
Try modifying the code:
- Test different vector sizes: 1K, 10K, 100K, 1M, 10M, 100M.
- Vary thread counts: 2, 4, 8, 16, 32.
- Change the work per element (e.g., sum of squares instead of simple sum).
- Use std::thread::hardware_concurrency() to set the thread count.
Pay attention to where the concurrent version starts beating sequential, and where adding more threads stops helping.
Takeaway
You've just written your first concurrent program! The key lessons:
- std::thread lets you run code in parallel.
- Each thread needs its own workspace to avoid conflicts.
- Always join() threads before using their results.
- Parallelism has overhead; it's only worth it for large enough problems.
- More threads ≠ faster; match your thread count to your CPU cores.
Concurrency isn't magic. It's a tool that shines when the problem size justifies the overhead. Now you know how to measure that trade-off yourself.
Experiment with different sizes and thread counts. The best way to understand concurrency is to see it in action on your own machine.