Multithreading a loop

Hello programmers,

I have started working on multithreading and I'm getting a little confused. Suppose I have a huge array of doubles, a million small numbers for example, and I want to find the sum of all its elements. Without multithreading I would just add every array[i] to a single variable. However, multithreading could speed up the process: I could divide the array into 4 parts, for instance, create 4 threads, compute the sum of each part, return the partial sums to the main function, and then add them up. How could I write such a program? An example with code would be really useful to help me understand what exactly is going on. Thanks in advance!

P.S. I am programming in C++11
> An example with code would be really useful to help me understand what exactly is going on

https://github.com/kimwalisch/primesieve/blob/master/src/ParallelSieve.cpp

Hope this example helps ;)
> However multithreading could speed up the process
> I can divide the array to 4 parts for instance, create 4 threads calculate the sum of each part

Yes. But there will not be a fourfold increase in speed: summing numbers is a fast operation, and doing things asynchronously introduces its own overhead.


> how could I make such a program?
> An example with code would be really useful to help me understand what exactly is going on

Using std::async https://en.cppreference.com/w/cpp/thread/async

#include <iostream>
#include <numeric>
#include <future>
#include <ctime>
#include <chrono>
#include <string>

auto psum( const double* begin, const double* end, std::size_t seg_size )
{
    const std::size_t n =  end - begin ;
    if( n <= seg_size ) return std::accumulate( begin, end, 0.0 ) ;

    const double* next_seg_begin = begin + seg_size ;

    // compute the sum of the tail asynchronously (potentially in another thread)
    auto future_sum_tail = std::async( std::launch::async, psum, next_seg_begin, end, seg_size ) ;

    // compute the sum of the first segment in this thread
    const auto sum_head = std::accumulate( begin, begin+seg_size, 0.0 ) ;

    return sum_head + future_sum_tail.get() ;
}

struct timer
{
    timer( std::string n ) : name(n) {}

    ~timer()
    {
        using namespace std::chrono ;
        const auto finish_p = std::clock() ;
        const auto finish_w = steady_clock::now() ;
        std::cout << '\n' << name << "\n---------\n"
                  << " processor: " << (finish_p-start_p) * 1000.0 / CLOCKS_PER_SEC << " milliseconds.\n"
                  << "wall clock: " << duration_cast<milliseconds>(finish_w-start_w).count() << " milliseconds.\n" ;
    }

    const std::string name ;
    const std::clock_t start_p = std::clock() ;
    const std::chrono::steady_clock::time_point start_w =  std::chrono::steady_clock::now() ;
};

int main()
{
    const std::size_t N = 8'000'000 ;
    static double seq[N];
    std::iota( seq, seq+N, 0 ) ;

    double sum ;
    {
        timer t { "linear" };
        sum = std::accumulate( seq, seq+N, 0.0 ) ;
    }
    std::cout << "sum: " << sum << '\n' ;

    {
        timer t { "async (4 segments)" };
        sum = psum( seq, seq+N, N/4 ) ;
    }
    std::cout << "sum: " << sum << '\n' ;
}

http://coliru.stacked-crooked.com/a/6f92a6985f63a184
https://rextester.com/IIU68808
Note that the processor time reported by the Microsoft library is actually the elapsed wall-clock time.


> I am programming in C++11

When you (and implementations) switch to C++17, you would be able to use std::reduce
with an appropriate execution policy, say std::execution::parallel_unsequenced_policy
https://en.cppreference.com/w/cpp/algorithm/reduce
You could try the industry-standard OpenMP ( https://www.openmp.org/ ). It is routinely available with many compilers, e.g. gcc:

g++ -fopenmp temp.cpp

#include <iostream>
#include <ctime>
#include "omp.h"
using namespace std;

int main()
{
   const int N = 100000000;
   static double A[N];
   for ( int i = 0, sign = 1; i < N; i++, sign = -sign )
   {
      A[i] = sign * 4.0 / ( 1.0 + 2.0 * i );
   }

   for ( int nThreads = 1; nThreads <= 8; nThreads++ )
   {
      clock_t start = clock();
      double sum = 0.0;

      omp_set_num_threads( nThreads );                     // OpenMP
      #pragma omp parallel for reduction( + : sum )        // OpenMP
      for ( int i = 0; i < N; i++ )
      {
         sum += A[i];
      }

      clock_t end = clock();

      cout << "Threads: " << nThreads << "     Sum: " << sum << "     Time: " << (double)( end - start ) / CLOCKS_PER_SEC << " s\n";
   }
}


Threads: 1     Sum: 3.14159     Time: 0.249 s
Threads: 2     Sum: 3.14159     Time: 0.141 s
Threads: 3     Sum: 3.14159     Time: 0.078 s
Threads: 4     Sum: 3.14159     Time: 0.078 s
Threads: 5     Sum: 3.14159     Time: 0.062 s
Threads: 6     Sum: 3.14159     Time: 0.047 s
Threads: 7     Sum: 3.14159     Time: 0.047 s
Threads: 8     Sum: 3.14159     Time: 0.031 s


I tried the OpenMP library since it looked like a fast and simple way to parallelize my code; however, the results before using the library were faster than the ones after including it:

Simple for:
Sum: 2.39484e-05
Time taken: 1.422164

For running in two threads:
Sum: 2.39484e-05
Time taken: 3.720002


Why is it happening, though?

P.S. I am working on implementing the std::async way.
@KarampistisDimitris,
I'm surprised. If I comment out the OpenMP lines in my code then it takes the same time as a single thread.

What compiler options are you using? Are you sure that you haven't left debug options on? Could you show your code?

OpenMP is so easy to use compared with the C++ standard library thread routines, and the fact that it is available out of the box for the GNU compilers (and a few others: https://www.openmp.org/resources/openmp-compilers-tools/ ) is a big plus.
@lastchance

The compiler:
g++ (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


I have the debug options on since I am still developing the program; the options were on for both the simple for and the OpenMP version (should I turn them off?).
 
g++ -Wall -g -fopenmp -std=c++11 simple.cpp -o out


I did not use the exact same code. I did some research online and found that I can parallelize a for loop using
#pragma omp parallel for num_threads(2) 


instead of
omp_set_num_threads( nThreads ); 
#pragma omp parallel for reduction( + : sum ) 


I just found the first way easier to use (it may actually be a wrong decision, correct me if I am wrong).
You should turn the debugging options (particularly -g) off if you are interested in timing the threads.

Obviously there are several different ways of writing the parallel for... but how are you doing the reduction to get the sum? What happens if you try 4 threads rather than 2?

Did you try my code on your PC?
Your code is actually faster than a simple for; I just tried it on my system.

This is the code I am using to multithread the sum

 #pragma omp parallel num_threads(2)
 {
   // This code will be executed by two threads.

   // Chunks of this loop will be divided amongst
   // the (two) threads of the current team.
   #pragma omp for
   for(int n=0; n<100000000; ++n) sum += A[n];   // NB: data race on sum; a reduction is needed
 }
// Source: https://bisqwit.iki.fi/story/howto/openmp/
// Slightly modified



Your code:
Threads: 2     Sum: 6.28319     Time: 0.633998 s

Single thread
Threads: 1     Sum: 6.28319     Time: 0.668649 s

Source code:
Threads: 2     Sum: 6.28319     Time: 0.708978 s


*EDIT* You are actually right: the source code is not always giving the right sum; sometimes one thread's contribution is lost. Reduction is needed.
Some questions...
Are you running this thing in debug mode?
I ran the dumb for loop and it's 0.35 s in debug mode for me (and I have a meh-ish work laptop; it's not powerful at all). It's too fast to register on a crude clock() timer in a release build.

I put in a high res chrono timer and get
0.00000030 sec
in release mode.

And that makes *sense*: your CPU is in the 3 GHz neighborhood, that's 3 billion operations per second. You are only adding 100M values, and addition isn't very complicated.

And I get that it's an example... but what do you want to do for real? For this example, if you really wanted to DO this, you would sum it in the loop that populates the array, not in 2 loops (one to make and one to sum). As a counter-example, a way to actually shave some time off the task: it might be cheaper on total time taken program-wide if you multithreaded the array-population loop, which actually does a little work!
At this point I am trying to learn some things about multithreading, so that I can multithread this code:

	for(int i = 0; i < N; i++){
		for(int j = 0; j < M; j++){
			c[i][j] = b[i][j] + a[i][j];
		}
	}


However, before going straight at the above code, with nested for loops and a 2D array, I want to make sure that I have understood the basics. I am currently experimenting on simpler examples; after that I will try other mathematical expressions inside the for. When you say debug mode, do you mean through gdb or a similar program? If so, I am not running it through a debugging program, just executing it using
./out

I mean when you compiled it: did you do g++ -O3 or similar? The default option is unoptimized (-O0) debug-friendly code, which is MUCH slower. If you take that out, you can't use the debugger anymore, but the code is faster.

Multithreading may actually slow you down on release builds in a lot of cases if you don't have a very large amount of data or a complicated problem. It's kind of tricky to get a feel for it.

You can write that for loop as a single loop if a, b, c are arrays or single memory blocks. Learning multithreading is important, and it's a strange way to think. You have to think about it on the front and back ends: first you have to have an algorithm for the task that can make good use of the MP, and then often the last thing you do, after tuning the workhorse loop/function/etc., is actually break it up and thread it. Tuning the loop in the middle is a whole lot of learning as well, but it's equally important.
I literally just noticed it: I was using the default option -O0; changing to -O3 actually increases the speed.
Moreover, if I get it right, generally speaking you are never 100% sure whether multithreading will make your code faster, since some CPUs may be busy and one thread will have to wait for the others to end, won't it?
Ok... that is interesting. Try -O2?

What I tested
   const int N = 100000000;
   double sum = 0;
   static double A[N];
   for ( int i = 0, sign = 1; i < N; i++, sign = -sign )
   {
      A[i] = sign * 4.0 / ( 1.0 + 2.0 * i );
   }

   high_resolution_clock::time_point t1 = high_resolution_clock::now();
   for ( int n = 0; n < N; ++n ) sum += A[n];
   high_resolution_clock::time_point t2 = high_resolution_clock::now();
   duration<double> time_span = duration_cast<duration<double>>(t2 - t1);


I get 0.35 for -O0, 0.034 for -O1, and e-7 for -O2 and -O3.

It's not just a busy CPU. Asking for a thread takes time in both the OS and the hardware, so very fast things (like this loop taking e-7 sec) can actually finish the task in less time than the overhead of splitting it up into threads costs. And it's not hard and fast: if you looped 100 times, you would see it; at 100M maybe the threads beat it... somewhere there is a break-even point. CPUs and OSes do NOT wait on threads to end: they context switch (also very expensive) and give a little time to each thread on most desktop-type OSes. There are some systems or controls to modify that behavior, but that is the typical PC way. Still, more than 1 thread per CPU is counterproductive on most hardware; leaving 1 CPU for the OS and burning the rest with your code is a solid approach for typical needs.

I put it on my hot-box at home: 0.24 / 0.026 / 3e-7 ... interestingly no faster on the most optimized version, indicating it is maybe bound by a factor of clock speed.
Topic archived. No new replies allowed.