Using multithreads for simple computation

Hi everyone. I am trying to understand how to use threads to compute simple operations faster. I am just having difficulty understanding the relation between using a thread and calling a function.

For example:

I have a set of numbers, and I want to calculate whether or not each number is prime. Without threads this is really simple, just use a loop and call the isPrime function (self defined) bla bla bla etc.

How does this work with threads? Do you assign the value to a thread, pass the thread to isPrime? Does the thread hold multiple numbers? I don't understand how we can use threads to compute this problem faster.
understanding the relation between using a thread and calling a function.

Using a thread *is* calling a function. When a thread is created, it immediately begins executing the function you pass to its constructor. When the function returns, the thread ends.

I don't understand how we can use threads to compute this problem faster.

You'd need to rewrite it in a way that makes it possible to call functions that don't need each other's results to make progress.

Take a simpler (and more parallelizable) task, calculate the sum of all numbers in a sequence:

a[0] + a[1] + a[2] + ... + a[N-1]

You can make a function call that adds the whole thing together:

1
2
3
4
5
6
7
8
9
#include <iostream>
#include <numeric>
#include <vector>
int main()
{
    std::vector<int> data{100000, 1};
    int result = std::accumulate(data.begin(), data.end(), 0);
    std::cout << result << '\n';
}


There is nothing threads can do there.

But you can rewrite that program to make two function calls, one for each half of the sequence:

1
2
3
4
5
6
7
8
9
10
11
12
#include <iostream>
#include <numeric>
#include <vector>
int main()
{
    std::vector<int> data{100000, 1};
    size_t mid = data.size()/2;
    int sum1 = std::accumulate(data.begin(), data.begin() + mid, 0);
    int sum2 = std::accumulate(data.begin() + mid, data.end(), 0);
    int result = sum1 + sum2;
    std::cout << result;
}


now if you call one of these functions on a thread, it can make progress while the other function is also making progress. Once they are both done, you can combine the results:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#include <iostream>
#include <numeric>
#include <vector>
#include <thread>
int main()
{
    std::vector<int> data{100000, 1};
    size_t mid = data.size()/2;
    int sum1;
    std::thread t([&]{sum1 = std::accumulate(data.begin(), data.begin() + mid, 0);});
    int sum2 = std::accumulate(data.begin() + mid, data.end(), 0);
    t.join();
    int result = sum1 + sum2;
    std::cout << result;
}

that does the same thing, in half the time if you have enough CPU cores

(note in this example, the two threads (main and t) never access the same variable: t reads from a[0]..a[mid-1] and writes to sum1, main reads from a[mid]..a[n-1] and writes to sum2. If two threads have to access the same variable, things get complicated.
Last edited on
I started the following program a while ago and had to go off to do other stuff in the meantime. So though cubbi's post covers it all I thought might as well finish what I started, so here goes and similar to above post there's no concurrent data access by construction:
generate 50 million random numbers b/w 1 and 10,000 and count how many are prime:
doing it in one thread uses at least 2.5X longer CPU time compared to doing the same task split across 50 threads, and that is AFTER taking into a/c the time spent constructing the threads:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <ctime>
#include <random>
#include <future>
#include <iomanip>
#include <utility>
#include <iterator>
#include <numeric>

bool isPrime(const unsigned int n)
{
    if (n == 1)return true;

    for (int i = 2; i * i  <= n; ++i)
    {
        if(!(n%i))return false;
    }
    return true;
}
unsigned int countPrime(const std::vector<unsigned int>& vec)
{
    unsigned int num{};
    for (const auto& elem : vec)if(isPrime(elem))++num;
    return num;
}

constexpr auto SIZE = 50000000;
constexpr auto MULT = 1000000;
constexpr auto low_bound = 1;
constexpr auto up_bound = 10000;

int main()
{
    auto seed = std::chrono::system_clock::now().time_since_epoch().count();//seed
    std::default_random_engine dre(seed);//engine
    std::uniform_int_distribution<unsigned int> di(low_bound,up_bound);//distribution

    std::vector<unsigned int> data(SIZE);
    std::generate(data.begin(), data.end(), [&dre, &di]{ return di(dre);});
    //http://en.cppreference.com/w/cpp/algorithm/generate

    const auto startSingleThread = std::clock();

    unsigned int num_single{};
    for (const auto& elem : data)if(isPrime(elem))++num_single;
    std::cout << "single thread prime count: " << num_single << "\n";


    const auto finishSingleThread = std::clock();
    //http://en.cppreference.com/w/cpp/chrono/c/clock

     std::cout << std::fixed << std::setprecision(2) << "single thread cpu time used: "
              << 1000 * (finishSingleThread - startSingleThread) /CLOCKS_PER_SEC << " ms \n";
    //http://en.cppreference.com/w/cpp/io/manip/setprecision

     std::vector<std::vector<unsigned int>> subs(SIZE/MULT, std::vector<unsigned int> (MULT));
    for (size_t i = 0; i < subs.size(); ++i)
    {
        subs[i].assign(std::make_move_iterator(data.begin() + MULT * i),
                       std::make_move_iterator(data.begin() + MULT * i + MULT));
       //http://en.cppreference.com/w/cpp/container/vector/assign
       //http://en.cppreference.com/w/cpp/iterator/make_move_iterator
    }

     std::vector<std::future<unsigned int>> threads{};
     //http://en.cppreference.com/w/cpp/thread/future

    const auto startMultiThread = std::clock();
    for (size_t i = 0; i < SIZE/MULT; ++i)
    {
      threads.emplace_back((std::async(std::launch::async, countPrime, std::move(subs[i]))));
      //http://en.cppreference.com/w/cpp/container/vector/emplace_back
      //http://en.cppreference.com/w/cpp/thread/launch
      //http://en.cppreference.com/w/cpp/utility/move
    }

    unsigned int num_multi{};
    for (auto& elem : threads) num_multi += elem.get();
    //http://en.cppreference.com/w/cpp/thread/future/get
    std::cout << "multi thread prime count: " << num_multi << "\n";
    const auto finishMultiThread = std::clock();

    std::cout << std::fixed << std::setprecision(2) << "multi thread cpu time used: "
              << 1000 * (finishMultiThread - startMultiThread)/CLOCKS_PER_SEC << " ms \n";
}

Sample Output
1
2
3
4
single thread prime count: 6152508
single thread cpu time used: 7304 ms
multi thread prime count: 6152508
multi thread cpu time used: 2676 ms

An interesting learning point for me was that a multi-threaded program with too many threads can actually be slower than a single thread program, the drawback being the time taken to construct the thread (here std::future) objects. OP: you can play around with the optimal # threads and see if you can push the ratio of single threaded / multi-threaded cpu time used even higher

Edit: full-disclosure – the above program was complied and run with C::B IDE using gcc 4.9.2 (I think). I also tried to run it from the command line with mingw-gcc 6.1 (from nuwen.net) and getting an error on line 74, 81 “invalid use of incomplete type'class std::future<unsigned int>' - checking
Edit2: hmm, could it be this?
http://stackoverflow.com/questions/10209871/c11-stdasync-doesnt-work-in-mingw
Last edited on
The act of creating a thread has an expense. If the computation is too small, its not worth it. threading identical code over a list of values is great, divide the inputs into chunks and go, and you will get a flat speedup ... 2 cpus, 2x faster, 3 cpus, 3x faster, etc. Hard to beat that. Remember to recycle your main thread for one of the workers if appropriate. Its "free" (no thread create overhead) but it will lock up a GUI etc so there are places where this isnt good, and places where its OK depending on the program. Each thread can loop over a set of inputs or each thread can compute 1 value and return, this is going to depend on your design and needs. A thread is just a function, it works the same way, you just need to be careful of race conditions and shared data problems etc. I suggest doing several in a loop on each thread, if you have many numbers to process.

However for prime checking, consider the primorial approach for at least a partial gain. That is, if you do the 'factorial' algorithm on all the primes you know up to some value, you can use the GCD algorithm which is VERY efficient to check if a number is prime. You need a way to store the huge value, just like factorials, if you go this route though. And after some cutoff, you have to swap back to a sieve /whatever algorithm. This is very good for 32 and maybe 64 bits worth of primes, if you are doing something that uses the smaller values (not encryption or huge prime stuff).

This isnt clear, so maybe an example..

2*3*5*7 = 210.
is 5 prime?
gcd(5, 210) is 5. Therefore its prime, as it was in the original group multiplied together.


Last edited on
the above program was complied and run with C::B IDE using gcc 4.9.2 (I think). I also tried to run it from the command line with mingw-gcc 6.1 (from nuwen.net) and getting an error on line 74, 81

for the record, the program I'd posted above also runs fine on VS2017 (Community), though taking much longer. So there might be a problem with mingw-gcc in this regard or I could be missing the right command line syntax. Sample VS stats:
1
2
3
4
5
6
7

single thread prime count: 6151259
single thread cpu time used: 38390 ms
multi thread prime count: 6151259
multi thread cpu time used: 12183 ms
//...
The program '[1451592] ConsoleApplication3.exe' has exited with code 0 (0x0).
is there a purpose for the code or just playing?
If the cutoff is 10k, a map of all the primes up to there should finish it in < 1 sec.
is there a purpose for the code or just playing?

read the OP, I don't think you've understood the motivation behind it. It's not seeking the fastest way to find prime numbers but rather seeks to understand if multi-threading is faster than single threads so even if the algorithm for finding primes it not the most efficient it is moot in this case, the key is we are comparing like for like and demonstrating that multi-threads are faster
Last edited on
ah, ok. Carry on then :)
thanks for your patience but I think I'm done for this round unless ... ;))
Topic archived. No new replies allowed.