Confusing performance with std::thread.

So I've been using std::thread a little recently on my laptop with its dual-core processor.

The following program is supposed to do the boring job of incrementing a number continuously and then printing the total. Different memory is given to each thread to avoid data races.

It takes one argument, the number of threads to try and create.

EDIT: New code in more recent post.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
#include <iostream>
#include <thread>
#include <vector>
#include <cstdio>
#include <ctime>

const long long AMOUNT_TO_INCREMENT_BY = 5000000000LL;

void NumberIterator(long long *number, long long count)
{
  count += *number;
  while (*number < count) ++(*number);
}

int main(int argc, char **argv)
{
  if (argc != 2) {
    std::cerr << "WRONG NUMBER OF ARGUMENTS!  Must have a single integer argument.\n";
    return 0;
  }
  
  int numberOfThreads;
  std::sscanf(argv[1], "%d", &numberOfThreads);
  
  if (numberOfThreads <= 0) {
    std::cerr << "WRONG ARGUMENT!  Must have a single positive integer argument.\n";
    return 0;
  }
  
  long long
    number = 0,
    amountToIncrementByPerThread = AMOUNT_TO_INCREMENT_BY / numberOfThreads;
  int amountLeftOver = static_cast<int> (AMOUNT_TO_INCREMENT_BY % numberOfThreads);
  
  // Allocate different regions for each thread to work in.
  std::vector<long long> data(numberOfThreads, 0);
  
  std::clock_t timeVal = std::clock();
  
  // I construct numberOfThreads threads, each with their allocated parameters.
  std::vector<std::thread> myThreads;
  myThreads.reserve(numberOfThreads);
  for (int i = 0; i < numberOfThreads; ++i) {
    myThreads.push_back(std::thread(NumberIterator, &data[i], amountToIncrementByPerThread));
  }
  
  // Here I wait for the threads to complete their main functions.
  for (auto &i: myThreads) {
    i.join();
  }
  
  // And I finish-up by adding the remaining numbers.
  for (auto &i: data) {
    number += i;
  }
  NumberIterator(&number, amountLeftOver);
  
  timeVal = std::clock() - timeVal;
  
  std::cout << "Total should be " << AMOUNT_TO_INCREMENT_BY << ", is " << number << ".\n";
  std::cout << "Completed in " << timeVal * 1000 / CLOCKS_PER_SEC << "ms.\n";
  
  return 0;
}


Performance is decent on Windows. Using CL <filename> /EHsc /Za it completes with the one thread in about 16s, and with 4 threads it takes about 10s (which is roughly what I'm after, as my processor has 2 cores).

Performance is undesirable on Linux. Using g++ <filename>.cpp -std=c++11 -o<filename> -pthread with 1 thread we get about 15.5s (well done, g++) and with 4 threads we get about 37s. Errr, what? I was really expecting something even close to the original time, yet alone well over twice as slow.

So, am I completely misinterpreting how to use std::thread, or is this the kind of situation that concurrent processing really isn't optimised for?

Any input on my errors or the curiosities of threads appreciated.
Last edited on
Asking the threads to write to directly adjacent memory locations is a textbook false sharing. Also, you should enable optimization when compiling anything that is intended for any kind of performance rest.
Ah! I suspected their consecutiveness was the issue!

Then, Cubbi, how would I give them reasonable distance? Allocate separate heap storage for each number?

Yes, I suppose using optimisations would be sensible, but would it really help with this problem?
Optimisations really mess with the program. -O2 and -O3 guess the game and just spit out the answer straight away. -O1 runs in about 2.8s, so better. Have yet to try optimisations on Windows.
Okay, now compiling with -O1 on g++, and /O2 on CL.exe.

Timing is about 2.5s with 1 thread, 1.4s with 2 threads on Windows.
About 2.6s with 1 thread, 2.8s with 2 threads on Linux.
And here's a lovely and easy solution I'm surprised I didn't think of:

http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3

Going to try it.
Still getting the issue, slightly less degrade in performance this time.


$ ./threads_test 2
Total should be 5000000000, is 5000000000.
Completed in 19550ms.
$ ./threads_test 1
Total should be 5000000000, is 5000000000.
Completed in 17620ms.


Compiling with: g++ -O3 -std=c++11 threads_test.cpp -othreads_test -pthread.

Earlier I reported strangely small times, but after making the appropriate variable volatile that's stopped so that was probably doing a little more optimisation than I wanted.

Here is the current code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
#include <iostream>
#include <thread>
#include <vector>
#include <cstdio>
#include <ctime>

const long long AMOUNT_TO_INCREMENT_BY = 5000000000LL;

void NumberIterator(long long *number, long long count)
{
  volatile long long localNumber = *number; // Volatile to avoid it being optimised away.
  count += localNumber;
  while (localNumber < count) ++localNumber;
  *number = localNumber;
}

int main(int argc, char **argv)
{
  if (argc != 2) {
    std::cerr << "WRONG NUMBER OF ARGUMENTS!  Must have a single integer argument.\n";
    return 0;
  }
  
  int numberOfThreads;
  std::sscanf(argv[1], "%d", &numberOfThreads);
  
  if (numberOfThreads <= 0) {
    std::cerr << "WRONG ARGUMENT!  Must have a single positive integer argument.\n";
    return 0;
  }
  
  long long
    number = 0,
    amountToIncrementByPerThread = AMOUNT_TO_INCREMENT_BY / numberOfThreads;
  int amountLeftOver = static_cast<int> (AMOUNT_TO_INCREMENT_BY % numberOfThreads);
  
  // Different places to stick the data in when finished calculating.
  std::vector<long long> data(numberOfThreads, 0);
  
  std::clock_t timeVal = std::clock();
  
  // I construct numberOfThreads threads, each with their allocated parameters.
  std::vector<std::thread> myThreads;
  myThreads.reserve(numberOfThreads);
  for (int i = 0; i < numberOfThreads; ++i) {
    myThreads.push_back(std::thread(NumberIterator, &data[i], amountToIncrementByPerThread));
  }
  
  // Wait for the threads to complete their main functions.
  for (auto &i: myThreads) {
    i.join();
  }
  
  // Add the remaining numbers.
  for (auto &i: data) {
    number += i;
  }
  NumberIterator(&number, amountLeftOver);
  
  timeVal = std::clock() - timeVal;
  
  std::cout << "Total should be " << AMOUNT_TO_INCREMENT_BY << ", is " << number << ".\n";
  std::cout << "Completed in " << timeVal * 1000 / CLOCKS_PER_SEC << "ms.\n";
  
  return 0;
}


As usual, CL.exe is providing desirable times. It's ever so slightly faster than before after preventing the false sharing.
Oh yeah, one more thing: std::clock measures CPU time, not wall clock. In ideal case, total CPU time for multiple threads should be exactly the same as the CPU time for a single thread.


Optimisations really mess with the program. -O2 and -O3 guess the game and just spit out the answer straight away

You can defeat that by changing the signature of NumberIterator to
void NumberIterator(volatile long long *number, long long count)

How would I give them reasonable distance? Allocate separate heap storage for each number?

You could just pad them to cache line size, but, as you already found out, if your threads operate on independent data, they certainly can have each their own copy of data to work on.

And here's a lovely and easy solution I'm surprised I didn't think of:

A more canonical C++ solution would be to use std::async

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
#include <iostream>
#include <string>
#include <vector>
#include <chrono>
#include <future>

const long long AMOUNT_TO_INCREMENT_BY = 5000000000LL;

// volatile to defeat optimization
long long NumberIterator(volatile long long number, long long count)
{ 
  count += number;
  while (number < count) ++number;
  return number;
}

int main(int argc, char **argv)
{ 
  if (argc != 2) {
    std::cerr << "WRONG NUMBER OF ARGUMENTS!  Must have a single integer argument.\n";
    return 0;
  }

  unsigned numberOfThreads = std::stoi(argv[1]);;

  if (numberOfThreads == 0) {
    std::cerr << "ERROR!  Number of threads needs to be greater than zero.\n";
    return 0;
  }

  if (numberOfThreads > std::thread::hardware_concurrency())
    std::cerr << "WARNING: number of threads requested (" << numberOfThreads
              << ") exceeds hardware concurrency of " << std::thread::hardware_concurrency() << '\n';

  long long
    number = 0,
    amountToIncrementByPerThread = AMOUNT_TO_INCREMENT_BY / numberOfThreads;
  int amountLeftOver = static_cast<int> (AMOUNT_TO_INCREMENT_BY % numberOfThreads);

  std::vector<std::future<long long>> myTasks;
  myTasks.reserve(numberOfThreads);

  auto timeVal = std::chrono::steady_clock::now();

  for (unsigned int i = 0; i < numberOfThreads; ++i)
    myTasks.emplace_back( std::async(std::launch::async, NumberIterator, 0, amountToIncrementByPerThread) );

  // Here I wait for the threads to complete their main functions.
  for (auto &i: myTasks)
    number += i.get();
  number = NumberIterator(number, amountLeftOver);

  std::chrono::duration<double> timePassed = std::chrono::steady_clock::now() - timeVal;

  std::cout << "Total should be " << AMOUNT_TO_INCREMENT_BY << ", is " << number << ".\n"
            << "Completed in " << timePassed.count() << "s.\n";
}


With that I get, on my clunky first-gen Core i7, with g++ -O3 -march=native

Threads        1    2    3    4    5    6    7    8
wall time, s 12.21 6.25 4.21 3.14 2.51 2.00 1.69 1.41
Yep, using a stopwatch I realised std::clock seems to be implemented very differently across platforms. Looks like MS had it as a timing device incorrectly... I think I'll just use time from now on.

And looks like std::async pretty much prevents the whole issue with fake sharing.

Well then, thanks a lot! Your responses have been incredibly helpful and instructive!
And thanks again, looked up the -march=native option and that looks quite handy for personal code!
Topic archived. No new replies allowed.