Multithreading with async: only a 4x speedup?

I am new to multithreading in C++11 with Visual Studio. I am trying to measure the speedup of my code using async, and I am only getting a 4x speedup (after subtracting the time taken to initialize the threads, ~0.22us for each loop of the threaded process). However, I know my machine can run 8 threads, so shouldn't I see an 8-fold increase? Am I misunderstanding the capabilities of multithreading, or is the error in my code?

L_d is a class holding a matrix, the asv values split it into ranges of rows, and loop_kernel is a function that multiplies the elements along each row together and puts the result back into L_d.

double time1 = (double)clock();
for (int i = 0; i < 1000; i++) {
    // each call gets a start row and an end row, e.g. from 0 to asv1
    auto as1 = async(loop_kernel, L_d, 0, asv1);
    auto as2 = async(loop_kernel, L_d, asv1, asv2);
    auto as3 = async(loop_kernel, L_d, asv2, asv3);
    auto as4 = async(loop_kernel, L_d, asv3, asv4);
    auto as5 = async(loop_kernel, L_d, asv4, asv5);
    auto as6 = async(loop_kernel, L_d, asv5, asv6);
    auto as7 = async(loop_kernel, L_d, asv6, asv7);

    loop_kernel(L_d, asv7, *L_d.M); // use main processor for the last range
    // wait for all seven tasks before starting the next iteration
    as1.get();
    as2.get();
    as3.get();
    as4.get();
    as5.get();
    as6.get();
    as7.get();
}
double timediff = (double)(clock() - time1) / CLOCKS_PER_SEC;
cout << "time for 8 threads: " << timediff << endl;

time1 = (double)clock();
for (int i = 0; i < 1000; i++)
    loop_kernel(L_d, 0, M); // no syncs
timediff = (double)(clock() - time1) / CLOCKS_PER_SEC;
cout << "time for serial: " << timediff << endl;
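For comparison, here is a minimal sketch of the same row-splitting pattern written with a container of futures instead of seven named ones. Matrix, row_product_kernel and parallel_row_product are hypothetical stand-ins, since the post does not show the class behind L_d or the signature of loop_kernel. Two things in the sketch are assumptions about what you probably want rather than facts about the original code: it passes std::launch::async explicitly (without a policy, the implementation is allowed to defer the call), and it wraps the matrix in std::ref, because std::async copies its arguments otherwise. Whether the original L_d is copied on every call depends on declarations that are not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for the poster's matrix class: row-major storage.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;  // rows * cols elements
};

// Hypothetical stand-in for loop_kernel: for each row in [first, last),
// multiply the row's elements together and store the product in column 0.
void row_product_kernel(Matrix& m, std::size_t first, std::size_t last) {
    for (std::size_t r = first; r < last; ++r) {
        double product = 1.0;
        for (std::size_t c = 0; c < m.cols; ++c)
            product *= m.data[r * m.cols + c];
        m.data[r * m.cols] = product;
    }
}

// Split the rows into at most `tasks` contiguous chunks and run them concurrently.
void parallel_row_product(Matrix& m, std::size_t tasks) {
    assert(tasks > 0);
    const std::size_t chunk = (m.rows + tasks - 1) / tasks;  // ceiling division
    std::vector<std::future<void>> pending;
    for (std::size_t first = chunk; first < m.rows; first += chunk) {
        const std::size_t last = std::min(first + chunk, m.rows);
        // std::launch::async requests a new thread; std::ref avoids copying m.
        pending.push_back(std::async(std::launch::async, row_product_kernel,
                                     std::ref(m), first, last));
    }
    // The calling thread handles the first chunk itself.
    row_product_kernel(m, 0, std::min(chunk, m.rows));
    for (auto& f : pending) f.get();  // wait; rethrows any exception from a task
}
```

Each task writes only to its own rows, so no locking is needed. Calling parallel_row_product(m, 8) corresponds to the seven asyncs plus the main thread in the original post.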

Do you notice any difference if you do the "serial" test before the asyncs?
It goes from 4x to 4.7x...?
You called the second test serial. "time for serial".
Threading does not guarantee a speedup equal to the number of threads/cores you run your code on. You have to take into account things like memory access, processor scheduling, cache swapping, cache misses, pre-fetching etc.
Ok thanks. I just thought something was very wrong with this, especially since I got the same speed up if I used just four threads (and cut the serial in half). I guess if I want this to go faster I will need to look for solutions that are not related to how I called async.
Are there no dependencies between these threads? Also remember that the OS is in charge of delegating thread resources. It may not want to give up all CPUs to the threads of a single process. Since C++11 threading support is also still in its infancy, there are likely to be bugs/caveats.
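One more thing worth checking, offered as a possible explanation rather than a diagnosis since the thread does not name the CPU: the "8 threads" a machine reports are logical processors. On a CPU with simultaneous multithreading (Hyper-Threading), 8 logical processors can mean 4 physical cores, and two logical processors on the same core share its execution units, so compute-bound code may stop scaling around the core count. The sketch below uses std::thread::hardware_concurrency, which returns the number of hardware thread contexts as a hint and may return 0 if the value cannot be determined. pick_task_count is a hypothetical helper name.

```cpp
#include <cassert>
#include <thread>

// Number of worker tasks to launch: the reported logical thread count, or
// `fallback` when the implementation cannot determine it (it then returns 0).
unsigned pick_task_count(unsigned fallback = 2) {
    assert(fallback > 0);
    const unsigned reported = std::thread::hardware_concurrency();
    return reported != 0 ? reported : fallback;
}
```

The standard library has no call that reports physical cores, so comparing a run with this many tasks against a run with half as many is a simple way to see whether the extra logical processors help.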

If you are using Windows, this may be of use:

Not sure std::clock is suitable for measuring this kind of thing.

http://en.cppreference.com/w/cpp/chrono/c/clock
std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock.
I tried another clock, std::chrono::steady_clock (http://www.cplusplus.com/reference/chrono/steady_clock/), but got the same results. I checked the thread information on the Microsoft webpage mentioned above. It appears to be a very involved process, which might not provide me with faster threads anyway. I am somewhat satisfied to know that my use of async is not the problem here. I think what I will end up doing is testing the code on different machines.
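For anyone repeating the measurement, here is a minimal sketch of a helper that records both clocks around the same piece of work, so the two can be compared directly. Timing and time_it are hypothetical names, not anything from the original code. As far as I know, Microsoft's C runtime documents clock as returning elapsed wall-clock time rather than CPU time, which would be consistent with both clocks giving the same result under Visual Studio; on other platforms the two numbers can differ noticeably for multithreaded code.

```cpp
#include <chrono>
#include <ctime>

struct Timing {
    double wall_seconds;  // elapsed real time, from std::chrono::steady_clock
    double cpu_seconds;   // std::clock time; its meaning varies by platform
};

// Run `work` once and report the time taken according to both clocks.
template <typename Work>
Timing time_it(Work&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping each of the two 1000-iteration loops in a lambda and passing it to time_it would give a wall-clock speedup figure that does not depend on how std::clock behaves.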

Thanks everyone for your input, it is very much appreciated.