Four threads do not perform four times faster on a quad-core!!!


Hi,

I am just testing simple parallel programming, using multiple working threads on a quad-core PC -- very simple stuff!!

.- I can specify the number of working threads started, but never use more working threads than cores (four on my quad-core PC).
.- I made sure the dummy task takes so long that the synchronized part of the code (critical section) is negligible.
.- I am measuring time across threads (on each task step) and on the main thread using clock().
.- The task is very simple – I am sure it does not require data or instructions from outside the cache.

What annoys me is that the overall inefficiency of multi-threading seems far greater than expected:
.- Some threads (or cores) seem to be lazy – they take up to 20% longer to complete the same working steps.
.- All threads seem to take longer on the same working steps as the number of working threads increases.
.- Using the Task Manager, I can see under 3% load when idle.

Just as a summary, here is the elapsed time of the working task:
.- 9390 ms with 4 threads (30% -- 20% off ideal)
.- 16331 ms with 2 threads (53% -- 6% off ideal)
.- 30757 ms with 1 thread (100%)

Any comments?

If anyone wants to see the code, let me know...

Thanks.
I can't really justify numbers down to the last digit. I can only say that creating a thread has its implications. There's always some cost associated with creating and destroying a thread, just like any other operating system resource.

You also say the synchronization cost is negligible, but how negligible is it, really? Did you measure it?

One must also remember that the OS has to schedule threads for other processes too, and the scheduling algorithm is quite complex and depends on several factors. I think a fairer test would be to create the threads with processor affinity and the highest priority, to minimize the impact of the scheduler and to actually determine whether the selected core is an important factor here. To be complete you would have to repeat the experiment while rotating the work among different cores, to ensure that the apparent delays on specific cores are not tied to the type of work being done.

I guess in general you should design a formal experiment with sound statistical analysis, such as a factorial design. I don't think one can categorically speak about issues without actually following the scientific method.
Hi webJose.

Thanks for your comments.

The truth is, I am much more interested in the actual overall figure for the parallelism inefficiency I have encountered on my PC than in the accuracy of its measurement.


Nevertheless,
.- I made sure I measured net task time, and excluded thread creation and destruction.
.- I tried to measure the execution of the critical-section part, and I got less than 1 ms (as it should be!!)
.- I set it up so that each task step takes well over 1000 ms, so that everything else could be neglected.
.- As for statistics, I am measuring and comparing the average elapsed time of each working step on each thread. (I don't know how to assign a thread to a particular core – that'll be interesting.) And of course I ran the program several times.

I was hoping to find someone else who had noticed this kind of departure from ideal parallel performance (around 20%) and not blamed it on task fragmentation, the scheduler, the synchronization overhead, cache issues, or even other loads on the computer.

In my test setup, all of these factors together should not account for 20%!!!

There has got to be something else… So far, I feel cheated by my PC and its quad-core processor.

Just to set a reference: a long time ago I used to do some heavy computing on a Silicon Graphics machine with 16 processors (available to me while studying at UIUC), and this kind of simple parallelism (task steps independent of each other) performed very close to ideal. It was, of course, UNIX (IRIX) instead of Windows, gcc instead of Visual C++, processes instead of threads, etc.

Somehow, I figured that by now (about 15 years later) Intel/Microsoft would have caught up. But if a simple parallelism scheme yields that kind of result, just imagine the performance deterioration in complex schemes.
I have used OpenCL with an Intel i5 on Windows 7 and achieved almost exactly 4-times-faster code. The same thing was a little bit slower using OpenMP, but still almost 4 times faster. This was for processing large vectors of audio samples.
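
To give a rough idea, the OpenMP case was something along these lines (just a generic sketch, not my actual code; the per-sample work is a placeholder):

// compile with /openmp in Visual C++ (or -fopenmp with gcc)
// Process a large vector of audio samples in parallel.
void processSamples(float* samples, int count)
{
    #pragma omp parallel for
    for (int i = 0; i < count; i++) {
        samples[i] *= 0.5f;   // placeholder per-sample work
    }
}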
Share your test code. Let's examine that and go from there.
Here is the code. If there is a better way of sharing code, please let me know.



// Prueba1.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>   // CRITICAL_SECTION, GetCurrentThreadId (may already come in via stdafx.h)
#include <process.h>   // _beginthreadex
#include <cstdio>
#include <ctime>

#define N_THREADS 4        // <--- change this to test different scenarios!!!!
#define NUMBER1   24       // total number of work steps
#define NUMBER2   8000000  // iterations per work step (dummy load)

float data[NUMBER1];
int   thread_id[NUMBER1];
int   thread_time0[NUMBER1];
int   thread_time1[NUMBER1];
volatile int counter = 0;
CRITICAL_SECTION critical;

// Dummy work step: a fixed amount of floating-point work (n is not used).
float taskStep(int n)
{
    float x = 1;
    for (int i = 1; i < NUMBER2; i++) {
        x *= i;
    }
    return x;
}

unsigned int __stdcall theThread(void*)
{
    // NOTE: the check on counter happens outside the critical section, so a
    // thread can claim a number >= NUMBER1 (this is pointed out further down the thread).
    while (counter < NUMBER1) {

        EnterCriticalSection(&critical);
        int number = counter++;
        LeaveCriticalSection(&critical);

        thread_time0[number] = clock();
        data[number] = taskStep(number);
        thread_id[number] = (int)GetCurrentThreadId();
        thread_time1[number] = clock();
    }
    return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    HANDLE threadHandle[N_THREADS];
    int i, k, tmp;
    int t0, t1, t_acc, t_n;

    InitializeCriticalSection(&critical);

    t0 = clock();
    for (i = 0; i < N_THREADS; i++) {
        threadHandle[i] = (HANDLE)_beginthreadex(0, 0, &theThread, (void*)0, 0, 0);
    }
    for (i = 0; i < N_THREADS; i++) {
        WaitForSingleObject(threadHandle[i], INFINITE);
    }
    t1 = clock();

    for (i = 0; i < N_THREADS; i++) {
        CloseHandle(threadHandle[i]);
    }
    DeleteCriticalSection(&critical);

    // Group the per-step timings by thread id and print an average per thread.
    printf("t0 = %d\n\n", t0);
    tmp = -1;
    for (k = 0; k < N_THREADS; k++) {
        // Find the first step executed by a thread that has not been printed yet.
        for (i = tmp + 1; i < NUMBER1 && tmp != -1 && thread_id[tmp] == thread_id[i]; i++) {}
        tmp = i;
        printf("Thread %d:\n", thread_id[tmp]);
        printf("Step     t0     t1  t1-t0\n");
        t_acc = t_n = 0;

        for (i = 0; i < NUMBER1; i++) {
            if (thread_id[tmp] == thread_id[i]) {
                printf("%3d: %6d %6d %6d\n", i, thread_time0[i], thread_time1[i], thread_time1[i] - thread_time0[i]);
                t_acc += thread_time1[i] - thread_time0[i];
                t_n++;
            }
        }
        printf("      %5d / %2d = %5d\n", t_acc, t_n, t_acc / t_n);
        printf("\n\n");
    }
    printf("t1 = %d\n\n", t1);

    getchar();
    return 0;
}
Why this:
for(i=0;i<N_THREADS;i++) {
    WaitForSingleObject(threadHandle[i], INFINITE);
}
and not this:
 
WaitForMultipleObjects(N_THREADS, threadHandle, TRUE, INFINITE);


You should pass in i to each thread so they don't have to do this:
EnterCriticalSection(&critical);
int number = counter++;
LeaveCriticalSection(&critical);
and there is InterlockedIncrement().

clock() probably syncs your threads. You should consider using QueryPerformanceCounter() for timings, but be aware each core has its own counter; overall, though, you should get reliable values.
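
For what it's worth, a minimal sketch of timing a step with QueryPerformanceCounter() (the helper name is just illustrative):

#include <windows.h>

// Elapsed milliseconds between two QueryPerformanceCounter samples.
static double elapsedMs(LARGE_INTEGER start, LARGE_INTEGER end)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // counts per second
    return (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
}

// usage inside a work step:
//   LARGE_INTEGER t0, t1;
//   QueryPerformanceCounter(&t0);
//   data[number] = taskStep(number);
//   QueryPerformanceCounter(&t1);
//   double stepMs = elapsedMs(t0, t1);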
This code:
EnterCriticalSection(&critical);
int number = counter++;
LeaveCriticalSection(&critical);


could be replaced by InterlockedIncrement(), which is an atomic operation.

Also, create all 4 threads in a suspended state and use the thread affinity functions to force the operating system to execute each thread on a specific core. Only then start the benchmark.
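
A minimal sketch of that, reusing the threadHandle array and thread function from the posted code (error handling omitted):

for (i = 0; i < N_THREADS; i++) {
    // create the thread suspended so it cannot run before its affinity is set
    threadHandle[i] = (HANDLE)_beginthreadex(0, 0, &theThread, (void*)0,
                                             CREATE_SUSPENDED, 0);
    SetThreadAffinityMask(threadHandle[i], (DWORD_PTR)1 << i);  // pin thread i to core i
}
t0 = clock();
for (i = 0; i < N_THREADS; i++) {
    ResumeThread(threadHandle[i]);  // only now start the benchmark
}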
Also, create all 4 threads in a suspended state and use the thread affinity functions to force the operating system to execute each thread on a specific core.
Don't do that. Don't interfere with the OS scheduler. This is a benchmark after all, the only optimisations should be to remove syncs.
@OP: Do you know what the term "deadlock" means in this context? How about what the function "EnterCriticalSection()" even does? kbw and modoran both mentioned "InterlockedIncrement()" as a suggestion for replacing the critical-section parts of your threads, but for your stated purposes it HAS to be done. Otherwise you are literally working against your own threads.

Also, don't bother assigning the threads to cores; there's no reason to believe that you are better at micromanaging this kind of stuff than your scheduler is.
For the record: I only suggested affinity because the OP hinted he was suspicious about particular cores being lazy. I also don't support processor affinity in a regular multithreading environment.
Great! Thanks!!!

This would be my version using InterlockedIncrement()


unsigned int __stdcall theThread(void*)
{
    // note: for InterlockedIncrement(), counter should be declared as a volatile LONG
    long n;
    while ((n = InterlockedIncrement(&counter) - 1) < NUMBER1) {
        thread_time0[n] = clock();
        taskStep(n);
        thread_id[n] = (int)GetCurrentThreadId();
        thread_time1[n] = clock();
    }
    return 0;
}

This eliminates the possibility of a thread executing a step with an "n" larger than or equal to NUMBER1. That was possible before, because by the time the critical section was reached, counter might already have been incremented by another thread.

I am new to this API. After reading, I realized that I actually wanted InterlockedAdd(), as I had in mind passing chunks of tasks to the threads instead of just one, but somehow it does not seem to be implemented. Is this right?
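
In case I end up needing it, here is how I picture the chunked scheme using InterlockedExchangeAdd() (which does seem to be available and returns the counter's previous value). CHUNK is just a hypothetical chunk size, and, as above, counter would have to be a volatile LONG:

#define CHUNK 4  // hypothetical number of task steps claimed at a time

unsigned int __stdcall theThread(void*)
{
    long first;
    // InterlockedExchangeAdd() returns the value counter had before the add.
    while ((first = InterlockedExchangeAdd(&counter, CHUNK)) < NUMBER1) {
        long last = first + CHUNK;
        if (last > NUMBER1) last = NUMBER1;   // clamp the final (partial) chunk
        for (long n = first; n < last; n++) {
            thread_time0[n] = clock();
            data[n] = taskStep((int)n);
            thread_id[n] = (int)GetCurrentThreadId();
            thread_time1[n] = clock();
        }
    }
    return 0;
}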

Computergeek01: I don't see the possibility of a deadlock in the implementation of the counter increment using critical sections -- the counter gets read and incremented by exactly one thread at a time.
You should pass in the array index when you create the thread. You're causing threads to wait/the CPU to spin just for the sake of not passing it in.
kbw: I don't agree, or I don't understand what you mean.

The whole point of having each thread increment the counter is to have each thread execute as many work-steps as it can. If a thread (or core) happens to be more loaded, it ends up executing fewer task-steps than the others. Overall, that is the only way to ensure the best performance. If I pre-assign each working thread a fixed number of task-steps, the main thread will still need to wait for the slowest working thread, even if the faster ones are done and idle. This is especially important if the work-steps are of variable complexity.

On the other hand, the only wasted time is what some working threads might spend waiting for InterlockedIncrement() to complete on another working thread, and that only happens if they hit it at the same time, which is very infrequent.
Sorry, my mistake. I just took a detailed look at your code when formatted.

But I think you may have a fundamental misunderstanding of the way the scheduler works, and that your code could be a more representative test of parallelism if the threads didn't sync with each other so frequently.

If a thread (or core) happens to be more loaded,
There's no relationship between what a thread is doing and what a CPU core is doing. The scheduler decides what's best; your thread isn't bound to a core. Also, your program isn't the only thing being run on the computer, and the other processes need scheduling too.

the only wasted time is the one some working threads might spend waiting for InterlockedIncrement() to complete on other working threads. That is if they happen at the same time,
All the threads are doing the same thing, so they'll tend to hit the sync points at the same time. So they'll tend to sync themselves.

As all the threads are doing the same thing, a fairer test would be to make each thread just run taskStep() a fixed number of times without synchronising with other threads.
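
Something like this (just a sketch; each thread gets its index when created and runs its own share of steps, with no shared counter -- assumes NUMBER1 divides evenly by N_THREADS):

unsigned int __stdcall theThread(void* arg)
{
    int threadIndex = (int)(size_t)arg;        // passed in at _beginthreadex time
    int stepsPerThread = NUMBER1 / N_THREADS;
    int first = threadIndex * stepsPerThread;

    for (int n = first; n < first + stepsPerThread; n++) {
        thread_time0[n] = clock();
        data[n] = taskStep(n);
        thread_id[n] = (int)GetCurrentThreadId();
        thread_time1[n] = clock();
    }
    return 0;
}

// created with:
//   threadHandle[i] = (HANDLE)_beginthreadex(0, 0, &theThread, (void*)(size_t)i, 0, 0);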
kbw: You're right. The whole system is the one that gets loaded, not the threads.

What I had in mind is that when the system gets loaded, each working-thread gets a reduced amount of processor (core) time. They take longer to complete the working-steps. These are the ones I called “lazy threads.”

My real concern (and the reason I need the synchronization) is the balance among all working-threads. The main thread always waits for the completion of all working-threads. The task is done only when the slowest thread is done.

I agree that the scheduler decides what is best, but it does not look like it balances the processor time assigned to my working-threads.

The printout of the code I posted shows the time at the beginning and end of each working-step of each working-thread. They don't seem balanced or synchronized.
