Will multithreading provide any performance boosts?

Hello everyone,

I am new to programming in general so please keep that in mind when you answer my question.

I have a program that takes a large 3D array (1 billion elements) and sums up elements along the various axes to produce a 2D array, a projection of each side of the data. The problem is that it is very RAM-intensive: the program is constantly fetching information from RAM, both reading and writing.
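For concreteness, a serial version of that kind of projection might look like the sketch below. The flat, x-major layout, the function name, and the sizes are all assumptions for illustration; the poster's actual array is on the order of a billion elements.

```cpp
#include <cstddef>
#include <vector>

// Collapse a 3D array (stored flat, x-major) into a 2D projection by
// summing along the z axis. Keeping z as the innermost loop means the
// reads from `a` are sequential in memory, which is cache-friendly.
std::vector<long long> project_z(const std::vector<int>& a,
                                 std::size_t nx, std::size_t ny, std::size_t nz)
{
    std::vector<long long> out(nx * ny, 0);
    for (std::size_t x = 0; x < nx; ++x)
        for (std::size_t y = 0; y < ny; ++y)
            for (std::size_t z = 0; z < nz; ++z)
                out[x * ny + y] += a[(x * ny + y) * nz + z];
    return out;
}
```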

The question is: will I gain any performance increase if I multithread the program, or will I end up running into a RAM access bottleneck? When I say multithreading, I only mean multithreading for 2 or 4 cores, no more.

Thanks in advance,

-Faken
No, not really.

The task must be performed, if not in one thread then in another. So now you've got the overhead of managing and swapping threads in addition to doing the calculations.

Unless you have a dual-core CPU or the like, (making the calculations occur concurrently) then it isn't worth the effort. (Multiple threads != concurrence -- a single CPU can still only execute one thread at a time.)
I take it this is running on a 64-bit box (as you've exceeded the 32-bit address space).

If the system is paging to access the data, then making it multithreaded will at best increase the page demand, making the application slower.

Can you not revise your algorithm to use a smaller data set? That'd be the way to improve performance.
Thanks for the responses,

I think I was a little misunderstood.

I am running a quad-core processor, but I only have a 32-bit OS... which is giving me some problems (I'm taking steps to fix that right now).

My question is not whether an extra thread will help with the actual computation; I know it will. My worry is that since my application is so memory-intensive (mostly reading), multithreading will not help because of RAM access bottlenecks. Will I hit a bottleneck due to RAM access if I go from 1 thread to 2 or 4 parallel threads? Since my CPU is quad-core, the more threads I run, the more the RAM will be accessed in proportion: 4 threads will ask the RAM for things 4 times as often as 1 thread, or require 4 times the bandwidth.


An extra bit of information: when I run my program, it takes up all the processing power of a single core (which is expected). Does that mean the CPU's computational power is the bottleneck, since it is working at 100% and there is still room to utilize the RAM further from another core and another thread?
I don't think it will help much.

RAM is much slower than the processor, and of course it has a fixed bandwidth. So one more thread doing calculations will just wait for the RAM to deliver data.

Try for (int i = 0; i < 100000000; i++); -- it will finish in a blink because i is kept in the processor cache (extremely fast), but it would go a lot slower if you had to access different data at every step.

Here is an example:

int main() {
    int* x = new int[100000000]();   // zero-initialized so the increments below are well-defined
    int y = 0;
    //for (int i = 0; i < 100000000; i++) { int* temp = x + i; y++; }      // case 1: temp, x, i, y are kept in cache; runtime about 0.05 sec
    for (int i = 0; i < 100000000; i++) { int* temp = x + i; (*temp)++; }  // case 2: temp, x, i are kept in cache; *temp is read from RAM; runtime about 0.5 sec
    delete[] x;
}
csiz, thank you very much for your answer. That example is actually very meaningful and useful.

It shows me that the speed of the RAM access is actually a little faster than I had first expected and that I have a fair amount of headroom in the RAM bandwidth to work with.

So from what I gather so far, I should be able to take advantage of at least 2, maybe 3 parallel threads without hitting the RAM bottleneck. Of course, hitting the RAM bottleneck would actually be ideal here, as it would mean my program is running as fast as physically possible on my machine.


Just to get a few things clear: in your example, the second case does an equal number of RAM reads and writes, right?
I don't think it's right to think of the CPU and the RAM as separate entities when writing in C/C++. Sure, you could think about core caches and all that, but since you have no way to control caching, and you don't know how caching is implemented (if at all), you're just adding more complications to the code.
"RAM bottleneck" is not a useful concept. RAM is always a bottleneck in any computation that doesn't use I/O, and it is almost always at full bandwidth (actually, it uses its full bandwidth in bursts) because the CPU is capable of processing data faster than RAM can provide it.

I've never seen a parallelizable problem where RAM was a meaningful hindrance in the speed scaling. It's more common that the problem cannot be fully parallelized and threads need to be synchronized at certain points to access shared resources.
An embarrassingly parallel problem will run almost n times faster on n cores than it will on 1 core.
Well, I guess my application can be considered an embarrassingly parallel problem.

I have a 3D array being collapsed into 2D projections. Each element of a 2D projection can be computed independently by a separate thread, with the results synchronized at the end to produce an output file.

Or, since I will be producing a number of output files based on the number of "steps" contained inside that 3D array, I could have different threads handle different projection steps (each thread working independently to produce its own output files).

Which method would be more efficient? Which method is easier to implement?
The first one is most likely easier to implement.
Most of the time, it's easier to parallelize a single task (for example, have one thread process the odd elements in an array and the other the even elements, or have one process the first half and the other the second half) than to run two different tasks at the same time. Plus, two different tasks may not take the same time to finish, which means they must be synchronized at the end, wasting some time. In this sense, the first method would be faster.
This only applies when all the tasks can be easily parallelized. If some of the tasks can't, it may be better to run those in parallel (assuming, of course, said tasks are not interdependent).
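A minimal sketch of that first approach, using C++11 std::thread (on older compilers you would reach for pthreads or Boost.Thread instead). The flat, x-major layout and all names here are assumptions for illustration: each thread collapses its own contiguous range of x-planes, so no two threads ever write the same output element and no locking is needed until join().

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Split the projection by x-plane: thread t handles x in [x0, x1).
// `out` must be pre-sized to nx * ny and zero-filled by the caller.
void project_z_parallel(const std::vector<int>& a,
                        std::vector<long long>& out,
                        std::size_t nx, std::size_t ny, std::size_t nz,
                        unsigned nthreads)
{
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t x0 = nx * t / nthreads;        // this thread's first plane
        std::size_t x1 = nx * (t + 1) / nthreads;  // one past its last plane
        pool.emplace_back([&a, &out, ny, nz, x0, x1] {
            for (std::size_t x = x0; x < x1; ++x)
                for (std::size_t y = 0; y < ny; ++y)
                    for (std::size_t z = 0; z < nz; ++z)
                        out[x * ny + y] += a[(x * ny + y) * nz + z];
        });
    }
    for (auto& th : pool)
        th.join();  // wait for every slab before the output is used
}
```

Because the slabs are disjoint, the threads share only read access to the input, which is exactly the "one thread per half of the array" scheme described above.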
Thank you very much Helios!

Now all that's left is for me to do some more research on the topic and implement it!