efficiency in threaded processing

Ok, so I have a program in which I execute a series of transforms on a set of images (crop, resize, rotate, etc.). It's threaded, so it will process multiple images at a time (in my case, 8, since I have a quad-core i7). However, once it gets to the end of the set of images, there is a period of "bleed-off". Originally, I had no optimization for this, so it would go from the last 8 images to 7, 6, 5, ... until it was stuck processing the last image alone. The program had to wait until that last image was fully processed before moving to the next stage (compression/optimization).

I'm now considering several ways to reduce the impact of this bleed-off period; however, my concern is that all of them introduce some level of overhead. Thus, the larger the set of images, the bigger the overhead, and the smaller the benefit (as the bleed-off period becomes less and less of the overall execution time). On the other hand, the smaller the set is, the shorter the overall execution time, so the less it really needs optimization (if it only takes 30 seconds to execute, plus a 10-second bleed-off time, that's +33% execution time, but it's only 10 seconds, so meh). So my question is, how much overhead do these introduce? Or in other words, is it worth it?

Option 1: implement file locking, to allow the program to move from each step to the next before all threads are done processing; it will simply pause a thread if it hits a file which is locked (i.e., not done with crop/rotation/etc.). Overhead includes a mutex for each file, plus checking/locking/unlocking it.

Option 2: sort files by size and operate on the largest files first, ensuring that the last files are the smallest ones and therefore bleed-off time is minimized. Overhead includes the initial sort, and possibly further sorts as the processing continues (e.g., re-sorting after crop, as the image sizes may change drastically after that step). So, basically, one or more vector sorts (for which, again, the overhead increases as the image set grows).
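
For what it's worth, a minimal sketch of what the option-2 sort might look like (a sketch only: it assumes C++17's <filesystem> for the size query, and the function name is made up):

#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <string>
#include <utility>
#include <vector>

// Order the work queue so the biggest (presumably slowest) images go first
// and the stragglers at the end of the run are the cheapest ones.
std::vector<std::string> sort_largest_first(std::vector<std::string> files)
{
    // Query each size once, so the comparator doesn't hit the filesystem
    // O(n log n) times.
    std::vector<std::pair<std::uintmax_t, std::string>> sized;
    sized.reserve(files.size());
    for (auto& f : files)
        sized.emplace_back(std::filesystem::file_size(f), std::move(f));

    // Sort descending by size.
    std::sort(sized.begin(), sized.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });

    std::vector<std::string> out;
    out.reserve(sized.size());
    for (auto& p : sized)
        out.push_back(std::move(p.second));
    return out;
}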

There are, of course, several other options. I know that this falls into the "it depends on the situation" category to a degree, but I'm looking for a general opinion on the overhead (I have a good idea of the bleed-off cost). If mutex creation/checking/locking/unlocking has practically no overhead (similar to an if statement), then it's probably worth implementing, whereas if it's heavy, then it's probably not.
I'm not altogether clear on what exactly you've done. But it reminds me of a talk Rob Pike gave called "Concurrency is not Parallelism".
http://vimeo.com/49718712
Essentially, here is a simplified flow:

(in main)
for (first file; last file; increment file)
    wait until current threads < maximum threads
    increment threadcount
    launch new image processing thread (pass the current filename)
wait until all threads finish
add all files to a zip file

(image processing function) [filename was passed]
if (rotate) rotate file
if (crop) crop file
if (resize) resize file
if (greyscale) greyscale file
if (posterize) posterize file
decrement threadcount
Now, this is all well and good -- no, it's not 100% efficient, but it's close enough. In essence, instead of processing one image at a time, it processes x images at a time, with very little overhead (due to use of a semaphore). The issue arises when the number of active threads, y, is less than the maximum number of threads, x. Theoretically, this happens whenever a thread finishes (and when starting the first x threads), but due to the tight thread-launching loop, this is minimal (likely in the nanoseconds range). The big time-killer is the line "wait until all threads finish". In theory, in a worst-case scenario, this constitutes a performance hit of x/(z-1) (where z is the number of files to be processed).
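
To make the flow concrete, here is a rough, self-contained sketch (not the actual program; every name is invented) using a condition-variable-guarded counter as the semaphore:

#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
int active = 0;                       // "threadcount"
const int max_threads = 8;            // x

void process_image(const std::string& file)
{
    // if(rotate) rotate file; if(crop) crop file; ... (elided)
    std::lock_guard<std::mutex> lk(m);
    --active;                         // "decrement threadcount"
    cv.notify_all();                  // a slot freed up, or we were the last
}

int main()
{
    std::vector<std::string> files = {"a.png", "b.png", "c.png"};
    for (const auto& f : files) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return active < max_threads; }); // wait for a slot
        ++active;                     // "increment threadcount"
        std::thread(process_image, f).detach();
    }
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return active == 0; }); // "wait until all threads finish"
    // add all files to a zip file ...
}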

The issue is that the "fix" for this, whatever it is (and there are many options), will introduce overhead. This overhead (in absolute terms in every case, and in relative terms in some cases) will scale with z; i.e., the overhead is a*z (sometimes a^2*z), where a is the overhead per file or iteration.

So the question is: do the fixes diminish the bleed-off penalty of (worst case) x/(z-1) enough to make up for the overhead penalty of a*z? Obviously, the answer is "it depends on z" (the number of files), but given that z changes each run, I need to get an idea of a in order to come up with a general solution (I already have an idea of x).

An example would be:

(in main)
for (first file; last file; increment file)
    wait until current threads < maximum threads
    increment threadcount
    lock (filename)
    launch new image processing thread (pass the current filename)
for (first file; last file; increment file)
    wait until current file is unlocked
    add file to zip

(image processing function) [filename was passed]
if (rotate) rotate file
if (crop) crop file
if (resize) resize file
if (greyscale) greyscale file
if (posterize) posterize file
unlock (filename)
decrement threadcount

As you can see, this removes the "wait until all threads finish" roadblock, reducing the bleed-off penalty to 0 in the best-case scenario (which most if not all large image sets would hit). However, it adds overhead by 1) requiring file locks to be created, 2) requiring each file to be locked and unlocked, and 3) requiring the zip section to check whether each file is locked. In this case, I'd be using semaphores, so the question becomes how much overhead creating, locking, unlocking, and checking the lock state of a semaphore introduces.
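
A rough sketch of this variant, assuming C++20's std::binary_semaphore for the per-file "lock" (a plain std::mutex wouldn't do here, since it would be locked in one thread and released in another, which is undefined for mutexes); all names are invented, and the thread-count throttling from the launch loop is elided:

#include <functional>
#include <semaphore>
#include <string>
#include <thread>
#include <vector>

struct Job {
    std::string file;
    std::binary_semaphore done{0};   // 0 = "locked": still being processed
};

void process_image(Job& job)
{
    // rotate / crop / resize / greyscale / posterize ... (elided)
    job.done.release();              // "unlock (filename)"
}

int main()
{
    std::vector<Job> jobs(3);
    jobs[0].file = "a.png"; jobs[1].file = "b.png"; jobs[2].file = "c.png";

    std::vector<std::thread> workers;
    for (auto& j : jobs)
        workers.emplace_back(process_image, std::ref(j));

    for (auto& j : jobs) {
        j.done.acquire();            // "wait until current file is unlocked"
        // add j.file to the zip ...
    }
    for (auto& w : workers) w.join();
}

Note the zip loop consumes files in launch order, so one slow early file can hold up later files that are already done.
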
Can't your processing threads just go grab the next file from the list of files? E.g., fill a non-blocking concurrent queue with file names (or some equivalent); each thread tries to pop, does its processing and loops if the pop succeeded, and quits if the queue is empty.
That's close enough to what it does. There are a couple of technical reasons why I've decided to go with a main thread that spawns child threads -- one being that said child threads themselves spawn child threads, which themselves can multi-thread if need be (which wouldn't be possible if I just had x threads).

However, either way, the same problem would occur -- unless I misunderstood and you meant to just launch a thread for every file (at the same time)... which would be a disaster: not only would it set off cache thrashing, it would quickly fill up and overflow the system RAM. This program is meant to process hundreds to thousands of images at a time.

But assuming my first paragraph is correct, the same problem would remain -- the last 8 threads would finish one by one, until the last thread was running alone and blocking the remainder of the program from running until it finished (which I've termed "bleed-off", because you need to bleed off those remaining threads before moving on).
When processing thousands of images, does it really matter that the last seven might under-utilize the CPU? If processing time grows predictably with file size, then yes, the largest files should go first. It takes negligible time (compared to spawning a single thread, which you seem to be doing gratuitously) to sort a vector holding a thousand file sizes.

Also, you don't need file locks to start adding the processed files to the zip as soon as there is at least one processed file. Only the archive-building thread, not the entire filesystem, needs to know which files are ready; the processing threads can tell it using anything from message queues to per-file atomic 'done' flags to condition variables, etc.
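
One way to sketch that last idea: per-file atomic 'done' flags plus a single condition variable, so only the archive builder has to coordinate (all names invented):

#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::vector<std::string> files = {"a.png", "b.png", "c.png"};
std::vector<std::atomic<bool>> done(files.size());  // all start false
std::mutex m;
std::condition_variable cv;

void process_image(std::size_t i)
{
    // transforms on files[i] ... (elided)
    {
        std::lock_guard<std::mutex> lk(m);  // held so no wakeup is lost
        done[i] = true;
    }
    cv.notify_one();                        // poke the archive builder
}

int main()
{
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < files.size(); ++i)
        workers.emplace_back(process_image, i);

    // Consume files in order, sleeping only while the next one isn't ready.
    for (std::size_t i = 0; i < files.size(); ++i) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [i] { return done[i].load(); });
        // add files[i] to the zip ...
    }
    for (auto& w : workers) w.join();
}
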
Ah, excellent! That is the kind of stuff that I wanted to know. Being nearly entirely self-taught, I didn't know that sorting a vector is fairly light on overhead (in this context); I know how to sort a vector in C++, but not how expensive it is under the hood.

You're right in that, given even 1,000 files and 8 threads, the worst case would be 8/999, or a mere 0.8% -- and I specifically stated that this was intended for sets of hundreds to thousands (a set of 100 with 8 threads would be ~8.1%). But as A) the number of threads grows, and B) the set of images shrinks, this can inflate greatly -- and I'd like to at least decently future-proof it and consider a wide range of usage. In 4-5 years, when people have 8-16 core CPUs (16-32 threads based on current trends), even a set of 500 files would lose 3-6%, and a set of 50 files would take a massive hit of 33-65%. Not to mention even farther down the road, or smaller sets.

As for file locks, I meant simply adding a mutex to the filename array and using it as a "file lock" -- not the actual Boost file lock implementation (which would be problematic for me based on functionality alone, plus it would be heavier). I assume that would be lighter than message queues (though I'm not entirely certain).

Anyways, you've answered my question -- the answer is what I figured it would be, but I wanted to make sure before I wrote thousands of lines of code and found that they unacceptably bloated the end product.

On a side note, can you expound on the comment about spawning threads? Granted, I am doing it gratuitously, but it seems like you're implying that doing so is extremely costly (or perhaps I'm misreading it). Is that the case? If we're talking about "heavy" in relative terms compared to, say, not spawning threads, then it's no problem -- the program is doing some seriously heavy lifting in terms of compression, etc., so a few KB or seconds here and there in overhead from threads is worth it to ensure full saturation (and a logical program structure, at least to my mind). If, on the other hand, we're talking serious hits here, then that's another matter. Also, it may (or may not) be relevant to mention that I'm using global variables to make large memory objects (i.e., images) available to each successive thread in a chain (so I'm not recreating the same variables for each thread).
can you expound on the comment about spawning threads?

It's just such a major piece of activity for the OS: setting up virtual memory for the new thread, creating the execution context, allocating the stack, scheduling. And you, as a programmer, have little control over it.

Actually I may have exaggerated: I ran a few microbenchmarks right now, and spawning one thread is comparable to sorting a vector with a thousand ints on my hardware. It's still several times longer than sending a condvar message to a thread that's already up.
       spawn   join    sort   msg/condvar   msg/spinlock
linux  39 us   30 us    51 us    11 us         1.8 us
sun    67 us   47 us   110 us    27 us         0.8 us
ibm    21 us   16 us    40 us   5.5 us         1.2 us

code: http://ideone.com/Q3tFQO (on Linux, link with -pthread -lrt) -- sorry for pthreads, I was curious to try it where I don't have a C++ threads implementation.

EDIT: added lock-free spinlocks for some nice contrast
Why not consider the use of a thread pool?

You can spawn your threads once, then put objects into a thread-safe queue to get pulled out and worked on. This would give you an effective execution time of something like 99.X%, because of the lower overhead in putting objects into the queue's source job list.
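
For illustration, a rough sketch along those lines, with a simple mutex-guarded queue standing in for the thread-safe queue (a lock-free queue would also work; all names are invented):

#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> jobs;
std::mutex qm;

// Try to take one job; returns false when the queue is drained.
bool pop_job(std::string& out)
{
    std::lock_guard<std::mutex> lk(qm);
    if (jobs.empty()) return false;
    out = std::move(jobs.front());
    jobs.pop();
    return true;
}

void worker()
{
    std::string file;
    while (pop_job(file)) {
        // rotate / crop / resize / ... on 'file' (elided)
    }                                 // queue empty: thread exits
}

int main()
{
    for (int i = 0; i < 1000; ++i)
        jobs.push("image" + std::to_string(i) + ".png");

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        pool.emplace_back(worker);
    for (auto& t : pool) t.join();    // bleed-off still happens here, but
                                      // each straggler is only one file deep
}
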
Cubbi: again, that's perfect, thanks! In the context of my program, where for instance a single optimization run may take 30 seconds (and there are often dozens if not more per file), something under 100 ms is pretty much insignificant (though of course, in some programs, that could cause major slowdowns).

zaita:

Well, using a thread pool wouldn't address the bleed-off issue, but in theory it would be more efficient than spawning new threads all the time. It would make the code considerably more complex, however.

Essentially, the way the program runs now, there are dozens of different operations that it carries out. The options for each operation, as well as whether to even execute it, are fed in by the user (via an INI file or command-line arguments). Some options can accept an indefinite number of different settings (e.g., resize), which "forks" the flow. The order of operations is non-changeable.

As a simple example, take rotate, crop, and resize. The options given might specify 1 rotation (of 90 degrees), no crop, and two resizes, in which case the image in memory would be rotated, copied, and the two copies independently resized. That's simple enough, but if you've got, for instance, 2 rotates, 3 crops, and 4 resizes, you wind up with 24 outputs per file. Granted, I could take care of all that with a vector for the memory objects and loops to iterate each setting through each file, but I (at least) find it a lot easier to visualize by doing it this way:

launch 2 rotate threads, each with a copy of the original image
each rotate thread executes, then launches 3 crop threads, each of which executes and launches 4 resize threads

The programmatic disadvantage is that one thread can quickly balloon into many, but this is also the advantage -- it means that you can process all the different variations of a single file in parallel rather than one by one. Plus, it means I can code separate functions for each operation (e.g., a rotate function which rotates) and then just launch the function in a thread, which is a lot easier than loading different functions into a pooled thread.
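
As a toy sketch of that fan-out (the Image type and the transform bodies are stand-ins invented for illustration; each stage copies its input and spawns one thread per setting for the next stage):

#include <functional>
#include <string>
#include <thread>
#include <vector>

struct Image { std::string name; };

Image rotate(Image img, int deg) { img.name += "_r" + std::to_string(deg); return img; }
Image crop  (Image img, int box) { img.name += "_c" + std::to_string(box); return img; }
Image resize(Image img, int pct) { img.name += "_s" + std::to_string(pct); return img; }

void resize_stage(Image img, int pct)
{
    Image out = resize(img, pct);
    // save / compress 'out' here ...
}

void crop_stage(Image img, int box, const std::vector<int>& sizes)
{
    Image out = crop(img, box);
    std::vector<std::thread> kids;
    for (int pct : sizes)                 // 4 resizes per crop
        kids.emplace_back(resize_stage, out, pct);  // each gets its own copy
    for (auto& t : kids) t.join();
}

void rotate_stage(Image img, int deg,
                  const std::vector<int>& boxes, const std::vector<int>& sizes)
{
    Image out = rotate(img, deg);
    std::vector<std::thread> kids;
    for (int box : boxes)                 // 3 crops per rotation
        kids.emplace_back(crop_stage, out, box, std::cref(sizes));
    for (auto& t : kids) t.join();
}

int main()
{
    Image original{"photo"};
    std::vector<int> rots{90, 180}, boxes{1, 2, 3}, sizes{25, 50, 75, 100};
    std::vector<std::thread> kids;
    for (int deg : rots)                  // 2 rotations
        kids.emplace_back(rotate_stage, original, deg,
                          std::cref(boxes), std::cref(sizes));
    for (auto& t : kids) t.join();        // 2*3*4 = 24 outputs
}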

The good news is that nearly everything is thread-proof (i.e., doesn't need to be made thread-safe), as each step takes the previous step's output -- it doesn't reference outside data, so it's more like a pipeline. The only things referenced by multiple threads are things like the settings (e.g., for resize), and those are read-only (they are initialized at the beginning of the program and never changed).