Threading Advice

From this post:
http://www.cplusplus.com/forum/lounge/109303/#msg595161

I'd like to create the said program while trying to integrate threading. Now, I know threading is a difficult concept to just up and learn, but I believe I've understood enough of the basic concept that I can attempt this. I would like some advice on what is and what isn't needed though.

Currently, running it through my mind, I've come up with three main threads. My first one will scan the selected directory for all corresponding directories and sub-directories looking for valid files and add them to a list of path pointers. The next thread will be one that handles and displays a progress bar, similar to that of cire's example in the same thread. And the last being a thread that copies the files over.

I have some questions about this though. While threading, do you need to use mutexes for variables that aren't written to, specifically my list of valid extensions that I am searching for? Is it feasible to have more than one thread running the same function, specifically having two threads searching for files, one starting from the top, one starting from the bottom? Would it be more feasible to have one thread search directories for files and pass said files to a different thread to evaluate the extensions?

I'm a little confused as to what way would provide optimal performance through the entire process.
Doesn't seem like the place to use threads.

The next thread will be one that handles and displays a progress bar
Displaying a progress bar doesn't need to happen in a thread separate from the one that performs the operation. In fact, to avoid thread synchronization, it's best if it doesn't. You simply structure your program like so:
while (state.not_done()){
    progress_t progress = state.perform_next_step();
    progress_bar.update(progress);
}
update() knows not to update the progress bar if not enough progress has been made since the last update.
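To make that concrete, here's a minimal sketch of what such a throttled update() might look like (the class and member names are invented for illustration, and it only redraws when progress has advanced by a whole percent):

```cpp
#include <iostream>

// Hypothetical progress bar that only redraws when progress has
// advanced by at least one whole percent since the last redraw.
class ProgressBar {
    int last_drawn_percent = -1;
public:
    int draws = 0;  // how many times we actually redrew (for illustration)

    void update(double fraction) {  // fraction of work done, in [0.0, 1.0]
        int percent = static_cast<int>(fraction * 100.0);
        if (percent == last_drawn_percent)
            return;                 // not enough progress; skip the redraw
        last_drawn_percent = percent;
        ++draws;
        std::cout << '\r' << percent << '%' << std::flush;
    }
};
```

Calling update() on every loop iteration is then cheap: most calls return immediately without touching the display.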

And the last being a thread that copies the files over.
This function needs the scan to have completed first, so there's no point in using a separate thread. Even if it didn't, performing I/O with a single device on separate threads is usually a bad idea in terms of performance, especially when talking about hard drives.

Is it feasible to have more than one thread running the same function, specifically having two threads searching for files, one starting from the top, one starting from the bottom?
It's possible and quite common, but your example is not a good one for the reason I outlined above. Unless your directory hierarchy is spread across multiple devices (e.g. if there's some symbolic linking or RAID or odd directory mounting configurations), throwing more threads at the problem won't increase performance.

Would it be more feasible to have one thread search directories for files and pass said files to a different thread to evaluate the extensions?
This is more sensible, since the discrimination doesn't require I/O. Still, scanning a complete directory hierarchy is such an expensive operation that parallelizing this won't make much of a difference.

My advice is to start out with something more CPU-intensive and highly parallel, like image processing/rendering. Say, some basic ray tracing. Anything with a linear procedure to follow, where each step needs the result from the previous step, will parallelize poorly.

EDIT:
While threading, do you need to use mutexes for variables that aren't written to, specifically my list of valid extensions that I am searching for?
Reads only are safe. Writes only are safe (but variables and objects are almost never written without being read first). Reads combined with writes can be safe (e.g. using a global volatile flag to tell a thread to stop is only theoretically unsafe), but are generally not (e.g. incrementing a global counter) on account of the fact that a thread may be suspended or "preempted" at any time for any period of time.
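For instance, the global-counter case becomes safe if the counter is a std::atomic, which turns the read-modify-write into a single indivisible operation so a preempted thread can't cause lost updates (a sketch with invented names, not code from this thread):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A plain int incremented from several threads is a data race;
// std::atomic makes each ++ a single indivisible read-modify-write.
std::atomic<int> safe_counter{0};

void bump(int times) {
    for (int i = 0; i < times; ++i)
        ++safe_counter;  // atomic increment: no lost updates even if preempted
}

int count_with_threads(int n_threads, int times_each) {
    safe_counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back(bump, times_each);
    for (auto& w : workers)
        w.join();        // wait for every increment to land
    return safe_counter.load();
}
```

With a plain `int` instead of the atomic, the final count would typically come up short.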
Separation of concern is a very valid reason to use multiple threads - stuffing every action in the same loop of a sufficiently complex program is going to make it unmaintainable.
For educational experience, I would suggest a program that just has two threads: one produces a list of work items (path names, it sounds like?) and appends them to a thread-safe queue (the Intel TBB library has some), while the other thread consumes the items off the queue and processes them. Once you have that running, you can see for yourself how each thread is spending its time.
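A minimal version of that two-thread producer/consumer can be built with just the standard library, using a mutex and a condition variable as a stand-in for TBB's concurrent containers (all names here are invented for illustration):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Tiny thread-safe queue: the producer pushes path names, the
// consumer pops them. finish() tells the consumer no more work is coming.
struct WorkQueue {
    std::queue<std::string> items;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::string s) {
        { std::lock_guard<std::mutex> lk(m); items.push(std::move(s)); }
        cv.notify_one();
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
    // Returns false once the queue is drained and the producer is done.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !items.empty() || done; });
        if (items.empty()) return false;
        out = std::move(items.front());
        items.pop();
        return true;
    }
};

std::vector<std::string> run_pipeline(const std::vector<std::string>& paths) {
    WorkQueue q;
    std::vector<std::string> processed;
    std::thread producer([&]{
        for (const auto& p : paths) q.push(p);
        q.finish();                          // signal "no more files"
    });
    std::thread consumer([&]{
        std::string item;
        while (q.pop(item)) processed.push_back(item);  // "process" = collect
    });
    producer.join();
    consumer.join();
    return processed;
}
```

The done flag plus condition variable is what lets the consumer distinguish "queue temporarily empty" from "producer finished for good".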

And yes, reading does not require synchronization as long as no thread can possibly ever attempt to write to that variable (in practice, this means your data was created before any of your reader threads were launched, and is unmodified until after all of them shut down).
It's not, though. Coroutines work just as well for that kind of thing without the problems implied by preemption. Threads are spiky things you don't want to mess with unless you really have to.
@helios
I understand, and greatly appreciate, your concern, but the way I look at it is if I play with fire, I take the risk of being burned. But on the same side, if I fear threads, I may never feel comfortable using them. I went years fearing pointers until I made the attempt of making a linked list, which proved to be an invaluable learning experience for me (I have a new found respect for them).

That being said, I already know enough that I need to respect the concept of concurrency and the vast dangers that can be associated with them, much like pointers. So far I've only thought linearly, and up until recently, it wasn't very practical to think otherwise.

I have reevaluated my thoughts on the program, and I'm already beginning to realize how using concurrency in a seemingly simple program can greatly complicate things. I have narrowed my want to use threads down slightly but have run into other concerns. See this post: http://www.cplusplus.com/forum/lounge/109303/#msg595682

On a side note, I still have some more research to do on concurrency in general, but I have a few questions about how C++ handles threads and the various functions and data types associated with them. I'm confused by the thread member functions join, detach, and swap. So far, I have learned that joining a thread forces the "parent" thread, if you will, to wait for that sibling thread to finish, essentially going from seemingly parallel processing back to linear. From what I read, detach basically makes said sibling thread go off on its own, finishing as it pleases, not affecting the parent thread aside from possibly editing global or other variables accessible by other threads. And as for swap, it basically seems like it takes two thread handles and swaps the job of each. Is this a valid understanding of them? Where or when do detach and swap come into play? Join is easy; it basically forces desired threads to merge back into the parent thread.

I'm still trying to grasp the concept of concurrent variables too, specifically mutexes, and how to lock and modify them. I read an article on SO that explained a mutex as a rubber chicken passed around a roomful of people: only the person holding the chicken is allowed to modify the shared data. That concept makes sense, but do mutexes lock themselves, do they need to be explicitly locked and unlocked (I'm leaning more towards the latter), or is there some other process? I also read up on latches and one other data type.

Finally, from my understanding, threads are handled in a similar fashion to C++ function pointers, meaning that variables can be passed to each thread upon creation, essentially eliminating the need for global variables. Ideally, this seems most practical; however, if I pass a mutex by reference and handle the locking portion correctly in each thread, should I have any concerns about races?

Oops, and one last thing that I asked in my other thread: is there any way to detect when a thread has finished its processing without the use of a separate variable, or is that about the only surefire method?
Is this a valid understanding of them?

essentially, yes
Where or when do detach and swap come into play

detach is rare. I've been doing concurrent programming for over 12 years and I think I've detached threads maybe once or twice, and I am not sure those were good design choices. For example, I had a 300+ thread real-time application that runs continuously (6 months to a year between new releases), and occasionally there would be a small job that none of the already running threads was allowed to do, so I would spawn a detached thread and let it run at low priority to finish whenever.

swap (and its C++11 cousin, move) is mostly for manipulating thread objects in the code that owns them, since they cannot be copied.
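A quick sketch of what that looks like in practice: std::thread's copy constructor is deleted, so handles have to be moved (or swapped) to get them into containers (names here are invented):

```cpp
#include <atomic>
#include <thread>
#include <utility>
#include <vector>

// std::thread objects can't be copied, only moved or swapped, so the
// code that owns them shuffles the handles around with std::move.
std::atomic<int> jobs_run{0};

int run_moved_threads(int n) {
    jobs_run = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        std::thread t([]{ ++jobs_run; });
        // std::thread t2 = t;           // error: copying is deleted
        workers.push_back(std::move(t)); // OK: ownership moves into the vector
    }
    std::thread a, b;
    a.swap(b);                           // swapping two handles is also fine
    for (auto& w : workers)
        w.join();
    return jobs_run.load();
}
```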

do mutexes lock themselves, do they need to be explicitly locked and unlocked (I'm leaning more towards the latter), or is there some other process?

POSIX mutexes need to be explicitly locked and unlocked. C++/Boost mutexes can be used that way as well, but their intended use is the "some other process" option: scoped lock objects. Look up lock_guard and unique_lock.
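For example, with a lock_guard the mutex is locked in the constructor and unlocked in the destructor, so it gets released even on early return or exception (a hedged sketch; the functions and variables are made up):

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// RAII locking: construction locks, destruction unlocks. No explicit
// unlock call anywhere, so no path through the code can forget it.
std::mutex list_mutex;
std::vector<int> shared_list;

void append_value(int v) {
    std::lock_guard<std::mutex> lk(list_mutex);  // locks here
    shared_list.push_back(v);
}                                                // unlocks when lk leaves scope

std::size_t fill_from_two_threads(int per_thread) {
    shared_list.clear();
    std::thread t1([=]{ for (int i = 0; i < per_thread; ++i) append_value(i); });
    std::thread t2([=]{ for (int i = 0; i < per_thread; ++i) append_value(i); });
    t1.join();
    t2.join();
    return shared_list.size();
}
```

unique_lock works the same way but can additionally be unlocked and re-locked mid-scope, which is what condition variables need.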

is there any way to detect when a thread has finished its processing without the use of a separate variable

You can have it communicate that information over a condition variable, an atomic variable, a promise, a message, a semaphore, or in some other way; inter-thread communication is a big topic.
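As one concrete option for the "has it finished yet?" question, a std::future reports the worker's status without any hand-rolled flag (sketch only; the function names are invented):

```cpp
#include <chrono>
#include <future>
#include <thread>

// Simulated slow work running on another thread.
int slow_square(int x) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return x * x;
}

int wait_for_result(int x) {
    // launch::async guarantees a real thread runs slow_square.
    std::future<int> f = std::async(std::launch::async, slow_square, x);
    // wait_for lets us poll "done yet?" without blocking forever.
    while (f.wait_for(std::chrono::milliseconds(5)) != std::future_status::ready) {
        // the launching thread could update a progress bar here
    }
    return f.get();  // the worker has definitely finished at this point
}
```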
Volatile Pulse: I didn't mean to imply that threads are so difficult to handle that you should never use them. My main point is that if you want to play with threads, do so in a domain where you'll notice an improvement from using them. For this application, parallelization is pretty pointless. rapidcoder has some fairly lucid points on that other thread.
You should read C++ Concurrency in Action. When you google it, the third hit is the full pdf.
I still need to do more reading on the implementation of concurrency, so that could very well be my bedtime book for the next few days or so.

@helios
I believe you're right, but it was a project idea that spawned from the concept of using a GUI, so the thread idea seemed natural. I'm really beginning to think this is almost best as a linear program, but if you, or someone else, could throw in an extra idea for a way to use threads in the program, even if it's a completely new feature, I'd greatly consider it, since I'm becoming determined to use them.

@Cubbi
I still need to do a lot of research on the variables, but inter-thread communication variables seem to be something new altogether. I'll have to learn more about them, but an atomic variable, from what I read before, is similar to a static variable, correct? Is it also the most commonly used one?

As a possible future project, does anyone have any solid ideas for a concurrency learning program? I feel the problem I'm having is I'm trying to force parallelism onto linear programs, but I'm having a hard time understanding how true parallelism would really come in handy, aside from sockets. I've seen several learning programs, but they're too basic and not practical enough and I'm ready to get my feet wet, even if it means jumping in head first.
inter-thread communication variables seem to be something new altogether. I'll have to learn more about them, but an atomic variable, from what I read before, is similar to a static variable, correct? Is it also the most commonly used one?

No, condition variables are the most common communication primitive. See http://en.cppreference.com/w/cpp/thread/condition_variable#Example for an example.
if you, or someone else, could throw in an extra idea as to a way of using threads in the program, even if its a completely new feature, I'd greatly consider it since I'm becoming determined to use them.
Okay, I got one: Use a separate thread to create hash digests as you're copying. You'll need to figure out ways to:
1. Have a thread wait for a new workload or for the signal to quit.
2. Transmit workloads to the thread, and of different kinds ("a new file starts now", "here's a new buffer for the current file", "the file ends now").
3. Have the main thread wait for the workload to finish. Note that since the main thread will perform I/O, the other thread will very likely always finish first.
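One possible shape for those three message kinds, sketched with invented names (a real implementation would carry actual data buffers rather than strings, and the messages would travel over a thread-safe queue):

```cpp
#include <string>
#include <vector>

// The three workload kinds the hashing thread would receive.
enum class MsgKind { FileStart, FileChunk, FileEnd };

struct Message {
    MsgKind kind;
    std::string payload;  // file name for Start/End, buffer contents for Chunk
};

// The hashing thread's dispatch loop would branch on the kind; here we
// just log what each message would trigger, to show the structure.
std::vector<std::string> describe(const std::vector<Message>& msgs) {
    std::vector<std::string> log;
    for (const auto& m : msgs) {
        switch (m.kind) {
        case MsgKind::FileStart: log.push_back("start " + m.payload);  break;
        case MsgKind::FileChunk: log.push_back("hash chunk");          break;
        case MsgKind::FileEnd:   log.push_back("finish " + m.payload); break;
        }
    }
    return log;
}
```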
@Cubbi
I've been reading up on the use of the variables themselves, but I'm still confused about the reasoning behind them. I saw they're basically signals to other threads to let them know to continue, essentially pausing a thread until that variable is in a "go ahead" state. It looks like it would be less processor-intensive than a while loop on a Boolean, but will produce the same results code-wise. I'm reading the PDF that htirwin suggested and hopefully it will get me there.

@helios
I'm not sure I know what a hash digest is that you're referring to; the only thing Google turned up is message digests using MD5.
A hash digest is the result of applying a hash function to a string of bits. MD5 is a hash function. While I'm at it, the purpose of hashing files that are being backed up is error detection. The probability of a file and/or its MD5 digest being corrupted in such a way that when you hash the file again you get the stored digest, is 2^-128.
So, you're recommending creating an MD5 hash for each file copied and checking it after it's copied to ensure it copied properly? Another thing I was wondering, since I'm at work and can't check for myself at the moment: is there any way to break a file into smaller chunks during the copy process using Boost Filesystem? Similar to the way OSes copy files, showing a progress bar as they transfer so you know it isn't frozen, or when you download a file online. I believe they're typically called segments, packets, or sections. I'm going to create some rough pseudocode once I get home to make sure I understand everything you're suggesting.

Also, reading up on MD5, it was recommended to use SHA-2 instead, as it's supposed to be more accurate. Does it matter? I assume more accurate may mean more overhead as well.
So, you're recommending creating an MD5 hash for each file copied and checking it after it's copied to ensure it copied properly?
No, no. Just store it somewhere so it can be re-computed and compared upon backup restoration.

MD5 is proscribed for cryptographic purposes because it's possible to cause collisions relatively easily. For data integrity, any old hash is good enough, since storage devices malfunction at random.
WikiPedia wrote:
Hash functions are destructive, akin to lossy compression, as the original data is lost when hashed. Unlike compression algorithms, where something resembling the original data can be decompressed from compressed data, the goal of a hash value is to uniquely identify a reference to the object so that it can be retrieved in its entirety.


Two possible meanings of this is that:
1) the hash function finds specific parts of the file (i.e. the size, last modified date, a specific section of the file) and converts them to a hash, with the size, last modified date, etc. no longer recoverable. Let's say this creates a simple hash of rfakjd54.

2) the hash function takes the entire file, in its entirety, creates a hash from it, and deletes the file, and the file can later be recreated from the hash.

To me, number 1 is more like a checksum approach and seems more realistic, whereas 2 seems more like a compression algorithm and less likely to be what happens with a hash.

In this situation, is a checksum more realistic? Should I create my own algorithm to create a hash? I don't believe the boost filesystem supports a native hash function.

Aside from that, here is a basic rundown of what I've been considering: the main thread launches the separate threads (mainly to keep the clutter down). One thread scans the directories, adds the paths to a list, and notifies another thread that it's OK to process that item. Another thread waits for notification, processes the file (creates a hash based off of that file), passes it on to another thread for copying, then waits for confirmation that the file completed and checks the hash. A third thread purely copies the files to and from, waiting for hashes to be created and then checked.

Maybe it is stretching it a little far, but I think that's a fair way to look at it. I'm concerned that my second spawned thread either isn't needed, or should be split into one that receives a path and creates a hash of a file to be copied, and one that receives a path and creates a hash of a file that was copied.

I know this is stretching the desire to use threads, but I have so many thoughts running through my mind at the moment that it's hard to piece it all together. I wanted to create some pseudocode before the weekend, but wanted to make sure the concept was good first.

Also, I mentioned it above, but do you know of any way to copy a file in chunks without using a separate library? I don't NEED it, but I would like to implement it if possible.
#1 is more accurate. Hash functions take as input the entire file as a binary string and produce another binary string that identifies the file unidirectionally. That is, you can't recover the file from the digest, but if you hash another file and the two digests are equal, you can say that the two files are probably equal (the probability depends on the particular hash function and the size of its digest). If the hashes are different, the files are definitely different.

Checksums are like hash functions, in that they're both mappings of binary strings onto binary strings.
Hash functions are primarily used to uniformly distribute binary strings along a number space, for varying purposes.
Checksums are primarily used for validation, error detection, and error correction, but they don't work for strings of arbitrary size, so the amount of checksum data required to correct errors in a string grows with the size of the string.
On the other hand, a hash digest is of fixed size, but can only tell you whether a string as a whole is different, not which parts of it are or how to fix them.
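To illustrate the fixed-size-digest idea, here's a toy 64-bit FNV-1a hash: input of any length maps to a single 64-bit value, equal inputs always collide, and different inputs almost never do. For real backups you'd use MD5/SHA from a library, not this:

```cpp
#include <cstdint>
#include <string>

// Toy 64-bit FNV-1a hash. Maps a byte string of any length to one
// fixed-size digest; you cannot recover the input from the output.
std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 1469598103934665603ull;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;                                // fold in one byte
        h *= 1099511628211ull;                 // FNV prime
    }
    return h;
}
```

Hashing a gigabyte file or a three-byte string both yield exactly 8 bytes of digest, which is why a digest can only say "something differs", never where.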

Should I create my own algorithm to create a hash?
No.

I don't believe the boost filesystem supports a native hash function.
Why would it? And what do you mean by "native"?

Another thread will wait for notification, process the file (create a hash based off of that file), and pass onto another thread that it can copy
Hash the file as you're copying it, otherwise you'll have to read each file twice. To do this, you can simply load chunks of it to memory with standard file I/O functions, do the processing, then write the chunk to the destination.
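A sketch of that single-pass copy-and-hash loop over generic streams (the running hash here is toy FNV-1a standing in for MD5/SHA, and the names are invented):

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Copy a stream in chunks while feeding each chunk into a running
// hash, so the data is read only once. The final digest is chunk-size
// independent because the hash consumes the bytes in order.
std::uint64_t copy_and_hash(std::istream& in, std::ostream& out,
                            std::size_t chunk_size = 64 * 1024) {
    std::uint64_t h = 1469598103934665603ull;  // FNV-1a offset basis
    std::vector<char> buf(chunk_size);
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();
        if (got <= 0) break;
        for (std::streamsize i = 0; i < got; ++i) {  // hash this chunk
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;
        }
        out.write(buf.data(), got);                  // then write the same chunk
    }
    return h;
}

// Demonstration helper: run a string through the loop and return its hash.
std::uint64_t copy_string_demo(const std::string& s, std::string& copied,
                               std::size_t chunk) {
    std::istringstream in(s);
    std::ostringstream out;
    std::uint64_t h = copy_and_hash(in, out, chunk);
    copied = out.str();
    return h;
}
```

With real files you'd open an std::ifstream/std::ofstream in binary mode and pass them to the same function.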

the main thread launches the separate threads (mainly to keep the clutter down). One thread scans the directories, adds the paths to a list, and notifies another thread that it's OK to process that item. Another thread waits for notification, processes the file (creates a hash based off of that file), passes it on to another thread for copying, then waits for confirmation that the file completed and checks the hash. A third thread purely copies the files to and from, waiting for hashes to be created and then checked.
It seems like you're using threads to reinvent subroutine calling mechanisms. Instead of making calls, you're using IPC to pass data between functions. I'm going to call this "concurrent spaghettification of code" because it's like performing unstructured jumps across subroutine boundaries.
Your scheme could be losslessly translated to this (for the sake of argument, I'm omitting parallelization of hashing):
for (directory_iterator i = begin_scan(extensions), e = no_more_files(); i != e; ++i){
     hash_t h = hash_files(i.path());
     path_t dst = generate_destination(i.path());
     copy_file(dst, i.path());
}
None of the functions need to yield or perform any IPC. How would this be more cluttered than what you're proposing?
I don't believe the boost filesystem supports a native hash function.
Why would it? And what do you mean by "native"?

As in, support for hashing a file. I wasn't sure if this was a common thing or not. And if I'm not supposed to make my own hash function, should I just do some Googling to find a basic one? I believe boost::filesystem is fairly limited in what information it can obtain from a file.

Hash the file as you're copying it, otherwise you'll have to read each file twice. To do this, you can simply load chunks of it to memory with standard file I/O functions, do the processing, then write the chunk to the destination.

I didn't plan on copying the files using file streams; I planned to use boost::filesystem, which has its own copy function. My program doesn't open or read the files, per se, so I would only be opening them to hash them. Was your proposal based on manually copying the file from one fstream object to another? Wouldn't this be slower than using Boost?

for (directory_iterator i = begin_scan(extensions), e = no_more_files(); i != e; ++i){
     hash_t h = hash_files(i.path());
     path_t dst = generate_destination(i.path());
     copy_file(dst, i.path());
}

I see what you mean: I could just use my main thread (hashing, new destination, copying, checking hash, moving to the next one) and a helper thread to scan the directories.

My problem is a very large what-if: let's say the helper thread is scanning and comes across a very small .gif early on. The main thread hashes, copies, checks the hash, then what? Even if the main thread has a long list of files to copy over and catches up to the helper thread, there will be no more paths in the list, but the main thread won't know if the helper thread is done or if it's still scanning. Obviously it would have to wait, but how would it be smart enough to make sure it's not waiting for another path when the other thread is actually done? From the condition variable that Cubbi described, it looks like the thread would just wait until the condition is met. I don't want it to just wait.

Here's what I have in my head so far:
main() {
   spawn helper thread with scan function of start directory
   while helper thread isn't done // I have no clue how to check this

      wait for notification from helper thread that something has been added

      // may need to convert this to a while statement to be able to unlock/lock list
      for the number of paths in the list
         hash the path
         set destination path
         copy file to destination
         hash copied file

         if hashes don't match
            copy again
            if it doesn't copy correctly
               give up on this file // maybe add to a failed file list
}

helper thread() {
   // ideally, it would scan destination directory first to prevent potential problems
   scan directories recursively
   for directories // until end
      check files against desired extensions
      if file is desired
         lock path list
         add path to list
         unlock list
         notify main it can begin copying (if it's waiting)
   notify main there are no more files to be added // maybe a second CV?
}


Maybe that helped you understand what I meant a little better. Also, I'm not sure if you missed it, forgot, or it just doesn't matter, but I plan on having three separate folders depending on the media type. Is this just going to be handled by the generate_destination function? I figured it would be more beneficial to have three lists and three threads to process them. I suppose it doesn't matter either way, as there is plenty going on already.
Wouldn't this be slower than using boost?
Suppose that using Boost's copy method lets you copy a file at 200 MiB/s, while using standard library functions lets you copy at 100 MiB/s (this is an exaggerated example. In reality the difference won't be so big if you're doing things right). If you try to copy a 200 MiB file using the standard library (because you need to hash), it'll take 2 seconds. If you hash first (using the standard library) and then use Boost, it'll take 3 seconds. You can't use Boost and hash at the same time*.

I see what you mean and could just use my main thread (hashing, new destination, copying, checking hash, moving to the next one), and a helper thread to scan the directories.
Scanning should happen in the main thread, since it's an I/O-heavy operation. Putting it in a separate thread harms its own performance and copying performance by causing unnecessary seeks across the disk surface. It's actually much faster to interleave operations: find next file, copy file, find next file, copy file, etc.
Some devices behave even worse than HDDs when seeking. Optical disc units, for instance, make a hell of a lot of noise when moving their heads. It almost sounds like the drive is going to tear itself apart.
See also http://www.catb.org/jargon/html/W/walking-drives.html

My problem is a very large what-if: let's say the helper thread is scanning and comes across a very small .gif early on. The main thread hashes, copies, checks the hash, then what? Even if the main thread has a long list of files to copy over and catches up to the helper thread, there will be no more paths in the list, but the main thread won't know if the helper thread is done or if it's still scanning. Obviously it would have to wait, but how would it be smart enough to make sure it's not waiting for another path when the other thread is actually done? From the condition variable that Cubbi described, it looks like the thread would just wait until the condition is met. I don't want it to just wait.
Let's suppose for a moment that we have a program that behaves like you're describing up to that last sentence. What would you do during that wait or how would you eliminate it? Since this is an I/O-bound program, sooner or later, the CPU will have to wait while the device catches up. There's no way around this problem, since the CPU is much, much faster.
If you don't use threads, though, the OS will handle the waiting for you. For example, std::ifstream::read() won't return until it has filled its buffer with data from the disk. This is known as blocking I/O, and it exists precisely because there is generally not much to do until the data has arrived, so we may as well abstract the wait away.

Also, I'm not sure if you missed it, forgot, or it just doesn't matter, but I plan on having three separate folders depending on the media. Is this just going to be handled by the generate_destination function?
Yes.

I figured it would have been more beneficial to have three lists, three threads to process them.
What would you gain from this?


* Actually, you sort of can if you parallelize while assuming the OS keeps a disk cache in memory, but I don't even want to go there.
Suppose that using Boost's copy method lets you copy a file at 200 MiB/s, while using standard library functions lets you copy at 100 MiB/s (this is an exaggerated example. In reality the difference won't be so big if you're doing things right). If you try to copy a 200 MiB file using the standard library (because you need to hash), it'll take 2 seconds. If you hash first (using the standard library) and then use Boost, it'll take 3 seconds. You can't use Boost and hash at the same time*.

Maybe I'm confusing you, or more likely myself, but when I said I'd use boost::filesystem, I meant for locating files, directories, etc. I don't believe (and could be mistaken, as I usually am) there is any other option for that. Now, when I find a file, any file, I can use boost::filesystem to check the extension, or pass the path over to the standard library by converting it to a string and handle everything else there. I had intended on doing everything with boost::filesystem, but with the hash digest introduced, I believe I would need to open the files to create the hashes. It would also make more sense to copy the files this way.

Scanning should happen in the main thread, since it's an I/O-heavy operation.

When I said scanning, I didn't mean that I would physically be opening every file. I thought it was actually incredibly fast to scan the filesystem for all files/directories using boost::filesystem (relative to any other means unknown to me).

Let's suppose for a moment that we have a program that behaves like you're describing up to that last sentence. What would you do during that wait or how would you eliminate it? Since this is an I/O-bound program, sooner or later, the CPU will have to wait while the device catches up. There's no way around this problem, since the CPU is much, much faster.

When I wrote that, I meant that the scanning thread only checks extensions and adds paths to the list. Like I just said, I thought this would be a faster process than the read/write process of copying the files. I'm still having issues with the condition variables in my main thread (it will be better to wait until I have a complete understanding of each subject).

What would you gain from this?

I assumed having three threads to copy would be faster than one. I understand that on a single-core processor this would make it slower, but on a quad core, would it not be faster than having one thread?

I'm having a hard time understanding the hash table. I thought it was something derived from the file's stats, not the physical data itself. If I'm understanding it correctly, I'd open a file stream, read, let's say, 8 bytes' worth of data, use a hash function on them to generate a hash, then add it to a hash table. This is where I'm lost. What does a hash table do? What purpose does it serve? And is it physically stored in something like a vector/list? If the data is destroyed during the process, how do I ensure it copies correctly before destroying the data? Do I copy it into another fstream object and then pass it to the hash function?

I've been trying to look for complete examples on it so I can understand how it comes together. I appreciate the help so far, it's given me lots to think about.