Threading Advice

Maybe I'm confusing you, or more likely myself, but when I mentioned using boost::filesystem, I meant using it for locating files, directories, etc.
My reply was in response to this paragraph:
I don't plan on copying the files using the file stream as opposed to using boost::filesystem, which has its own copy function. My program doesn't open or read the file, per se, so I would only be opening it to hash it. Was your proposal based on manually copying the file from one fstream object to another? Wouldn't this be slower than using boost?


When I said scanning, I didn't mean that I would physically be opening every file. I thought it was actually incredibly fast to scan the filesystem for all files/directories using boost::filesystem (relative to any other means unknown to me).
I got what you meant. Nope. It's an I/O operation, and all I/O operations are slow as hell, unless the OS happens to have something in its disk cache (it almost never does).

Like I just said, I thought this would be a faster process than the read/write process of copying the files.
Reading one directory entry is faster than copying one file, yes. And? Threads still need to wait for operations to complete.

I assumed having three threads to copy would be faster than one? I understand on a single core processor, this would make it slower, but on a quad core, would it not be faster than having 1 thread?
Imagine this: in a supermarket there is one cashier who can serve 1 person per minute, and a bagger who can serve 1 person every 10 seconds. To optimize resources, would you hire three more baggers to serve the stream of people from that one cashier, or hire five more cashiers and have the one bagger cycle through them?
The same is happening here. The disk is your cashier and the CPU is your bagger. The bottleneck is the disk. Adding more threads doesn't increase performance (it hurts it, actually) because the CPU idles most of the time, anyway.
You could try adding more disks and setting up a RAID array, though.

I'm having a hard time understanding the hash table.
Who said anything about a hash table? A hash table is a kind of fast associative array, like an std::map. It has nothing to do with anything mentioned in this thread.

If I'm understanding it correctly, I'd open a file stream, read, let's say, 8 bytes worth of data, use a hash function on them to generate a hash
If you hash only 8 bytes, how will you know if the ninth byte gets corrupted? The entire file has to be hashed.
If you hash only 8 bytes, how will you know if the ninth byte gets corrupted? The entire file has to be hashed.

I was thinking that when you said hash digest, you meant hash table, which was why I was thinking small hashes, then adding them to a table. Now that I understand one hash per file, what would the hash look like? I thought all hashes were supposed to be the same size (not comparing to the same files, but as to the length of the hash itself).

Ok, so I open a desired file and read each byte, applying a hash function to each byte, then what? How is this more efficient than just copying, byte by byte, the entire file into a variable? Maybe I'm missing something; most of what I've read has dealt with small files or a specific string or set of numbers.

the CPU idles most of the time, anyway.

I forgot about this fact; it brings up a very good point and makes sense.

You could try adding more disks and setting up a RAID array, though.

Until RAID comes standard on mid-range personal computers, I will probably never have that set up.

Back to hashing: if I'm understanding correctly, I'd have a function that accepts a byte (let's say a char), hashes it (using bitwise operators seems to be the consensus so far), and returns a char. That seems like it would be slow in the long run, and some of the functions I've seen seem prone to two different bytes returning the same hash.

After each byte is hashed, does it get stored in a list?

#include <fstream>
#include <string>

using namespace std;

char HashFunction(const char singleByte) {
   // Just add the algorithm later
   char hash = singleByte;
   return hash;
}

string GenerateHash(const string& filePath) {
   ifstream curFile(filePath, ifstream::binary);
   string fileHash = "";

   // Check the stream after each get() so an EOF value never gets appended
   char byte;
   while (curFile.get(byte))
      fileHash += HashFunction(byte);

   curFile.close();

   return fileHash;
}

bool CopyFile(const string& fileOrigin, const string& fileDest) {
   ifstream fileToCopy(fileOrigin, ifstream::binary);
   ofstream fileToWrite(fileDest, ofstream::binary);
   string fileHash = GenerateHash(fileOrigin);

   // I believe the "GenerateHash" function should be integrated here to
   // reduce the I/O times.
   char byte;
   while (fileToCopy.get(byte))
      fileToWrite.put(byte);

   // Make sure the copy is flushed before hashing it
   fileToWrite.close();

   string fileHashFinal = GenerateHash(fileDest);

   if (fileHash != fileHashFinal)
      return false;

   return true;
}


Obviously that's really simple, but is that basically what you're suggesting? I might have some issues there with syntax or something, but I'm getting tired. I'm trying to understand everything, but want to make sure I'm on the right path so far.

Edit: I think I'm starting to realize what you suggested here... While reading up on CRC32, it seems it's ideal for buffering a file, typically while copying (I see what you did). What I've gotten so far is: you take, say, 16 bytes, hash them, copy them, check the hashes, and move to the next block. I'm still missing how hashing data is any better than just copying it, or is the trade-off worth the assurance that the data is getting copied correctly?

At this point, what is a significant chunk of data to process? Buffering doesn't necessarily mean writing that file, just that you've read it into the buffer. Also, I noticed that most hashes are small, while the data they represent can be rather large (or small). I understand the concept here, but I'm still looking at byte-by-byte reading. I assume the algorithm handles a significant chunk of data and returns a small hash to represent that chunk.

Let's say I have a 2 GB file and process 512 MB chunks: would I have a list of 4 hashes to represent that file, and then double-check after it's copied that all 4 hashes match up? What's the limit on what can be stored in the buffer (I assume it differs by the processor)? While generating the hash, everything is stored in the buffer, but to ensure the buffer isn't cleared, can you write a small section to a file? Do you read the file byte by byte, store it as a string or something else, and pass it, and its length, to the hash?

Sorry for all the questions, I'm just trying to get a solid grasp on this.
Ok, so I have some great news, at least it is for me. I believe I finally understand the hashing of files. I wrote this quick program just to test it out.
#include<list>          // used for list of hashes
#include<string>        // used for creating hashes
#include<chrono>        // used for timer
#include<fstream>       // used for files
#include<iostream>      // used for display
#include<functional>    // used for hash function

const signed MAX_BUFFER = 1024;

int main() {
    // Set up timer
    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::chrono::high_resolution_clock clock;
    auto start = clock.now();

    std::ifstream inFile("test.mp3", std::ifstream::binary);
    // fstream to read hashes later
    std::fstream outFile("output.mp3", std::fstream::out | std::fstream::in
                         | std::fstream::trunc | std::fstream::binary);

    if (!inFile.is_open() || !outFile.is_open())
        return 1;

    std::list<std::size_t> hashList;
    std::hash<std::string> hashFunc;
    char buffer[MAX_BUFFER];

    while (inFile) {
        inFile.read(buffer, MAX_BUFFER);
        // Hash only the bytes actually read; binary data can contain '\0',
        // so the length has to be passed explicitly
        hashList.push_back(hashFunc(std::string(buffer, inFile.gcount())));
        outFile.write(buffer, inFile.gcount());
    }
    outFile.seekg(0);
    outFile.clear();
    while (outFile) {
        outFile.read(buffer, MAX_BUFFER);
        if (!hashList.empty()
            && hashList.front() == hashFunc(std::string(buffer, outFile.gcount())))
            hashList.pop_front();
        else {
            std::cout << "Epic File Copy Fail! Quitting!";
            return 2;
        }
    }

    inFile.close();
    outFile.close();

    auto end = clock.now();
    std::cout << "\n\nFile copied in "
              << duration_cast<milliseconds>(end - start).count()
              << "ms.";
    return 0;
}

I average between 1 and 1.5 seconds on a 10.3 MB audio file (handiest thing I had available), which I don't think is too bad on my little netbook. Threads are my next task (I honestly thought this was so much harder than it really was), but first, is there any way to know how many hashes I will need? I thought about using fstream.seekg() to get to the end and calculating from that, but on a large file, that could be absolutely massive. I would still like to have an idea of a solid buffer size, since 1024 seems relatively small compared to the sizes of the files (I think larger buffers help decrease the time it takes to process a file).

Edit: Bumping the buffer up to 8192 is roughly 6x faster.
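On the question above about knowing how many hashes will be needed: the file size can be read up front (with seekg/tellg, or by opening with std::ios::ate as the later version of the program does), and the number of chunk hashes is just that size divided by the buffer size, rounded up. A minimal sketch, with a placeholder file name and an assumed 1 MiB buffer:

#include <fstream>
#include <iostream>

int main() {
    const std::streamoff chunkSize = 1 << 20;                       // assumed 1 MiB buffer

    // "test.avi" is just a placeholder file name
    std::ifstream in("test.avi", std::ios::binary | std::ios::ate); // open positioned at the end
    if (!in.is_open())
        return 1;

    const std::streamoff fileSize = in.tellg();                     // current position == file size
    in.seekg(0);                                                    // rewind before the real reads

    // Round up so a trailing partial chunk still gets its own hash
    const std::streamoff hashCount = (fileSize + chunkSize - 1) / chunkSize;

    std::cout << fileSize << " bytes -> " << hashCount << " chunk hashes\n";
    return 0;
}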
CRC32 is not good for hashing. It's relatively easy to find collisions for it by accident.

Generate one digest per file. It's generally of no use to know which parts of a backed up file are corrupted.

At this point, what is a significant chunk of data to process?
Between 4 KiB and 1 MiB. 4 KiB is considered the minimum buffer you should read from a file (unless EOF is closer, obviously).
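As a rough illustration of that range, here is a sketch of a read loop using a 64 KiB buffer, where gcount() handles the final, shorter read (the file name and buffer size are arbitrary):

#include <fstream>
#include <iostream>
#include <vector>

int main() {
    const std::size_t bufferSize = 64 * 1024;          // 64 KiB, inside the 4 KiB - 1 MiB range
    std::vector<char> buffer(bufferSize);

    std::ifstream in("test.mp3", std::ios::binary);    // placeholder file name
    if (!in.is_open())
        return 1;

    std::streamoff total = 0;
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0) {
        // gcount() is the number of bytes actually read;
        // the last chunk is usually shorter than the buffer
        total += in.gcount();
    }
    std::cout << "Read " << total << " bytes\n";
    return 0;
}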
From my last post, you can see that I just used the C++ hash function. What algorithm does it use? Our reference didn't specify. Is it safe to work with?

I really need to start just sitting down and attempting this stuff by myself. I ran into a hiccup a little while ago when considering a buffer. I started with SIZE_MAX / CHAR_BIT; that was obviously too small. Then I tried UCHAR_MAX, but it kept crashing on me. I tried UINT_MAX and thought I had found a good way to calculate the buffer on any system, then I realized it translated to 1/1 on my computer. I ended up just doing 1 MB.

I've also figured out most of the questions I've asked you thus far. One thing that I noticed when I first started playing around was that I ran out of room on my HDD and caused the copy to fail (small HDD). But I was wondering if it was realistic, on a more optimized machine, to make a map of hash digests and check for duplicate files. My only fear is that, if I have, say, 10 1.5 GB movies on my HDD, that would create a lot of hashes just for the movies. Is there anything wrong with converting an entire hash digest into one hash? Is it possible or even worth it? (I'm not going to tempt fate on that one, something may blow up on my computer)

Anyways, here is my current, and almost final, single file hashing. I really need to get into threads now that I have this pretty much down (a better hash function may be in order down the road):
#include<list>          // used for list of hashes
#include<string>        // used for creating hashes
#include<chrono>        // used for timer
#include<sstream>       // used for progress display
#include<fstream>       // used for files
#include<iomanip>       // used for decimal precision
#include<iostream>      // used for display
#include<functional>    // used for hash function
#include<vector>        // used for the read buffer

using std::chrono::duration_cast;
using std::chrono::milliseconds;

const size_t MAX_BUFFER = 1048576; // 1 MB (1024 * 1024)
const char SOLID = 219;
const char HALF = 221;
const char EMPTY = 249;

float DisplayProgress(const std::streampos&, const std::streampos&, const float&);

int main() {
    // Set precision of floating numbers
    std::cout << std::fixed << std::setprecision(2);
    // Set up timer
    std::chrono::high_resolution_clock clock;
    auto start = clock.now();

    std::ifstream inFile("test.avi", std::fstream::ate | std::ifstream::binary);
    // fstream to read hashes later
    std::fstream outFile("output.avi", std::fstream::out | std::fstream::in
                         | std::fstream::trunc | std::fstream::binary);

    if (!inFile.is_open() || !outFile.is_open()) {
        std::cout << "File can't be opened or created! Quitting!";
        return 1;
    }

    // Get file size
    std::streampos fileSize = inFile.tellg();
    inFile.seekg(0);
    std::streampos processed = inFile.tellg();

    // Variables for hashing
    std::list<std::size_t> hashList;
    std::hash<std::string> hashFunc;
    std::vector<char> buffer(MAX_BUFFER); // heap-allocated; a 1 MiB array is too large for the stack
    float prevPercent = 0;

    // Create hashes and copy file
    while (inFile) {
        prevPercent = DisplayProgress(processed, fileSize, prevPercent);
        inFile.read(buffer.data(), MAX_BUFFER);
        // Hash only the bytes actually read; binary data can contain '\0'
        hashList.push_back(hashFunc(std::string(buffer.data(), inFile.gcount())));
        processed += inFile.gcount();
        outFile.write(buffer.data(), inFile.gcount());
    }

    DisplayProgress(processed, fileSize, prevPercent);

    // Set pointer back to start
    outFile.seekg(0);
    outFile.clear();

    // Check the hashes
    while (outFile) {
        outFile.read(buffer.data(), MAX_BUFFER);
        if (!hashList.empty()
            && hashList.front() == hashFunc(std::string(buffer.data(), outFile.gcount())))
            hashList.pop_front();
        else {
            std::cout << "Epic File Copy Fail! Quitting!";
            return 2;
        }
    }

    // Free resources
    inFile.close();
    outFile.close();

    // Display Stats
    auto end = clock.now();
    std::cout << "\n\nFile copied in "
              << duration_cast<milliseconds>(end - start).count()
              << "ms.";
    return 0;
}

float DisplayProgress(const std::streampos& processed, const std::streampos& total,
                      const float& prevPercent) {
    const float percent = (((processed * 1.0) / total) * 100.0);
    if ((static_cast<int>(percent * 10) <= static_cast<int>(prevPercent * 10))
         && (static_cast<int>(percent) != 0))
        return percent;

    // Set maximum allowed widths
    const size_t progressLen = 50;
    const size_t valueLen = 6;
    const size_t percentLen = 6;
    const size_t dataLen = 2;

    std::ostringstream display;

    unsigned solidLen = static_cast<unsigned>(progressLen * (percent / 100.0));
    display << std::setfill(SOLID) << std::setw(solidLen / 2) << "";

    if (solidLen % 2) {
        display << std::setfill(HALF) << std::setw (1) << "";
        solidLen ++;
    }

    if (solidLen < progressLen)
        display << std::setfill(EMPTY) << std::setw((progressLen - solidLen) / 2) << "";

    display << std::setfill(' ') << " ";

    display << std::setw(percentLen) << std::setprecision(2) << std::fixed
            << percent << "%";

    std::cout << std::setfill('\b') << std::setw(display.str().length()) << "";
    std::cout << display.str() << std::flush;

    return percent;
}

████████████∙∙∙∙∙∙∙∙∙∙∙∙∙  48.04%
█████████████████████████ 100.00%

108730.34MB file copied in 370296ms.
Process returned 0 (0x0)   execution time : 370.469 s
Press any key to continue.


The progress bar was cire's idea from my other thread, I just tweaked it a little more to my liking. If you have any suggestions on a way to improve on the I/O portion, or the hashing, I'd love to hear it. And thank you again for your tolerance.
From my last post, you can see that I just used the C++ hash function. What algorithm does it use? Our reference didn't specify. Is it safe to work with?
It's probably designed for building hash tables, not for data integrity checks.
You can find a simple and easy to use C++ implementation of SHA-1 by Paul E. Jones here:
http://tools.ietf.org/html/rfc3174

I was wondering if it was realistic, on a more optimized machine, to make a map of hash digests and check for duplicate files.
Yes, but in my experience it's often of little use. Hashing will only detect exact duplicates, not perceptual duplicates (e.g. a lossily recompressed version of the same picture).

Is there anything wrong with converting an entire hash digest into one hash?
Sorry, but you'll need to clarify your terminology. What exactly is the difference between "hash" and "hash digest"?

I'm not going to tempt fate on that one, something may blow up on my computer
???
Sorry, but you'll need to clarify your terminology. What exactly is the difference between "hash" and "hash digest"?

Oh, I'm sorry, I meant a hash as in one block of data, or every time I use the hash function, I receive a hash. By hash digest, I was referring to the completed file as a whole, all of the individual hashes. Basically, I was wondering if you could essentially use the hash function on the collection of hashes a second time, or multiple times, to have just like one large one. The more I think about it, the less and less it's really needed.

My computer is slow, not only at reading and writing, but also when it comes to processing power. I'm willing to try certain things on my own, but I'm not sure my computer could handle a large, I/O-intensive, long-running program. I already feel sorry for making it copy Wreck-It Ralph about 20 times to test my program.

Yes, but in my experience it's often of little use. Hashing will only detect exact duplicates, not perceptual duplicates (e.g. a lossily recompressed version of the same picture).

And it's pretty much impossible, using hashes, to check for similar duplicates, correct?

When I first asked you about using MD5, you said that was more for cryptography. Aren't SHA algorithms also for crypto programs? I understand the hashes would mean something then, if I ever needed to share them across other computers, etc. I'll definitely be looking into them, though; I don't think I could even create something 1/10 as good as CRC, so for now, other hash functions are fine with me.
Basically, I was wondering if you could essentially use the hash function on the collection of hashes a second time, or multiple times, to have just like one large one.
Yes, but it's not such a great idea. If you want just a single hash, then hash the file as a whole. Hash function implementations let you do that: you can pass them separate buffers in consecutive calls, and they'll be fed into the function as though you had passed them in a single call.
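To make the "consecutive calls" point concrete, here is a sketch of producing one digest for a whole file by feeding it to the hash in chunks. It assumes the SHA-1 sample code from RFC 3174 linked above (SHA1Context, SHA1Reset, SHA1Input, SHA1Result, SHA1HashSize, and "sha1.h" as the header name); any hash implementation with an init/update/finish interface works the same way:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#include "sha1.h"   // RFC 3174 sample implementation (assumed header name)

// One digest for the whole file, built from consecutive buffers.
bool HashWholeFile(const std::string& path, uint8_t digest[SHA1HashSize]) {
    std::ifstream in(path, std::ios::binary);
    if (!in.is_open())
        return false;

    SHA1Context ctx;
    SHA1Reset(&ctx);                                        // initialize

    std::vector<char> buffer(64 * 1024);
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0)
        SHA1Input(&ctx,                                     // update with each chunk read
                  reinterpret_cast<const uint8_t*>(buffer.data()),
                  static_cast<unsigned>(in.gcount()));

    return SHA1Result(&ctx, digest) == 0;                   // finish: 20-byte digest
}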

I'm not sure my computer could handle a large I/O intensive long running program
Trust me, you're not capable of giving your computer enough work to break it. Don't be afraid to experiment.

And it's pretty much impossible, using hashes, to check for similar duplicates, correct?
Well, no. There is such a thing as perceptual hashes, which map, for example, images to binary strings in such a way that the Hamming distance between hashes of perceptually similar images is small. Most hash functions are not perceptual, though.
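For what it's worth, comparing two such hashes is cheap: assuming 64-bit perceptual hashes (the values below are made up), the Hamming distance is just a popcount of their XOR, and a small distance means "probably similar":

#include <bitset>
#include <cstdint>
#include <iostream>

// Hamming distance between two 64-bit hashes: the number of differing bits
unsigned HammingDistance(std::uint64_t a, std::uint64_t b) {
    return static_cast<unsigned>(std::bitset<64>(a ^ b).count());
}

int main() {
    // Hypothetical perceptual hashes of two images; real values would come
    // from a perceptual hashing algorithm, not from std::hash
    std::uint64_t imageA = 0xF0F0F0F0F0F0F0F0ULL;
    std::uint64_t imageB = 0xF0F0F0F0F0F0F0F1ULL;

    std::cout << "Distance: " << HammingDistance(imageA, imageB) << "\n";   // prints 1
    return 0;
}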

When I first asked you about using MD5, you said that was more for cryptography.
I said it was proscribed (i.e. no longer recommended) for cryptographic applications.
Cryptographic hash functions work great for data integrity checks because they're designed to make it hard to generate collisions.
A thread is simply a place where code runs; every program has at least one thread, even if it just inherits it from the parent process.
The OS switches the threads running on a processor. If no other runnable thread (i.e. one not in a join/sleep/wait state) needs to be switched in, the original thread keeps running on that processor. A thread switch costs resources, since the OS needs to 1. select a thread and 2. save the state (registers and so on); that cost is the "system time" on Linux or "kernel time" on Windows.
Based on this, one thread per processor core is the ideal; more threads just spend more kernel time on switching. So if you are working on a 4-core i7 processor, to fully use the processor you should have exactly 4 threads.
But yes, that is only the ideal: even Windows itself has over a hundred threads in its kernel and services. And before epoll existed on Linux, people even used threads to do synchronous I/O, meaning one thread per connection, which is a total waste of processor resources. To give a number from my own experience: with 400 threads doing synchronous I/O on a Linux 2.6 kernel with 24 cores, total processor usage was around 50%, and kernel time was around 15% or even higher.

A lot of talented people are trying to make thread-switching logic more efficient, but we really shouldn't rely entirely on the OS to handle it, since the OS has very limited information about what a thread will do next. A thread may only need to check whether its I/O has finished and then sleep again, which takes a few processor ticks, but the OS may need hundreds of ticks to do the switch.
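As a small illustration of the "one thread per core" rule in standard C++: std::thread::hardware_concurrency() reports the number of hardware threads (it can return 0 when the count is unknown), and a worker pool of that size keeps every core busy without extra switching. The work inside the lambda here is only a placeholder:

#include <iostream>
#include <thread>
#include <vector>

int main() {
    // One worker per hardware thread; fall back to 1 if the count is unknown
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0)
        workers = 1;

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([i] {
            // Placeholder for real work (e.g. hashing one chunk per thread)
            std::cout << "worker " << i << " running\n";
        });

    for (auto& t : pool)
        t.join();

    return 0;
}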