If you hash only 8 bytes, how will you know if the ninth byte gets corrupted? The entire file has to be hashed. |
I was thinking that when you said hash digest, you meant hash table, which was why I was thinking of small hashes and then adding them to a table. Now that I understand it's one hash per file, what would the hash look like? I thought all hashes were supposed to be the same size (not the same across different files, but the same length of the hash itself).
Ok, so I open the desired file and read each byte, applying a hash function to each one, but then what? How is this more efficient than just copying the entire file, byte by byte, into a variable? Maybe I'm missing something; most of what I've read has dealt with small files or a specific string or set of numbers.
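To answer my own question a bit: from what I've read, the digest isn't the file's bytes at all, it's a small fixed-size accumulator that every byte gets folded into. A minimal sketch using FNV-1a (just one well-known example, there are plenty of others):

```cpp
#include <cstdint>
#include <string>

// FNV-1a: a simple, well-known non-cryptographic hash. Every input,
// no matter how long, folds into the same 32-bit state -- which is
// why all the digests come out the same size.
uint32_t Fnv1a32(const std::string& data) {
    uint32_t hash = 2166136261u;           // FNV offset basis
    for (unsigned char byte : data) {
        hash ^= byte;                      // mix the byte in...
        hash *= 16777619u;                 // ...then scramble (FNV prime)
    }
    return hash;                           // always exactly 4 bytes
}
```

So the memory cost stays at 4 bytes no matter how big the file is, unlike copying the whole file into a variable, which needs as much memory as the file itself.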
the CPU idles most of the time, anyway. |
I forgot about this fact; it brings up a very good point and makes sense.
You could try adding more disks and setting up a RAID array, though. |
Until RAID comes standard on mid-range personal computers, I will probably never have that setup.
Back to hashing: if I've understood correctly, you have a function that accepts a byte (say, a char), hashes it (bitwise operators seem to be the consensus so far), and returns a char. That seems like it would be slower in the long run, and some of the functions I've seen seem prone to two different bytes returning the same hash.
After each byte is hashed, does it get stored in a list?
#include <fstream>
#include <string>
using namespace std;

char HashFunction(const char singleByte) {
    // Just add the algorithm later
    char hash = singleByte;
    return hash;
}

string GenerateHash(const string& filePath) {
    ifstream curFile(filePath, ios::binary);
    string fileHash = "";
    char byte;
    // Check the read itself, not eof(): eof() only turns true *after*
    // a read fails, so an eof() loop processes one bogus extra byte.
    while (curFile.get(byte))
        fileHash += HashFunction(byte);
    curFile.close();
    return fileHash;
}

bool CopyFile(const string& fileOrigin, const string& fileDest) {
    ifstream fileToCopy(fileOrigin, ios::binary);  // was "filePath", which isn't declared here
    ofstream fileToWrite(fileDest, ios::binary);
    string fileHash = GenerateHash(fileOrigin);
    // I believe the "GenerateHash" function should be integrated here to
    // reduce the I/O times.
    char byte;
    while (fileToCopy.get(byte))
        fileToWrite.put(byte);
    fileToWrite.close();                           // flush before hashing the new copy
    string fileHashFinal = GenerateHash(fileDest);
    return fileHash == fileHashFinal;
}
Obviously that's really simple, but is that basically what you're suggesting? I might have some issues there with syntax or something, but I'm getting tired. I'm trying to understand everything, but want to make sure I'm on the right path so far.
Edit: I think I'm starting to realize what you suggested here. While reading up on CRC32, it seems ideal for checksumming a file while it's buffered, typically during a copy (I see what you did). What I've gotten so far: take, say, 16 bytes, hash them, copy them, check the hashes, move to the next block. I'm still missing how hashing data is any better than just copying it, or is the trade-off worth the assurance that the data is getting copied correctly?
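So if I've got the block idea right, it might look something like this. (The names `Checksum` and `CopyWithChecksums` are mine, and the checksum here is a trivial stand-in; a real version would use actual CRC32.)

```cpp
#include <cstdint>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Trivial stand-in checksum (a real implementation would use CRC32).
uint32_t Checksum(const char* data, std::size_t len) {
    uint32_t sum = 0;
    for (std::size_t i = 0; i < len; ++i)
        sum = sum * 31 + static_cast<unsigned char>(data[i]);
    return sum;
}

// Copy in fixed-size blocks; checksum each block as it passes through,
// so the source is only read once (the hash rides along with the copy).
bool CopyWithChecksums(const std::string& src, const std::string& dst,
                       std::vector<uint32_t>& sums, std::size_t blockSize = 16) {
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    if (!in || !out) return false;
    std::vector<char> buf(blockSize);
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        std::streamsize n = in.gcount();   // last block may be short
        sums.push_back(Checksum(buf.data(), static_cast<std::size_t>(n)));
        out.write(buf.data(), n);
    }
    return true;
}
```

The copy itself costs the same either way; the checksum is cheap extra CPU work on data that's already sitting in the buffer, and it's what tells you the copy actually landed intact.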
At this point, what is a significant chunk of data to process? Buffering doesn't necessarily mean writing that file, just that you've read it into the buffer. Also, I noticed that most hashes are small while the data they represent can be rather large (or small). I understand the concept, but I'm looking at byte-by-byte reading. I assume the algorithm handles a significant chunk of data and returns a small hash to represent that chunk.
Let's say I have a 2GB file and process it in 512MB chunks: would I have a list of 4 hashes to represent that file, and then double-check after it's copied that all 4 hashes match up? What's the limit on what can be stored in the buffer (I assume it differs by processor)? While generating the hash, everything is stored in the buffer, but to ensure the buffer isn't cleared, can you write a small section to a file? Do you read the file byte by byte, store it as a string or something else, and pass it, along with its length, to the hash?
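Trying to answer my own chunk question with a sketch: read the file through a fixed-size buffer and hand the hash the buffer plus its length, one small hash per chunk. (`ChunkHashes` is a name I made up; the per-chunk hash is FNV-1a again just as an example.)

```cpp
#include <cstdint>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hash a file in fixed-size chunks, one small hash per chunk -- so a
// 2GB file at 512MB chunks would yield a list of 4 entries.
std::vector<uint32_t> ChunkHashes(const std::string& path, std::size_t chunkSize) {
    std::vector<uint32_t> hashes;
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(chunkSize);
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        std::streamsize n = in.gcount();     // the buffer and its length
        uint32_t h = 2166136261u;            // are all the hash needs to see
        for (std::streamsize i = 0; i < n; ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 16777619u;                  // FNV-1a over this chunk only
        }
        hashes.push_back(h);
    }
    return hashes;
}
```

Then verifying the copy would just be comparing `ChunkHashes(source, size)` against `ChunkHashes(destination, size)`, and the buffer never needs to hold more than one chunk at a time.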
Sorry for all the questions, I'm just trying to get a solid grasp on this.
8/29/2013 3:12AM EST