Quick comparison of two files.

Hi all. Please advise on the fastest way to compare two files byte by byte.
File sizes vary from 1 byte to 2 GB.
Here is my current code:

//BUFFER_SIZE = 1 mb

bool FindDublicateFiles::isFilesEqual(const std::string& lFilePath, const std::string& rFilePath) const
{
    std::ifstream lFile(lFilePath.c_str(), std::ifstream::in | std::ifstream::binary);
    std::ifstream rFile(rFilePath.c_str(), std::ifstream::in | std::ifstream::binary);

    if(!lFile.is_open() || !rFile.is_open())
    {
        return false;
    }

    char *lBuffer = new char[BUFFER_SIZE]();
    char *rBuffer = new char[BUFFER_SIZE]();

    do {
        lFile.read(lBuffer, BUFFER_SIZE);
        rFile.read(rBuffer, BUFFER_SIZE);

        if (std::memcmp(lBuffer, rBuffer, BUFFER_SIZE) != 0)
        {
            delete[] lBuffer;
            delete[] rBuffer;
            return false;
        }
    } while (lFile.good() || rFile.good());

    delete[] lBuffer;
    delete[] rBuffer;
    return true;
}
As a quick test you could check if the 2 files have the same size.
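
One way to sketch that size check, assuming the same pre-C++11 style as the posted code. The helper names fileSize and haveSameSize are hypothetical and not part of the original class:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns the file size in bytes, or -1 if the file
// cannot be opened.
std::streamoff fileSize(const std::string& path)
{
    // std::ios::ate positions the stream at the end right after opening,
    // so tellg() reports the size without reading any data.
    std::ifstream file(path.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file.is_open())
        return -1;
    return static_cast<std::streamoff>(file.tellg());
}

// Files of different sizes cannot be byte-for-byte equal, so this cheap test
// can reject a pair before any content is read.
bool haveSameSize(const std::string& lFilePath, const std::string& rFilePath)
{
    const std::streamoff lSize = fileSize(lFilePath);
    const std::streamoff rSize = fileSize(rFilePath);
    return lSize >= 0 && lSize == rSize;
}
```

Only pairs that pass haveSameSize would then need the full byte-by-byte comparison.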

Lines 14/15 (the two new char[BUFFER_SIZE]() allocations): the trailing () zero-initializes every char in the buffers. That is unnecessary here and only costs time.

You could use a buffer bigger than 1MB.

If you call this function a lot, you could also allocate the buffers elsewhere and pass them to the function. This way you'd avoid a lot of allocation/deallocation cycles.
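
A minimal sketch of passing the buffers in, assuming a free function and caller-owned std::vector<char> buffers (the name filesEqual and its signature are hypothetical). It compares only the bytes reported by gcount(), so the buffers never need to be zeroed:

```cpp
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical free function (the original is a class member): the caller
// owns both buffers, so repeated calls reuse the same memory.
bool filesEqual(const std::string& lFilePath, const std::string& rFilePath,
                std::vector<char>& lBuffer, std::vector<char>& rBuffer)
{
    // Assumption of this sketch: both buffers are non-empty and equally sized.
    if (lBuffer.empty() || lBuffer.size() != rBuffer.size())
        return false;

    std::ifstream lFile(lFilePath.c_str(), std::ios::in | std::ios::binary);
    std::ifstream rFile(rFilePath.c_str(), std::ios::in | std::ios::binary);
    if (!lFile.is_open() || !rFile.is_open())
        return false;

    const std::streamsize chunk = static_cast<std::streamsize>(lBuffer.size());
    do {
        lFile.read(&lBuffer[0], chunk);
        rFile.read(&rBuffer[0], chunk);
        const std::streamsize lCount = lFile.gcount();
        const std::streamsize rCount = rFile.gcount();

        // Compare only the bytes actually read; stale bytes beyond gcount()
        // are never examined.
        if (lCount != rCount ||
            std::memcmp(&lBuffer[0], &rBuffer[0], static_cast<std::size_t>(lCount)) != 0)
            return false;
    } while (lFile.good() && rFile.good());

    return true;
}
```

The caller would allocate the two vectors once, for example with 1 MB each, and pass them to every call.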
Thank you for responding.

> Lines 14/15 (the two new char[BUFFER_SIZE]() allocations): the trailing () zero-initializes every char in the buffers. That is unnecessary here and only costs time.

If the file is smaller than the buffer, the rest of the buffer is filled with "garbage", and that affects the result. So the buffer has to be cleared, or only the bytes actually read should be compared:
do {
    lFile.read(lBuffer, BUFFER_SIZE);
    rFile.read(rBuffer, BUFFER_SIZE);
    numberOfRead = lFile.gcount(); // I only compare files of the same size

    if (std::memcmp(lBuffer, rBuffer, numberOfRead) != 0)
    {
        memset(lBuffer, 0, numberOfRead);
        memset(rBuffer, 0, numberOfRead);
        return false;
    }
} while (lFile.good() || rFile.good());


> You could use a buffer bigger than 1MB.

For small files, a bigger buffer increases the run time.
If you want to be efficient with file I/O, try memory mapping:

#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
namespace io = boost::iostreams;
int main()
{
    io::mapped_file_source f1("test.1");
    io::mapped_file_source f2("test.2");

    if(    f1.size() == f2.size()
        && std::equal(f1.data(), f1.data() + f1.size(), f2.data())
       )
        std::cout << "The files are equal\n";
    else
        std::cout << "The files are not equal\n";
}
Just for fun, I ran this on a few boxes. On Linux I compared two copies of the Intel Parallel Studio distribution (2,152,945,149 bytes each); on Sun and IBM, two copies of a binary of 898,215,121 bytes.

memory-mapped version (as posted by me)
gcc(linux)   0.92 s
intel(linux) 0.97 s
sun(sun)     4.04 s
xlc(ibm)     3.67 s

ifstream.read() version into a 1M buffer (as posted by seftoner)
gcc       1.80 s
intel     1.89 s
sun(sun) 14.5 s
xlc(ibm)  2.43 s

trivial I/O stream-based version
   if(std::equal(std::istreambuf_iterator<char>(f1), 
                 std::istreambuf_iterator<char>(), 
                 std::istreambuf_iterator<char>(f2)))

gcc(linux):    14.1 s
intel(linux):  35.2 s
sun(sun)       29.3 s
xlc(ibm)       27.1 s
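
For reference, the stream-based fragment above could be completed roughly as follows. This is a sketch, not the code that was timed: the function name streamsEqual is hypothetical, and it uses the four-iterator std::equal from C++14, because the three-iterator form assumes the second range is at least as long as the first:

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical wrapper around the iterator-based comparison.
bool streamsEqual(const std::string& lPath, const std::string& rPath)
{
    std::ifstream f1(lPath.c_str(), std::ios::in | std::ios::binary);
    std::ifstream f2(rPath.c_str(), std::ios::in | std::ios::binary);
    if (!f1.is_open() || !f2.is_open())
        return false;

    // A default-constructed istreambuf_iterator is the end-of-stream marker.
    // Passing both end markers makes std::equal return false when one file
    // is a proper prefix of the other.
    return std::equal(std::istreambuf_iterator<char>(f1),
                      std::istreambuf_iterator<char>(),
                      std::istreambuf_iterator<char>(f2),
                      std::istreambuf_iterator<char>());
}
```

It reads one character at a time through the stream buffer, which is consistent with it being the slowest of the three variants measured.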


Wow, that is fast! Why does my code run so slowly for me? For example, comparing two 650 MB files takes 40 seconds.

//bufferSize = 8 mb
{
    std::ifstream lFile(lFilePath.c_str(), std::ios::in | std::ios::binary);
    std::ifstream rFile(rFilePath.c_str(), std::ios::in | std::ios::binary);


    if(!lFile.good() || !rFile.good())
    {
        return false;
    }

    std::streamsize lReadBytesCount = 0;
    std::streamsize rReadBytesCount = 0;

    do {
        lFile.read(p_lBuffer, *bufferSize);
        rFile.read(p_rBuffer, *bufferSize);
        lReadBytesCount = lFile.gcount();
        rReadBytesCount = rFile.gcount();

        if (lReadBytesCount != rReadBytesCount || std::memcmp(p_lBuffer, p_rBuffer, lReadBytesCount) != 0)
        {
            return false;
        }
    } while (lFile.good() || rFile.good());

    return true;
}
> bool FindDublicateFiles::isFilesEqual(const std::string& lFilePath,
>                                       const std::string& rFilePath) const


If this is to be done many times for the same files:

1. Pre-compute and store a checksum (say MD5) for each large file (along with a timestamp).

2. If the file was not modified after the timestamp, compare the checksums first. Compare byte by byte only if the checksums and the file sizes match.
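
The two steps above could be sketched like this, assuming C++17 and an in-memory cache. The names Fingerprint, fingerprint and mayBeEqual are hypothetical, and FNV-1a is swapped in for MD5 only because the standard library has no MD5:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <string>

namespace fs = std::filesystem;

// Hypothetical cache record: what was known about a file when it was hashed.
struct Fingerprint {
    fs::file_time_type modified;
    std::uintmax_t size;
    std::uint64_t checksum;
};

// 64-bit FNV-1a over the file contents, a stand-in for MD5 in this sketch.
std::uint64_t fnv1a(const fs::path& path)
{
    std::ifstream file(path, std::ios::binary);
    std::uint64_t hash = 14695981039346656037ULL;
    char buffer[4096];
    // The final read is usually short and sets failbit, so gcount() decides
    // whether there is still data to hash.
    while (file.read(buffer, sizeof buffer) || file.gcount() > 0) {
        for (std::streamsize i = 0; i < file.gcount(); ++i) {
            hash ^= static_cast<unsigned char>(buffer[i]);
            hash *= 1099511628211ULL;
        }
    }
    return hash;
}

// Returns the cached fingerprint, recomputing it when the file's modification
// time differs from the stored one. Throws if the file does not exist.
Fingerprint fingerprint(const fs::path& path,
                        std::map<std::string, Fingerprint>& cache)
{
    const fs::file_time_type modified = fs::last_write_time(path);
    const std::string key = path.string();
    std::map<std::string, Fingerprint>::iterator it = cache.find(key);
    if (it == cache.end() || it->second.modified != modified) {
        const Fingerprint fresh = { modified, fs::file_size(path), fnv1a(path) };
        it = cache.insert_or_assign(key, fresh).first;
    }
    return it->second;
}

// false: the files certainly differ. true: sizes and checksums match, so a
// byte-by-byte comparison is still needed to confirm equality.
bool mayBeEqual(const fs::path& lhs, const fs::path& rhs,
                std::map<std::string, Fingerprint>& cache)
{
    const Fingerprint l = fingerprint(lhs, cache);
    const Fingerprint r = fingerprint(rhs, cache);
    return l.size == r.size && l.checksum == r.checksum;
}
```

A persistent version would store the cache on disk between runs; that part is left out here.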

That's what I do