How to compare two files and display their similarity percentage?

Help!
How can I compare two files (two text documents) and display their similarity percentage using C++?
1) Open files.
2) Iterate them both at the same time.
3) If the chars are the same, increment an are_similar counter.
4) If the chars are different, increment an are_different counter.
5) similarity_percentage = are_similar * 100 / (are_similar + are_different)

And here's some sample code to get you started. It wasn't tested.

#include <ios>
#include <iterator>
#include <fstream>
#include <string>

// Returns the percentage of byte positions at which the two files match.
// Bytes past the end of the shorter file count as different.
double percentage_similar(const std::string &filename_one, const std::string &filename_two)
{
    std::ifstream file_one(filename_one.c_str(), std::ios_base::binary);
    std::ifstream file_two(filename_two.c_str(), std::ios_base::binary);

    std::istreambuf_iterator<char> file_one_iter(file_one);
    std::istreambuf_iterator<char> file_two_iter(file_two);
    std::istreambuf_iterator<char> file_end;

    unsigned long are_similar = 0;
    unsigned long are_different = 0;

    while (file_one_iter != file_end || file_two_iter != file_end)
    {
        if (file_one_iter == file_end)
        {
            // special case: the first file is shorter
            ++are_different;
            ++file_two_iter;
        }
        else if (file_two_iter == file_end)
        {
            // special case: the second file is shorter
            ++are_different;
            ++file_one_iter;
        }
        else if (*file_one_iter++ == *file_two_iter++)
            ++are_similar;
        else
            ++are_different;
    }

    // Nothing was read from either file: avoid 0/0 and treat them as identical.
    if (are_similar + are_different == 0)
        return 100.0;

    return static_cast<double> (are_similar) * 100 / (are_similar + are_different);
}

@OP: you need to define a `distance'
Using Catfish3's definition, you would say that
_ abcdefghijklmnopqrstuvwxyz
_ zabcdefghijklmnopqrstuvwxy
are completely different
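To make that concrete, here is a minimal sketch of the same position-by-position counting applied to in-memory strings instead of files. The name positional_similarity is made up for this illustration, and returning 100 for two empty inputs is an assumption, not something the thread settled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Position-by-position similarity of two strings, as a percentage.
// Hypothetical helper: same counting scheme as the file-based sample,
// but working on strings so it is easy to try out.
double positional_similarity(const std::string &a, const std::string &b)
{
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0)
        return 100.0; // assumption: two empty inputs count as identical

    const std::size_t shortest = std::min(a.size(), b.size());
    std::size_t are_similar = 0;
    for (std::size_t i = 0; i < shortest; ++i)
        if (a[i] == b[i])
            ++are_similar;

    // Characters beyond the end of the shorter string count as different.
    return static_cast<double>(are_similar) * 100 / longest;
}
```

Called with the two alphabet strings above, no position holds the same letter in both, so the result is 0 even though the strings differ only by a one-character shift.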
@ne555: I am interested in more detail. Just saying "distance" is vague, to me at least.
Is it possible for XML files too?
Do you expect files to be identical and just want to count how many characters differ between them? Then go for something like Catfish3's suggestion.

If not, you need a far more sophisticated approach as ne555 suggested.

1) You will need a similarity measure and a score for characters. In your case this may be easy, e.g. give each pair of equivalent characters in the two files a score of +1 if both chars are equal and 0 if they differ. Depending on the task, this similarity measure may be too simple though (is an "E" as different from an "e" as from a "Y"?).
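As a rough sketch of such a scoring function (char_score is a made-up name, and ignoring letter case is just one possible answer to the "E" versus "e" question):

```cpp
#include <cctype>

// Hypothetical scoring function: 1 for a match, 0 for a mismatch.
// Assumption: letter case is ignored, so 'E' and 'e' score as equal.
int char_score(char a, char b)
{
    // std::tolower requires a value representable as unsigned char.
    const int la = std::tolower(static_cast<unsigned char>(a));
    const int lb = std::tolower(static_cast<unsigned char>(b));
    return la == lb ? 1 : 0;
}
```

A finer-grained measure could return fractional scores for "close" characters, but that depends entirely on the task.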


2) Because of possible insertions and deletions (the example ne555 gave contains both, or 1 "move"), the problem gets far more complicated because it is not at all obvious which characters in the two files are equivalent.

You may want to look at the UNIX diff command and the algorithms used in tools like it. Many version control systems like Git need to perform this task a lot.

I suggest you look up "Edit distance" on Google.
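For illustration, here is a minimal sketch of one common edit distance, the Levenshtein distance, computed with the usual dynamic programming table (kept to two rows). The function names are made up for this post, and turning the distance into a percentage by dividing by the longer length is only one possible convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b.
std::size_t edit_distance(const std::string &a, const std::string &b)
{
    // prev[j]: distance between the first i-1 chars of a and the first j chars of b.
    // curr[j]: the same for the first i chars of a.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j; // turning an empty prefix into j chars takes j insertions

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = i; // turning i chars into an empty prefix takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const std::size_t substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            const std::size_t deletion = prev[j] + 1;
            const std::size_t insertion = curr[j - 1] + 1;
            curr[j] = std::min({substitution, deletion, insertion});
        }
        prev.swap(curr);
    }
    return prev[b.size()];
}

// Assumed convention: similarity = 1 - distance / length of the longer input.
double edit_similarity(const std::string &a, const std::string &b)
{
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0)
        return 100.0; // assumption: two empty inputs count as identical
    return 100.0 * (longest - edit_distance(a, b)) / longest;
}
```

For the two shifted alphabets in ne555's example the distance is 2 (insert the leading z, delete the trailing z), so under this convention they come out as roughly 92% similar instead of 0%. Note the table costs time proportional to the product of the two lengths, which matters for large files.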

In Bioinformatics, a special version of your problem (where the 2 files being compared are actually DNA strands) is usually solved by dynamic programming, more specifically, the Needleman-Wunsch Algorithm.
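A minimal sketch of the scoring part of Needleman-Wunsch follows. It only computes the best global alignment score, not the alignment itself, and the default scores (+1 match, -1 mismatch, -1 gap) are assumptions chosen for illustration; real applications tune them.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Best global alignment score of a and b (Needleman-Wunsch, score only).
// The match/mismatch/gap values are illustrative defaults, not fixed by the algorithm.
int needleman_wunsch_score(const std::string &a, const std::string &b,
                           int match = 1, int mismatch = -1, int gap = -1)
{
    // prev[j]: best score aligning the first i-1 chars of a with the first j chars of b.
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<int>(j) * gap; // align j chars of b against gaps only

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = static_cast<int>(i) * gap; // align i chars of a against gaps only
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const int diagonal = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            const int gap_in_b = prev[j] + gap;     // a[i-1] aligned with a gap
            const int gap_in_a = curr[j - 1] + gap; // b[j-1] aligned with a gap
            curr[j] = std::max({diagonal, gap_in_b, gap_in_a});
        }
        prev.swap(curr);
    }
    return prev[b.size()];
}
```

With the default scores, "ACGT" against "AGT" gives 2: three matches and one gap. Recovering the actual alignment would need the full table plus a traceback step, which is left out here.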

Finally, in the case of XML, you may be better off using an XML parser first (unless you expect one of the files to be broken XML due to changes). I suggest you describe in more detail why you want to compare the files.
Okay. Well, I need a program that works like a plagiarism checker, but for only two files (can be either excel or any text documents). I need to compare the two files and display their similarity percentage. For example, if the first word of file1 is "Is" and the first word of file2 is "is", they should count as the same; the program should count the total number of matching words and display the percentage.
I would appreciate any help :)
Use getline and compare them line by line either by length of each line or by the similarities between words in each line.
Well, by using getline I would compare the whole sentence rather than each and every word. And one more thing, I'm completely inserting some random files.
(can be either excel or any text documents)

Excel, Word, PDF documents (to name a few) are not text documents!

Rule of thumb: if when you open it with Notepad, you see all kinds of crazy symbols, it's not a text document.

Well, by using getline I would compare the whole sentence rather than each and every word. And one more thing, I'm completely inserting some random files.

Not necessarily. You can pass the delimiter as space:
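std::ifstream file("input.txt");
std::string str;

// reads up to the next space (a newline is NOT treated as a delimiter here)
std::getline(file, str, ' ');

// or, a bit worse: read into a fixed-size buffer (fails on words of 64+ chars)
char buffer[64];
file.getline(buffer, sizeof buffer, ' ');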


I would appreciate any help :)

It is not my purpose here to bring you down, but it has to be said: to me it's clear you don't know what you're doing. You need to gain more knowledge before you can create this program.

And make no mistake, this program you want to create is not simple, which is why I doubt anyone here will give you a full solution.
Thank you very much.