Checking for duplicates in a txt file?

Here is my task:
Your task is to write a program that reads a file and prints all lines that contain a repeated word (such as an accidental “the the”), together with their line numbers.

I thought of using getline to get each line and increase the counter by one. Problem is, I have no way of checking the string getline saves to for duplicates. Does anyone here know a better way to do this?
	ifstream in_file;
	ofstream out_file;
	int counter = 0;
	string line;

	out_file.open("q3.txt");
	out_file << "Did you know this this message contains lots of redundant" << endl;
	out_file << "code code that does not make sense" << endl;
	out_file << "This is the only non-duplicated line" << endl;
	out_file.close();

	in_file.open("q3.txt");
	
	while (!in_file.eof())
	{
		getline(in_file, line);
		counter++;
	}
Your logic needs to go in after the getline call (line 16 of your snippet). You also don't need a counter to count the lines in the file, do you?

Read the line into a string like you do.
Then parse this line, adding each word to an array or vector.
Then iterate over this vector, comparing element i+1 with element i. If they are the same, you know this line contains a repeated word.
Thank you for this advice. But if I don't use a counter, how do I get which line number the loop is on?
Apologies, I didn't read your task properly.
You still want to do what I said, but when you find a duplicate word (using the logic I described), add the line number to an array of integers. That way, once you are out of your while loop, the array will contain all of the relevant line numbers.

Edit: I re-read it again:
"...that reads a file and prints all lines"


You need to keep an array of strings. When you get a line that contains duplicate words, add the whole line (i.e. your line variable) to your string array/vector. You can concatenate the line number to this string before adding it to the vector.

Use std::to_string:
http://en.cppreference.com/w/cpp/string/basic_string/to_string

on your line number before trying to concatenate it to your line.
you need to keep an array of strings.

I don't see a need to store any strings in arrays or vectors? Or integers, for that matter.

The problem statement just requires the lines (containing duplicate words) to be printed with their number. In fact, as there's no requirement (as stated) to track all the duplicates, you only need to find the first duplicate.

	// You shouldn't use eof() to control the read loop!
	// getline() returns a reference to the istream which will evaluate
	// to false if there's an error or the stream runs out of data to read
	while (getline(in_file, line))
	{
		counter++;
		if(hasDuplicates(line)) // or whatever
		{
			cout << counter << " : " << line << endl;
		}
	}


The "trivial" part is coding bool hasDuplicates(const string& line); :-)

You split the line up into words, checking whether each new word is the same as the previous one. So you just need to keep track of what the last word was.

Are you familiar with istringstream? (Probably the easiest route.)
http://www.cplusplus.com/reference/sstream/istringstream/

Or you could use string's find, find_first_of, substr, etc. functions to achieve the same aim.
http://www.cplusplus.com/reference/string/string/

Andy
Thank you for both your inputs, Andy and Mut. This is very helpful information :)

I think mut wanted me to use a vector to split the string line into words and store them in it so that it would be easy to check via index. It seems like a good idea and pretty simple for me.

Definitely appreciate the extra input you have on this though @andywestken. I will research into the functions you listed. :)
I don't see a need to store any strings in arrays or vectors?

Yeah, you're right, sorry. I'm just used to collecting the info I need into collections to use later on, but yep, it's not needed here at all. OP: sorry for confusing matters.
No need to say sorry, I should be the one saying thank you. I actually find the array method seemingly easier than researching a new function, but I will look into both.
I actually find the array method seemingly easier...

The work you need to do to split the string (line) up into words is the same whichever approach you choose.

If you follow the approach I suggested you just keep an eye on the last word you extracted from the string and, if it's the same as the latest one, you've got a duplicate. It should be a minor change to the code that tokenizes the line into words.

If you populate an array with the values and then search for double words, you're going to end up writing more code than is actually needed here, and getting the computer to do more work than it needs to.

e.g. for the line "code code that does not make sense" the code can stop as soon as the first two "code" and "code" tokens/words have been extracted from the string, rather than split the whole line and then find the duplicates as a second step.

Of course, tokenizing the sentence into an array (or vector) of words would make sense if you wanted to do various tests on the same sequence of words.

Andy
Hi andy, I looked up stringstream like you said and wrote this function:

bool hasDuplicates(string line)
{
	stringstream ss(line);
	string token = "";
	string latestToken = "";
	bool duplicateFound = false;
	
	while (getline(ss, token, ' '))
	{
		if (token == latestToken)
		{
			duplicateFound = true;
		}

		latestToken = token;
	}

	return duplicateFound;
}

Any improvements you can suggest?
Well, it's pretty much there, but...

Using an istringstream as a pretend file for test purposes (see code following), I get the following with your hasDuplicates() function:

1 : Did you know this this message contains lots of redundant
2 : code code that does not make sense
4 : All bar two lines (with with text) have have duplicates
5 : in in them them them!
9 : This has   no duplicates   either but does have  extra  spaces
11 : The following following line is just just spaces!
12 :
13 : That's all all all, folks!


where I've tweaked your example test file data a bit, adding empty lines and a spaces-only line, plus a line with no duplicates but extra spaces.

#include <iostream>
#include <sstream>
#include <string>
using namespace std;

bool hasDuplicates(string line)
{
    stringstream ss(line);
    string token = "";
    string latestToken = "";
    bool duplicateFound = false;
    
    while (getline(ss, token, ' '))
    {
        if (token == latestToken)
        {
            duplicateFound = true;
        }

        latestToken = token;
    }

    return duplicateFound;
}

int main()
{
    // Pretend file data - "q3.txt"
    const char testData[] =
    "Did you know this this message contains lots of redundant\n"
    "code code that does not make sense\n"
    "\n"
    "All bar two lines (with with text) have have duplicates\n"
    "in in them them them!\n"
    "\n"
    "This is one of the two lines with no-duplicates\n"
    "\n"
    "This has   no duplicates   either but does have  extra  spaces\n"
    "\n"
    "The following following line is just just spaces!\n"
    "      \n"
    "That's all all all, folks!\n";

    istringstream in_file(testData); // pretend file!
    // cf. ifstream in_file("q3.txt");

    int counter = 0;
    string line;
    while (getline(in_file, line))
    {
        counter++;
        if(hasDuplicates(line)) // or whatever
        {
            cout << counter << " : " << line << endl;
        }
    }

    return 0;
}


The problem is your use of getline, which isn't the right choice here. Below is the output of a program which (1) uses your approach to tokenize a string on spaces and (2) does the same thing using operator>>, followed by the program itself.

input: "one two  three   four    five     spaces!"

test_getline
1 = one
2 = two
3 =
4 = three
5 =
6 =
7 = four
8 =
9 =
10 =
11 = five
12 =
13 =
14 =
15 =
16 = spaces!

test_extract_op
1 = one
2 = two
3 = three
4 = four
5 = five
6 = spaces!


#include <iostream>
#include <sstream>
#include <string>
using namespace std;

void test_getline(string line)
{
    cout << "test_getline" << endl;
    stringstream ss(line);
    string token = "";
    int counter = 0;
    while (getline(ss, token, ' '))
    {
        ++counter;
        cout << counter << " = " << token << endl;
    }
    cout << endl;
}

void test_extract_op(string line)
{
    cout << "test_extract_op" << endl;
    stringstream ss(line);
    string token = "";
    int counter = 0;
    while (ss >> token)
    {
        ++counter;
        cout << counter << " = " << token << endl;
    }
    cout << endl;
}

int main()
{
    char test_data[] = "one two  three   four    five     spaces!";

    cout << "input: \"" << test_data << "\"" << endl;
    cout << endl;

    test_getline(test_data);
    test_extract_op(test_data);

    return 0;
}


You can see that operator>> is a better fit here than getline (which returns a "token" for each and every space it finds).

So an improved version of your function would be:

bool hasDuplicates(string line)
{
	stringstream ss(line);
	string token = "";
	string latestToken = "";
	bool duplicateFound = false;
	
	while (ss >> token) // now using extraction op
	{
		if (token == latestToken)
		{
			duplicateFound = true;
		}

		latestToken = token;
	}

	return duplicateFound;
}


(You could fix the getline version by (1) ignoring empty tokens (using string::empty()) or (2) eating the whitespace (ss >> ws), but these approaches don't make sense when you can just use the extraction operator.)

So using ss >> token rather than getline(ss, token, ' ') will get your function to behave as required.

To be continued...

Andy
Any improvements you can suggest?

Well...

bool hasDuplicates(const string& line) // pass by const ref
{
    stringstream ss(line);
    string token; // std::string inits itself
    string latestToken;
    bool duplicateFound = false;

    while (ss >> token) // now using extraction op
    {
        if (token == latestToken)
        {
            duplicateFound = true;
            break; // break out of loop as soon as first duplicate found
        }

        latestToken = token;
    }

    return duplicateFound;
}


Or preferably (to me), eliminate the temporary duplicateFound by returning out of the loop as soon as a duplicate is found (multiple returns all over the place are bad, but that's not the case here).

bool hasDuplicates(const string& line) // pass by const ref
{
    stringstream ss(line);
    string token; // std::string inits itself
    string latestToken;

    while (ss >> token) // now using extraction op
    {
        if (token == latestToken)
        {
            return true; // return as soon as first duplicate found
        }

        latestToken = token;
    }

    return false;
}


Program output now! :-)

1 : Did you know this this message contains lots of redundant
2 : code code that does not make sense
4 : All bar two lines (with with text) have have duplicates
5 : in in them them them!
11 : The following following line is just just spaces!
13 : That's all all all, folks!


Andy

PS The next challenge is to deal with punctuation!