Deleting repeated lines in a text file

Hi everyone,

I have a text file with repeated lines, and I would like to get rid of the duplicate information. Can you help me with an algorithm to achieve this?

Example of my text file (the actual file may be huge, up to 500MB):

1
2
3
4
5
6
7
8
9
10
+Hello
+Bye
*The sun
-The moon
+Bye
/One coin
+Bye
+Bye
+Hello
*The sun


And I would expect to get something like this:

1
2
3
4
5
6
7
8
9
10
+Hello
+Bye
*The sun
-The moon
/One coin


I know how to open and read a file with fstream and getline(), but I don't know how to make the comparison.

Thanks.
Each time you read a line, check whether it is already stored in a list.

If it is, ignore the line and proceed to the next one.

If it is not in the list, add it.

Once all lines from the file have been processed, output the list to a new file, which should then contain no duplicate lines.
Perhaps a set would be useful? As you read a line, try to put the line into a set.
http://www.cplusplus.com/reference/set/set/
It probably won't preserve the order in which you see the lines. Would that be a problem?
Actually, a set would work nicely if we directly output each line to the output file after determining that it does not exist in the set and adding it.

At the end the set itself may not preserve the original order, but that no longer matters, since the output file was written in input order.
Thanks to both of you, SIK and booradley60. It seems like using sets may solve my problem. Sorry if it was too obvious from the beginning, but I didn't know about the existence of sets.

And yes, the original order is not important.
Yup, that would work nicely.