Merge and sort list of strings

I am looking for a function or algorithm to best merge and sort similar content between two lists of unordered strings each in individual files (very large files ~200mb each).

For example, these files have a common first string and are merged based on them:
File 1:
1
2
3
4
5
red, apple
green, truck
blue, car
yellow, ball
orange, candy


File 2:
1
2
3
4
5
gold, necklace
green, tree
yellow, sticker
blue, water
red, bag


I am looking for the following output:

Output:
1
2
3
4
5
6
red, apple, bag
green, truck, tree
blue, car, water
yellow, ball, sticker
orange, candy
gold, necklace

Last edited on
I am looking for a function or algorithm to best merge and sort ...
https://en.wikipedia.org/wiki/Merge_sort

very large files ~200mb each
A typical desktop system will fit these into memory without issue, so you can do it all in memory.
If there is enough memory to spare (say, a gigabyte), std::map<> can do the sorting and merging for us.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
#include <map>
#include <set>
#include <string>
#include <iostream>
#include <sstream>

// assumption: elininate duplicates; mapped_type is std::set<std::string>
// (otherwise; mapped_type would be std::vector<std::string>)
using map_type = std::map< std::string, std::set<std::string> > ;

void insert( std::istream& stm, map_type& map )
{
    std::string first, second ;
    while( std::getline( stm >> std::ws, first, ',' ) &&
           std::getline( stm >> std::ws, second ) ) map[first].insert(second) ;
}

int main()
{
    std::istringstream file1
    (
        "red, apple\n"
        "green, truck\n"
        "blue, car\n"
        "yellow, ball\n"
        "orange, candy\n"
    ) ;

    std::istringstream file2
    (
        "gold, necklace\n"
        "green, tree\n"
        "yellow, sticker\n"
        "blue, water\n"
        "red, bag\n"
    ) ;

    map_type map ;
    insert( file1, map ) ;
    insert( file2, map ) ;

    for( const auto& pair : map )
    {
        std::cout << pair.first  ;
        for( const std::string& s : pair.second ) std::cout << ", " << s  ;
        std::cout << '\n' ;
    }
}

http://coliru.stacked-crooked.com/a/1bea394d4c273979
Thank you, JLBorges, that works like a charm. Although I wasn't clear on one detail which you picked up on nonetheless. I am trying to do without eliminating duplicates. For example, note the underlined/bold for changes.

File 1:

1
2
3
4
5
6
red, apple
red, bird
green, truck
blue, car
yellow, ball
orange, candy



File 2:
1
2
3
4
5
gold, necklace
green, tree
yellow, sticker
blue, water
red, bag


Output: (Essentially File 1 lines with File 2 data appended)
1
2
3
4
5
6
7
red, apple, bag
red, bird, bag
green, truck, tree
blue, car, water
yellow, ball, sticker
orange, candy
gold, necklace
It gets a wee bit more elaborate:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#include <map>
#include <vector>
#include <string>
#include <iostream>
#include <sstream>

using map_type = std::multimap< std::string, std::vector<std::string> > ;

void insert_first( std::istream& stm, map_type& map )
{
    std::string first, second ;
    while( std::getline( stm >> std::ws, first, ',' ) &&
           std::getline( stm >> std::ws, second ) ) map.emplace( first, std::vector<std::string>( 1, second ) ) ;
}

void insert_second( std::istream& stm, map_type& map )
{
    std::string first, second ;
    while( std::getline( stm >> std::ws, first, ',' ) &&
           std::getline( stm >> std::ws, second ) ) 
    {
        auto bounds = map.equal_range(first) ;
        
        if( bounds.first == bounds.second ) // not found, insert key and value
            map.emplace( first, std::vector<std::string>( 1, second ) ) ;
            
        else for( auto iter = bounds.first ; iter != bounds.second ; ++iter ) // found, append the value to all keys
            iter->second.push_back(second) ;
    }
}

int main()
{
    std::istringstream file1
    (
        "red, apple\n"
        "red, bird\n"
        "green, truck\n"
        "blue, car\n"
        "yellow, ball\n"
        "orange, candy\n"
    ) ;

    std::istringstream file2
    (
        "gold, necklace\n"
        "green, tree\n"
        "yellow, sticker\n"
        "blue, water\n"
        "red, bag\n"
    ) ;

    map_type map ;
    insert_first( file1, map ) ;
    insert_second( file2, map ) ;

    for( const auto& pair : map )
    {
        std::cout << pair.first  ;
        for( const std::string& s : pair.second ) std::cout << ", " << s  ;
        std::cout << '\n' ;
    }
}

http://coliru.stacked-crooked.com/a/fb4a048c7c6205f9
Excellent! You sir make me jealous of your talent! Any particular reason why c++11? Or is it just preference.
> of your talent!

There is really no talent involved. Just familiarity with the standard C++ library.


> Any particular reason why c++11?

At this point in time, C++ is C++11. There is no good reason not to use something that is current.
We wouldn't want to use Windows 98 today would we? Why should it be any different with C++.

(Even if we do not use any feature added in C++11, just re-compiling legacy C++98 code (well-written) in C++11 would give us a performance boost.)
Topic archived. No new replies allowed.