Sorting large txt file

I am trying to sort a txt file which is about 180MB with 8.6 million lines of data. The file contains resource information on map coordinate locations. The first and second columns keep x and y coordinates. I want to sort by x, then by y. I am not yet sorting by y, but I hope to if I can get this working well. To make it easier for me to learn how to sort this, I have broken off a tenth of the file. It is this file below that I am working with. The program works just fine, except that it will take it hours to finish. What might I do to greatly increase its speed?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <array>

int main()
{
	std::string line;
	std::string transfer;
	int xCoord = 0;
	int yCoord = 0;
	//int wood = 0;
	//int clay = 0;
	//int iron = 0;
	//int stone = 0;
	//int food = 0;

	// Output file opened to save sorted data.
	std::ofstream outFileN1000("IllyriadMapSortN1000V02.txt");

	// The lines of input data begin with xCoord's that range from -1000 to -800.
	for (int outSortIndex = -1000; outSortIndex < -800; outSortIndex++)
	{
		// Input file is reopened at its beginning.
		std::ifstream inFileN1000("IllyriadMapSortN1000.txt");

		// The first line of the input file contains column titles which are discarded.
		std::getline(inFileN1000, line);

		// A loop is begun which will read every line of input.
		while (std::getline(inFileN1000, line))
		{
			// The X Coordinate is extracted from the current line.
			std::istringstream tokenizer(line);
			std::getline(tokenizer, transfer, '|');
			std::stringstream(transfer) >> xCoord;

			// xCoord is saved to output file if it matches the current
			// sequential itteration of outSortIndex.
			if (xCoord == outSortIndex)
				outFileN1000 << line << "\n";

		}
		// The input file is closed so that it can begin again with the next
		// itteration of outSortIndex.
		inFileN1000.close();

		// This statement is used to measure the speed of the
		// program as it is running.
		std::cout << "outSortIndex = " << outSortIndex << std::endl;
	}
}


txt sample:
-904|344|5|4|6|5|5
-862|344|5|3|7|5|5
-857|344|5|4|5|6|5
-855|344|5|4|5|6|5
-853|344|6|4|5|5|5
-848|344|5|4|5|6|5
-845|344|5|4|5|6|5
-833|344|5|4|5|6|5
-831|344|5|4|5|6|5
-830|344|5|4|5|6|5
-829|344|5|4|5|6|5
-821|344|5|4|5|6|5
-814|344|5|4|5|6|5
-810|344|5|4|5|6|5
-808|344|0|0|0|0|4
-803|344|5|4|5|6|5
-890|345|5|5|6|4|5
-867|345|5|5|4|6|5
-865|345|5|5|4|6|5
-893|344|5|5|4|6|5
-892|344|5|5|4|6|5
-868|344|3|7|5|5|5
-849|344|5|5|4|6|5
Last edited on
Don't print progress. Cout is slow and printing millions of times will kill your performance.

I personally read the whole file at once and split it up myself rather than read line by line. I find this is almost always faster for big files.

But try losing that print statement as your first move.
> txt file which is about 180MB with 8.6 million lines of data.

180 MB is not a lot; we can easily hold the entire contents in memory and perform an in-memory sort.

Something along these lines, perhaps:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <utility>
#include <algorithm>

struct line_data
{
    std::pair<int,int> xy_coords ; // x and y coordinates
    std::string text ; // original text of the line
    line_data( int x, int y, std::string txt ) : xy_coords(x,y), text( std::move(txt) ) {}
};

std::vector<line_data> get_selected_lines( std::istream& file, int min_x_coord, int max_x_coord )
{
    std::vector<line_data> filtered_lines ; // vector to hold selected lines

    std::istringstream stm ; // don't want to rely on the optimiser eliding the
                                // construction/destruction of the string stream
                                // during each iteration of the loop
    std::string line ;
    while( std::getline( file, line ) ) // for each line in the input file
    {
        stm.clear() ; // clear a possible error state
        stm.str(line) ;

        int x_coord, y_coord ;
        char separator ;
        if( stm >> x_coord >> separator >> y_coord && // co-ordinates were successfully read
            x_coord >= min_x_coord && x_coord <= max_x_coord ) // and x coordinate is within range
        {
            // add this line to the vector
            filtered_lines.emplace_back( x_coord, y_coord, std::move(line) );
        }
    }

    return filtered_lines ;
}

int main()
{
    const int min_x_coord = -1000 ;
    const int max_x_coord = -800 ;

    std::ifstream inFileN1000( "IllyriadMapSortN1000.txt" );
    std::vector<line_data> selected_lines = get_selected_lines( inFileN1000, min_x_coord, max_x_coord ) ;

    // predicate to sort the vector, on x coordinate first, then on y coordinate
    const auto cmp_coords = [] ( const line_data& a, const line_data& b )
    { return a.xy_coords < b.xy_coords ; };

    std::sort( std::begin(selected_lines), std::end(selected_lines), cmp_coords ) ; // sort the vector

    // Output file opened to save sorted data.
    std::ofstream outFileN1000("IllyriadMapSortN1000V02.txt");

    for( const line_data& ldata : selected_lines ) // for each item in the sorted vector
        outFileN1000 << ldata.text << '\n' ; // write the original line to the output file
}
Thank you very much for your solution. I've been studying it, and I would like some clarification as to what this does: line_data(int x, int y, std::string txt) : xy_coords(x, y), text(std::move(txt)) {} where it is written within the struct. I also have added a few cout statements to understand what is happening while it is running. I changed out my small one-tenth size file for the larger 180M file as suggested. After doing this and running it, it actually runs into an error once vectorCycles reaches 8 million. Again, this larger file has 8.6 million lines in it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <utility>
#include <algorithm>

struct line_data
{
	std::pair<int, int> xy_coords; // x and y coordinates
	std::string text; // original text of the line
	line_data(int x, int y, std::string txt) : xy_coords(x, y), text(std::move(txt)) {}
};

std::vector<line_data> get_selected_lines(std::istream& file, int min_x_coord, int max_x_coord)
{
	int vectorCycles = 0;
	std::vector<line_data> filtered_lines; // vector to hold selected lines

	std::istringstream stm; // don't want to rely on the optimiser eliding the
							// construction/destruction of the string stream
							// during each iteration of the loop
	std::string line;
	while (std::getline(file, line)) // for each line in the input file
	{
		stm.clear(); // clear a possible error state
		stm.str(line);

		int x_coord, y_coord;
		char separator;
		if (stm >> x_coord >> separator >> y_coord && // co-ordinates were successfully read
			x_coord >= min_x_coord && x_coord <= max_x_coord) // and x coordinate is within range
		{
			// add this line to the vector
			filtered_lines.emplace_back(x_coord, y_coord, std::move(line));
		}
		vectorCycles++;
		if (vectorCycles % 100'000 == 0)
			std::cout << "vectorCycles = " << (static_cast<double>(vectorCycles) / 1'000'000) 
			<< "M" << std::endl;
	}

	return filtered_lines;
}

int main()
{
	const int min_x_coord = -1000;
	const int max_x_coord = 1000;

	std::ifstream inFileN1000("IllyriadMapFileEdit02.txt");
	std::vector<line_data> selected_lines = get_selected_lines(inFileN1000, min_x_coord, max_x_coord);

	std::cout << "Mark 01." << std::endl;
	// predicate to sort the vector, on x coordinate first, then on y coordinate
	const auto cmp_coords = [](const line_data& a, const line_data& b)
	{ return a.xy_coords < b.xy_coords; };

	std::cout << "Mark 02. " << std::endl;
	std::sort(std::begin(selected_lines), std::end(selected_lines), cmp_coords); // sort the vector
	
        // Output file opened to save sorted data.
	std::ofstream outFileN1000("IllyriadMapFileEdit03.txt");

	std::cout << "Mark 03." << std::endl;
	for (const line_data& ldata : selected_lines) // for each item in the sorted vector
		outFileN1000 << ldata.text << '\n'; // write the original line to the output file
} 
Last edited on
What kind of error, what's the exact message ?
> I would like some clarification as to what this does:
> line_data(int x, int y, std::string txt) : xy_coords(x, y), text(std::move(txt)) {}
> where it is written within the struct.

It is the constructor for the class:
the member xy_coords is initialised with x,y and the member text is initialised with txt.

We could also write it as:
line_data(int x, int y, const std::string& txt) : xy_coords(x, y), text(txt) {}
The string would then be copied instead of being moved.


> After doing this and running it, it actually runs into an error once vectorCycles reaches 8 million.

Verify that the program is not running out of memory.
What does this print out?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <utility>
#include <algorithm>

struct line_data
{
    std::pair<int, int> xy_coords; // x and y coordinates
    std::string text; // original text of the line
    line_data(int x, int y, std::string txt) : xy_coords(x, y), text(std::move(txt)) {}
};

std::vector<line_data> get_selected_lines(std::istream& file, int min_x_coord, int max_x_coord)
{
    std::vector<line_data> filtered_lines; // vector to hold selected lines

    int total_recs_read = 0;
    int filtered_recs = 0 ;

    try
    {
        std::istringstream stm; // don't want to rely on the optimiser eliding the
        // construction/destruction of the string stream
        // during each iteration of the loop
        std::string line;
        while (std::getline(file, line)) // for each line in the input file
        {
            ++total_recs_read ;

            stm.clear(); // clear a possible error state
            stm.str(line);

            int x_coord, y_coord;
            char separator;
            if (stm >> x_coord >> separator >> y_coord && // co-ordinates were successfully read
            x_coord >= min_x_coord && x_coord <= max_x_coord) // and x coordinate is within range
            {
                // add this line to the vector
                filtered_lines.emplace_back(x_coord, y_coord, std::move(line));
                ++filtered_recs ;
            }

            if( total_recs_read % 1'000'000 == 0 )
            {
                std::cout << "total_recs_read = " << total_recs_read << '\n'
                          << "  filtered_recs = " << filtered_recs << '\n' ;
            }
        }
    }

    catch( const std::bad_alloc& )
    {
        std::cout << "*** allocation failure *** \n"
                  << "    vector size: " << filtered_lines.size() << '\n'
                  << "       capacity: " << filtered_lines.capacity() << '\n' ;
    }
    catch( const std::exception& ) { std::cout << "some other error\n" ; }

    return filtered_lines;
}

int main()
{
    const int min_x_coord = -1000;
    const int max_x_coord = 1000;

    std::ifstream inFileN1000("IllyriadMapFileEdit02.txt");
    std::vector<line_data> selected_lines = get_selected_lines(inFileN1000, min_x_coord, max_x_coord);

    std::cout << "Mark 01." << std::endl;
    // predicate to sort the vector, on x coordinate first, then on y coordinate
    const auto cmp_coords = [](const line_data& a, const line_data& b)
    { return a.xy_coords < b.xy_coords; };

    std::cout << "Mark 02. " << std::endl;
    std::sort(std::begin(selected_lines), std::end(selected_lines), cmp_coords); // sort the vector

    // Output file opened to save sorted data.
    std::ofstream outFileN1000("IllyriadMapFileEdit03.txt");

    std::cout << "Mark 03." << std::endl;
    for (const line_data& ldata : selected_lines) // for each item in the sorted vector
        outFileN1000 << ldata.text << '\n'; // write the original line to the output file
}
total_recs_read = 1000000
filtered_recs = 999999
total_recs_read = 2000000
filtered_recs = 1999999
total_recs_read = 3000000
filtered_recs = 2999999
total_recs_read = 4000000
filtered_recs = 3999999
total_recs_read = 5000000
filtered_recs = 4999999
total_recs_read = 6000000
filtered_recs = 5999999
total_recs_read = 7000000
filtered_recs = 6999999
*** allocation failure ***
vector size: 7972438
capacity: 7972438
If you are really running out of memory, but are prepared to use temporary files on disk, then you can write successive (sorted) subsets to the temporary files, then recover the data from the front of these in order.

No idea how slow the file read/writes will make it, though.

Change recordsPerFile as necessary. Choose what you want to sort on using function cmp().

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;

#define SP << " " <<

//======================================================================

class Resource{
public:
   int x, y;
   int wood, clay, iron, stone, food;

   Resource(){};
   Resource( string line );
}; 


Resource::Resource( string line )                          // constructor
{
   replace( line.begin(), line.end(), '|', ' ' );               // not sure if this is actually necessary
   stringstream( line ) >> x >> y >> wood >> clay >> iron >> stone >> food;
}


ostream & operator << ( ostream &strm, Resource &res )     // overload << for Resource
{
   strm << res.x SP res.y SP res.wood SP res.clay SP res.iron SP res.stone SP res.food << '\n';
   return strm;
}


istream & operator >> ( istream &strm, Resource &res )     // overload >> for Resource
{
   strm >> res.x >> res.y >> res.wood >> res.clay >> res.iron >> res.stone >> res.food;
   return strm;
}


bool cmp( Resource a, Resource b )                         // decide how you want to sort
{
   return ( a.x < b.x );
}


//======================================================================


string fileNum( string basename, int num )                 // Construct a numbered filename
{
   stringstream ss;   
   ss << basename << num;                              
   string name = ss.str();     
   return name;
}


//======================================================================


void tempWrite( string filename, vector<Resource> &data )  // Write a temporary file
{
   sort( data.begin(), data.end(), cmp );                  // Sort resource data (according to cmp() )

   ofstream out( filename.c_str() );                       // Output resource data
   for ( auto e : data ) out << e;
   out.close();

   data.clear();                                           // Clear current collection
}


//======================================================================

int main()
{
   const int recordsPerFile = 500000;                      // Change as required

   vector<Resource> data;
   vector<string> tempFiles;
   string line;
   string basename = "temp";
   string filename;
   int totalRecords = 0;
   int records = 0;
   int tempNum = 0;


   // Input data, and distribute to as many temporary files as necessary
   ifstream in( "IllyriadMapSortN1000.txt" );
   getline( in, line );                                    // Ignore first line; it contains headers
   while ( getline( in, line ) )
   {
      totalRecords++;
      records++;

      if ( records > recordsPerFile )                      // If exceeding quota, dump to file and start new collection
      {
         filename = fileNum( basename, tempNum );
         tempFiles.push_back( filename );
         tempWrite( filename, data );
         tempNum++;
         records = 1;
      }
      Resource res( line );
      data.push_back( res );
   }

   // Output last quota
   filename = fileNum( basename, tempNum );                 
   tempFiles.push_back( filename );
   tempWrite( filename, data );                  
   tempNum++;

   in.close();
   cout << "\nNumber of records read = " << totalRecords << '\n';


   // Now open all files and compare data, sending to output file
   ofstream out( "out.txt" );
   vector<bool> filestatus( tempNum );
   vector<Resource> current( tempNum );
   vector<ifstream> infile( tempNum );

   // Open all files and get first value from each file
   for ( int i = 0; i < tempNum ; i++ )
   {
      infile[i].open( tempFiles[i].c_str() );
      filestatus[i] = ( infile[i] >> current[i] );     
   }

   Resource resmin;
   int imin;
   do
   {
      // Find minimum resource
      imin = -1;
      for ( int i = 0; i < tempNum; i++ )                  
      {
         if ( !filestatus[i] ) continue;                             // This file is finished, so skip
         if ( imin == -1 || cmp( current[i], resmin ) )
         {
            resmin = current[i];
            imin = i;
         }
      }

      // Output the minimum, then replace the current value for that file
      if ( imin >= 0 )                                       
      {
         out << current[imin];
         filestatus[imin] = ( infile[imin] >> current[imin] ); 
      }
   } while ( imin >= 0 );                                            // Continue while there are still unexhausted files

   // Tidy up
   for ( int i = 0; i < tempNum ; i++ ) infile[i].close();   
   out.close();
}

Try this first:

Modify the function get_selected_lines to:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
std::vector<line_data> get_selected_lines( std::istream& file, int min_x_coord, int max_x_coord,
                                           std::size_t MAX_RECS = 9'000'000 )
{
    std::vector<line_data> filtered_lines; // vector to hold selected lines

    try
    {
        std::cout << "attempting to reserve space for " << MAX_RECS << " records\n" ;
        filtered_lines.reserve( MAX_RECS ) ; // reserve space for MAX_RECS records
        // defensive: not really required
        if( filtered_lines.capacity() < MAX_RECS ) throw std::bad_alloc() ;
    }

    catch( const std::bad_alloc& )
    {
        std::cout << "*** allocation failure *** \n" ;
        return filtered_lines ;
    }

    std::cout << "vector capacity is: " << filtered_lines.capacity() << '\n' ;
    int total_recs_read = 0;
    int filtered_recs = 0 ;

    std::istringstream stm; // don't want to rely on the optimiser eliding the
    // construction/destruction of the string stream
    // during each iteration of the loop
    std::string line;
    while (std::getline(file, line)) // for each line in the input file
    {
        ++total_recs_read ;

        stm.clear(); // clear a possible error state
        stm.str(line);

        int x_coord, y_coord;
        char separator;
        if (stm >> x_coord >> separator >> y_coord && // co-ordinates were successfully read
        x_coord >= min_x_coord && x_coord <= max_x_coord) // and x coordinate is within range
        {
            // add this line to the vector
            filtered_lines.emplace_back(x_coord, y_coord, std::move(line));
            ++filtered_recs ;
        }

        if( total_recs_read % 1'000'000 == 0 )
        {
            std::cout << "total_recs_read = " << total_recs_read << '\n'
                      << "  filtered_recs = " << filtered_recs << '\n' ;
        }
    }

    return filtered_lines;
}


If there is no allocation failure now, nothing more needs to be done.

If there is an allocation failure, then the simplest solution is to split the data into two sorted segments.
For instance, one segment with x coordinates from -1000 to 0, and another with x coordinates from 1 to 1000.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <utility>
#include <algorithm>

struct line_data
{
    std::pair<int, int> xy_coords; // x and y coordinates
    std::string text; // original text of the line
    line_data(int x, int y, std::string txt) : xy_coords(x, y), text(std::move(txt)) {}
};

std::vector<line_data> get_selected_lines( std::istream& file, int min_x_coord, int max_x_coord,
                                           std::size_t MAX_RECS )
{
    std::vector<line_data> filtered_lines; // vector to hold selected lines
    filtered_lines.reserve( MAX_RECS ) ; // reserve space for MAX_RECS records
    std::cout << "vector capacity is: " << filtered_lines.capacity() << '\n' ;

    int total_recs_read = 0;
    int filtered_recs = 0 ;

    std::istringstream stm; // don't want to rely on the optimiser eliding the
    // construction/destruction of the string stream
    // during each iteration of the loop
    std::string line;
    while (std::getline(file, line)) // for each line in the input file
    {
        ++total_recs_read ;

        stm.clear(); // clear a possible error state
        stm.str(line);

        int x_coord, y_coord;
        char separator;
        if (stm >> x_coord >> separator >> y_coord && // co-ordinates were successfully read
        x_coord >= min_x_coord && x_coord <= max_x_coord) // and x coordinate is within range
        {
            // add this line to the vector
            filtered_lines.emplace_back(x_coord, y_coord, std::move(line));
            ++filtered_recs ;
        }

        if( total_recs_read % 1'000'000 == 0 )
        {
            std::cout << "total_recs_read = " << total_recs_read << '\n'
                      << "  filtered_recs = " << filtered_recs << '\n' ;
        }
    }

    return filtered_lines;
}

void create_temp_file( std::string input_file_name, std::string temp_file_name,
                       int min_x_coord, int max_x_coord, int MAX_RECS = 6'000'000 )
{
    std::ifstream in_file(input_file_name);
    std::vector<line_data> selected_lines = get_selected_lines( in_file, min_x_coord, max_x_coord, MAX_RECS );

    // predicate to sort the vector, on x coordinate first, then on y coordinate
    staric const auto cmp_coords = [](const line_data& a, const line_data& b)
    { return a.xy_coords < b.xy_coords; };

    std::sort(std::begin(selected_lines), std::end(selected_lines), cmp_coords); // sort the vector

    // Output file opened to save sorted data.
    std::ofstream out_file(temp_file_name);

    for (const line_data& ldata : selected_lines) // for each item in the sorted vector
        out_file << ldata.text << '\n'; // write the original line to the output file
}

int main()
{
    const char* const input_file_name = "IllyriadMapFileEdit02.txt" ;
    const char* const temp_file_01 = "IllyriadMapFile_part_01.txt" ;
    const char* const temp_file_02 = "IllyriadMapFile_part_02.txt" ;
    const char* const output_file_name = "IllyriadMapFileEdit03.txt" ;

    // write sorted records for x coordinates -1000 to 0 into temp_file_01
    create_temp_file( input_file_name, temp_file_01, -1000, 0 ) ;

    // write sorted records for x coordinates 1 to 1000 into temp_file_02
    create_temp_file( input_file_name, temp_file_02, -1, 1000 ) ;

    // concatenate the two temp files to get our sorted file for x coordinates from -1000 to 1000
    std::ofstream out_file(output_file_name) ;
    out_file << std::ifstream(temp_file_01).rdbuf() ;
    out_file << std::ifstream(temp_file_02).rdbuf() ;
}



Modifying get_selected_lines worked. Thanks. I have a lot to learn. :)
Topic archived. No new replies allowed.