Parse a file as efficiently as possible (with no errors)

I've a file which is divided into two parts:

1. The header
2. The data part

The header has of course a different format than the data part. The header part contains lines that look like this:

KEY: VALUE

or

KEY : VALUE

or

KEY:VALUE

in theory, clearly.

The data part starts with a delimiter, which in my specific case should always be NODE_COORD_SECTION, and ends with another delimiter, which should always be EOF.

In the header, I only need to keep track of two key-value pairs: the DIMENSION and the BEST_KNOWN.

Each line or entry in the data section should be formatted as

INDEX X_COORDINATE Y_COORDINATE

I'm parsing the header and the data section differently. The first one is with getline and the find method of the string class, and the second with the overloaded operator >> of the ifstream object. I've read on a Stack Overflow's post that mixing these two ways of parsing lines could be dangerous and thus should be avoided, but I thought that for my case it would be the best way to do it.

I want my code first to be correct, but I would like it to also be as efficient and fast as possible, because this parsing of the file, structured as I've been describing above, is part of a program which must be as fast as possible.

Note, I'm throwing exceptions whenever a line is not formatted as I was expecting. I think this a good way of handling these kinds of errors in general, but I don't know if in C++ this is actually a good idea. Notice that I've already read something about exception handling in C++.

I think this is the relevant code for you:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
int dimension = -1;
int best_known = -1;
vector <unique_ptr<Node>> nodes;


void read_file(const char *file_name) {

    ifstream file(file_name);

    if (file.fail()) {
        throw runtime_error("file failed to open");
    }

    string line;
    size_t found_at;

    while (getline(file, line)) {

        if (line.find("BEST_KNOWN") != string::npos) {
            found_at = line.find(':');

            if (found_at == string::npos) {
                throw runtime_error("line containing 'BEST_KNOWN' formatted not as expected");
            }

            best_known = stoi(line.substr(found_at + 1));

        } else if (line.find("DIMENSION") != string::npos) {
            found_at = line.find(':');

            if (found_at == string::npos) {
                throw runtime_error("line containing 'DIMENSION' formatted not as expected");
            }

            dimension = stoi(line.substr(found_at + 1));

        } else if (line.find("NODE_COORD_SECTION") != string::npos) {

            if (dimension == -1 || best_known == -1) {
                throw runtime_error("dimension and best known result should have already been read");
            }

            unsigned index;
            double x, y;

            while (file >> index >> x >> y) {
                unique_ptr <Node> p(new Node(index, x, y));
                nodes.push_back(move(p));
            }

            break;
        }
    }

    file.close();
}
Last edited on
Can you show us the contents of the file you are trying to parse?
> The first one is with getline and the find method of the string class, and the second with the
> overloaded operator >> of the ifstream object. I've read on a Stack Overflow's post that mixing
> these two ways of parsing lines could be dangerous and thus should be avoided

Following unformatted input with std::getline() with formatted input with >> would not cause any problems.

If it is the other way around (formatted input with >> followed by unformatted input with std::getline()), we may need to extract and throw away the white space (typically a new line) remaining in the input buffer before the std::getline().

Re.
1
2
3
4
            while (file >> index >> x >> y) {
                unique_ptr <Node> p(new Node(index, x, y));
                nodes.push_back(move(p));
            }

This is idiomatic:
1
2
3
            while (file >> index >> x >> y) {
                nodes.push_back( std::make_unique<Node>(index, x, y) ) ;
            }

http://en.cppreference.com/w/cpp/memory/unique_ptr/make_unique
Thanks.

Currently, I'm not even using smart points, but I'm simply pushing objects (not allocated with new). Maybe later if I realize that by using smart pointers performance is improved, then I will change my code.
> Currently, I'm not even using smart points, but I'm simply pushing objects (not allocated with new).

That is the right way to do it. Favour value semantics - now and later.

Ideally, return a vector of values from the function:
1
2
3
4
5
6
7
8
std::vector<node> read_file( const char *file_name ) {

    std::vector<node> nodes_read ;
    
    // read the contents of the file and push_back/emplace_back nodes one by one

   return nodes_read ;     
}
Topic archived. No new replies allowed.