Searching for a specific value in a file using C++

OS: Windows 10, Visual Studio 2015

I am working on a program and got stuck adding a "search in file" function to it. I want it to open file.txt and search for a specific word, then go down one line and check whether the word under it is, for example, "banana". If it is, the program goes down one more line and takes the number below that word. If it is not, it searches for the word again and repeats the check. I want to do this several times for several values.

For example, here is the contents of the file:

      123456         678942
       kg/s           Pa
       26.87         6.58E6


If the user inputs 123456, the program searches for this number and checks the word on the next line. If it is "kg/s", it goes down one more line, takes the number, and stores it in an array. If it is not kg/s, it keeps searching until it finds a match. The same applies to 678942: if the word on the next line is Pa, it takes the value under it.

I have been trying and searching for a way to do that, but I couldn't, since I'm still new to programming with C++.

Thank you in advance
Thinking of it, you could load the content of the file into some variable (basically a string), separating the contents of different lines and columns using the whitespace in the text file. Cumbersome though.
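
A minimal sketch of that idea, assuming the columns are separated only by whitespace. The names Table and load_table are made up for illustration; they are not from any library:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type: one inner vector per line of the file,
// one string per whitespace-separated token on that line.
using Table = std::vector<std::vector<std::string>>;

// Reads every line from the stream and splits it on whitespace.
Table load_table( std::istream& in )
{
    Table table;
    std::string line;
    while (std::getline( in, line ))
    {
        std::istringstream ss( line );
        std::vector<std::string> row;
        std::string token;
        while (ss >> token) row.push_back( token );
        table.push_back( row );   // a blank line becomes an empty row
    }
    return table;
}
```

With the sample file, table[0][0] would be "123456", table[1][0] "kg/s" and table[2][0] "26.87", so the search becomes a comparison of equal column indices in consecutive rows.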

Aceix.
Hmm, interesting.

The important trick is the distance from the beginning of the line of the 'found' number.
The lines under it (should be) lined up with it, right?

Stated another way, there should be the same number of columns in front of the column the information is found on, right?

(I cannot assume that the file is purely tabular?)
#include <algorithm>
#include <deque>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
  std::vector <std::string> lines;

  // load from file
  {
    std::ifstream f( "foo.txt" );
    std::string s;
    while (std::getline( f, s )) lines.emplace_back( s );
  }

  // find something
  std::cout << "Find what number? ";
  std::string number;
  std::getline( std::cin, number );

  std::cout << "Units? ";
  std::string units;
  std::getline( std::cin, units );

  // Search every line (except the last two).
  // Written as n + 2 < size() so the unsigned subtraction cannot wrap around on short files.
  for (std::size_t n = 0; n + 2 < lines.size(); n++)
  {
    // Should we not bother messing with this line?
    if (lines[ n ].find( number ) == std::string::npos) continue;

    // A little function to split a line into a record of columns
    auto split = []( const std::string& record ) -> std::vector <std::string>
    {
      std::vector <std::string> data;
      std::istringstream ss( record );
      std::string s;
      while (ss >> s) data.emplace_back( s );
      return data;
    };

    // For each matching column in the current line...
    std::vector <std::string> r0 = split( lines[ n ] );
    for (std::size_t col = 0; col < r0.size(); col++)
      if (r0[ col ] == number)
      {
        // If the corresponding column in the second line matches...
        std::vector <std::string> r1 = split( lines[ n + 1 ] );
        if (col < r1.size() && r1[ col ] == units)
        {
          // Then we've found it. Get the corresponding column in the third line.
          std::vector <std::string> r2 = split( lines[ n + 2 ] );
          if (col >= r2.size()) continue;  // third line has too few columns: keep looking
          std::cout << number << "\n" << units << "\n" << r2[ col ] << "\n";
          return 0;
        }
      }
  }
  std::cout << "Alas, not found!\n";
}

Enjoy!

[edit] Added some improved comments.
Um, also, if your input data file is HUGE, you can improve search times significantly by memoizing the split lines -- you only need to keep up to two in memory at a time.

Let me know if that's the case.
Thanks Aceix & Duoas for the help, I have to understand the code line by line now :)

> The lines under it (should be) lined up with it, right?


Yes, but I think sometimes the number has so many digits that it is a little wider than the number in the first line:

                                1234567
                                   K      
                               594.695489
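
If I understand the approach correctly, this should not matter as long as the lookup counts whitespace-separated tokens instead of character positions. A small sketch to check that, with made-up helper names (tokens, value_below), assuming no column is ever left empty on any of the three lines:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a line into whitespace-separated tokens.
std::vector<std::string> tokens( const std::string& line )
{
    std::istringstream ss( line );
    std::vector<std::string> result;
    std::string s;
    while (ss >> s) result.push_back( s );
    return result;
}

// Returns the token in the third line that sits in the same column (by token
// index) as `number` in the first line and `units` in the second line.
// Returns an empty string when the three lines do not match.
std::string value_below( const std::string& number_line,
                         const std::string& units_line,
                         const std::string& value_line,
                         const std::string& number,
                         const std::string& units )
{
    const std::vector<std::string> r0 = tokens( number_line );
    const std::vector<std::string> r1 = tokens( units_line );
    const std::vector<std::string> r2 = tokens( value_line );
    for (std::size_t col = 0; col < r0.size(); ++col)
        if (r0[col] == number
            && col < r1.size() && r1[col] == units
            && col < r2.size())
            return r2[col];
    return "";
}
```

The character offsets of the three tokens above differ, but each one is token 0 on its line, so the index comparison still pairs them up.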



> Um, also, if your input data file is HUGE, you can improve search times significantly by memoizing the split lines -- you only need to keep up to two in memory at a time.


Yes, it has more than 30k lines.

The area I need to search is always in the last 500 lines of the file. Is there a way to skip to these last 500 lines?

Thank you again
You have to read the file twice - once discarding all data and just counting lines, the second time discarding all data and stopping just before the 500th-from-last line.

EDIT: As JLBorges points out below, even if your lines are 10k characters you'd still be using less than half a gig from storing everything in memory.
> Yes, it has more than 30k lines.
> is there a way to skip to these last 500 lines?

Read the whole file into memory, if the average size of a line is only, say, 10000 characters. (30K * 10K == 300M)

If that is not possible, read into a circular buffer. Something like this, perhaps:

#include <iostream>
#include <string>
#include <fstream>
#include <deque>
#include <algorithm>
#include <vector>
#include <iterator>

struct circular_buffer
{
    explicit circular_buffer( std::size_t buffsz = 500 ) : sz(buffsz) {}

    void push( std::string&& line )
    {
        buffer.push_front( std::move(line) ) ; // note: rbegin(), rend() in move_lines()
        // Not static: a static local would be initialised once and shared by every buffer, whatever its sz
        const std::size_t max_cache_size = std::max( sz, std::size_t(1000) ) ;
        if( buffer.size() > max_cache_size ) buffer.resize(sz) ;
    }

    std::vector<std::string> move_lines()
    {
        if( buffer.size() > sz ) buffer.resize(sz) ;
        // http://en.cppreference.com/w/cpp/iterator/move_iterator
        return { std::make_move_iterator( buffer.rbegin() ), std::make_move_iterator( buffer.rend() ) } ;
    }

    std::size_t sz ;
    std::deque<std::string> buffer ;
};

std::vector<std::string> get_last_n_lines( std::ifstream file, std::size_t n )
{
    circular_buffer buffer(n) ;
    std::string line ;
    while( std::getline( file, line ) ) buffer.push( std::move(line) ) ;
    return buffer.move_lines() ;
}

int main()
{
    for( std::string line : get_last_n_lines( std::ifstream( __FILE__ ), 10  ) ) std::cout << line << '\n' ;
}
// **** last line **** 

http://coliru.stacked-crooked.com/a/74b60c474939a46c
Here's a current solution:

#include <algorithm>
#include <cmath>
#include <ctime>
#include <deque>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

//----------------------------------------------------------------------------
// JLBorges's Circular Buffer for the last N lines of a file
// Modified to hold any element type and to return the lines in any container (e.g. std::deque<>)
//
template <typename ElementType, std::size_t MaxCacheSize = 1000>
struct circular_buffer
{
    explicit circular_buffer( std::size_t buffsz = 500 ) : sz(buffsz) {}

    void push( ElementType&& line )
    {
        buffer.push_front( std::move(line) ) ; // note: rbegin(), rend() in move_lines()
        // Not static: a static local would be initialised once and shared by every buffer, whatever its sz
        const std::size_t max_cache_size = std::max( sz, MaxCacheSize ) ;
        if( buffer.size() > max_cache_size ) buffer.resize(sz) ;
    }

    template <typename Container>
    Container move_lines()
    {
        if( buffer.size() > sz ) buffer.resize(sz) ;
        // http://en.cppreference.com/w/cpp/iterator/move_iterator
        return { std::make_move_iterator( buffer.rbegin() ), std::make_move_iterator( buffer.rend() ) } ;
    }

    std::size_t sz ;
    std::deque<ElementType> buffer ;
};

template <typename Container>
Container get_last_n_lines( std::ifstream file, std::size_t n )
{
    circular_buffer<std::string> buffer(n) ;
    std::string line ;
    while( std::getline( file, line ) ) buffer.push( std::move(line) ) ;
    return buffer.move_lines<Container>() ;
}

//----------------------------------------------------------------------------
// A parsed-into-columns line AND its 'number' for memoization.
//
struct record
{
    typedef std::vector <std::string> items_type;
    
    std::size_t line_number;
    items_type  items;
    
    record() = default;
    record( std::size_t line_number, items_type items ):  // by value, so std::move below really moves
        line_number( line_number ),
        items( std::move( items ) )
        { }
};

//----------------------------------------------------------------------------
// Split the line into whitespace-delineated columns
//
record::items_type split( const std::string& record )
{
    std::istringstream ss( record );
    return
    { 
        std::make_move_iterator( std::istream_iterator <std::string> ( ss ) ),
        std::make_move_iterator( std::istream_iterator <std::string> () )
    };
}

//----------------------------------------------------------------------------
template <typename Container>
std::string find_match( const std::string& s1, const std::string& s2, const Container& lines )
{
    // There must be at least three rows
    if (lines.size() < 3) return "";
    
    // Here we memoize the parsed records
    circular_buffer <record, 10> records( 2 );
    
    // This function looks up an already-parsed record.
    // If not found, it parses the line and adds it to the memoized list of records.
    auto get_record = [ &records ]( std::size_t line_number, const std::string& line )
    {
        for (const auto& buffer : records.buffer)  // by reference: avoid copying every cached record
            if (buffer.line_number == line_number)
                return buffer.items;
        records.push( record( line_number, split( line ) ) );
        return records.buffer[0].items;
    };
    
    // Search every line except the last two for s1
    std::size_t num_lines   = lines.size() - 2;
    std::size_t line_number = -1;
    for (const auto& line : lines) 
        if (++line_number >= num_lines) break; 
        else
        {
            // Should we ignore this line?
            if (line.find( s1 ) == line.npos) continue;
            
            // line (string) --> record (columns)
            record::items_type r1 = get_record( line_number, line );
            
            // For each matching column in the current record
            for (std::size_t col = 0; col < r1.size(); col++)
                if (r1[col] == s1)
                {
                    // If the corresponding column in the second line matches s2
                    record::items_type r2 = get_record( line_number + 1, lines[ line_number + 1 ] );
                    if (col < r2.size() && r2[ col ] == s2)
                    {
                        // Guard against a third line with fewer columns
                        record::items_type r3 = split( lines[ line_number + 2 ] );
                        if (col < r3.size()) return r3[ col ];
                    }
                }
        }
    
    return "";
}

//----------------------------------------------------------------------------
void seconds( std::clock_t a, std::clock_t b )
{
  // std::clock() returns std::clock_t ticks: convert to seconds, then split into h:mm:ss
  double   t = double( b - a ) / CLOCKS_PER_SEC;
  unsigned total_minutes = unsigned( t / 60 );
  double   s = t - 60.0 * total_minutes;
  unsigned m = total_minutes % 60;
  unsigned h = total_minutes / 60;
  std::cout 
    <<                   std::setfill( '0' ) << h << ":"
    << std::setw( 2 ) << std::setfill( '0' ) << m << ":"
    << std::setw( 7 ) << std::setfill( '0' )
    << std::fixed << std::setprecision( 4 )  << s
    << "\n";
}

//----------------------------------------------------------------------------
int main( int argc, char** argv )
{
  if (argc != 4)
  {
    std::cerr << "usage: " << argv[0] << " FILENAME NUMBER UNITS\n";
    return 1;
  }

  const auto t1 = std::clock();
  typedef std::deque <std::string> Lines;
  Lines lines = get_last_n_lines <Lines> ( std::ifstream( argv[1] ), 500 );
  
  const auto t2 = std::clock();
  std::string s = find_match( argv[2], argv[3], lines );
  
  const auto t3 = std::clock();
  if (s.empty()) std::cout << "(not found)\n";
  else           std::cout << s << "\n";
  
  std::cout << "read file: ";  
  seconds( t1, t2 );
  
  std::cout << "search:    ";
  seconds( t2, t3 );
}

/*
     123456         678942
      kg/s           Pa
      26.87        6.58E6
*/

Tested on large files (about 6 million lines with 20 columns each) -- performs at about the best C++ I/O can handle: ~7 seconds, which is slower than I'd like...

Because of JLBorges's circular_buffer cache the actual search time is negligible, even when the first match occurs twenty times on every line but with no corresponding match on the line below it.

Hope this helps.