Boost Tokenizer Class

I am using the Boost tokenizer class to parse very large (100 MB+) CSV files. The results are generally very good; however, my current implementation requires me to iterate over all the tokens when I only want some of them. Is there a way to read the tokens by providing an index number rather than iterating over the entire set?

For instance, if I have a file with 30 comma-separated data elements per line, I am currently tokenizing one line at a time, then iterating over all the tokens to pull out the 1st, 7th and 22nd tokens. While it works, it is rather slow. What I would like to do is request the 1st, 7th and 22nd tokens directly. Is this possible with the Boost tokenizer?

Below is the relevant code snippet.

int column = 1;  // 1-based index of the token currently being visited
tokenizer<escaped_list_separator<char> > tok(str);
for (tokenizer<escaped_list_separator<char> >::iterator beg = tok.begin(); beg != tok.end(); ++beg)
{
    switch (column)
    {
        case 1:  x = *beg; break;
        case 7:  y = *beg; break;
        case 22: z = *beg; break;
    }
    ++column;
}


Thanks
I have no experience with boost tokenizer, but I'll say no.

Surely you must agree that all values need to be parsed to know which cell is 7th. Consider the line
foo, "bar, baz", \" bar, baz \"
Commas, quotes and escape characters all need to be understood to say how many cells this line contains. The cost of actually creating the parsed string should not be too significant.
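To make that concrete, here is a minimal hand-rolled sketch, not Boost code. It assumes a comma separator, double quotes and a backslash escape, loosely modelled on what escaped_list_separator is described as handling. pick_fields is a hypothetical name of my own. Every character up to the last wanted field still has to be examined, but only the wanted fields are copied, and scanning stops once the last of them is complete.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): returns the fields found at the
// given zero-based column indices, which must be sorted in ascending order.
// Assumes ',' separates fields, '"' toggles quoting and '\' escapes the
// next character.
std::vector<std::string> pick_fields(const std::string& line,
                                     const std::vector<std::size_t>& wanted)
{
    assert(std::is_sorted(wanted.begin(), wanted.end()));

    std::vector<std::string> out;
    std::string field;
    std::size_t column = 0;   // index of the field being scanned
    std::size_t next = 0;     // position in `wanted` still to be filled
    bool in_quotes = false;

    // Loop one step past the last character so the final field gets closed.
    // Stop early once every wanted field has been collected.
    for (std::size_t i = 0; i <= line.size() && next < wanted.size(); ++i)
    {
        const bool at_end = (i == line.size());
        const char c = at_end ? '\0' : line[i];

        if (!at_end && c == '\\' && i + 1 < line.size())
        {
            // Escaped character: take the following character literally.
            ++i;
            if (column == wanted[next]) field += line[i];
        }
        else if (!at_end && c == '"')
        {
            in_quotes = !in_quotes;   // quotes delimit; they are not copied
        }
        else if (at_end || (c == ',' && !in_quotes))
        {
            // End of a field: keep it only if it was requested.
            while (next < wanted.size() && wanted[next] == column)
            {
                out.push_back(field);
                ++next;
            }
            field.clear();
            ++column;
        }
        else if (column == wanted[next])
        {
            field += c;               // only wanted fields are copied
        }
    }
    return out;
}
```

For the 1st, 7th and 22nd fields the call would be pick_fields(line, {0, 6, 21}). Whether this beats the tokenizer on a 100 MB file is something to measure rather than assume.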
> I have a file with 30 comma separated data elements per line ...
> .. pull out the 1st, 7th and 22nd token

Could be slightly faster if you pull the 1st and 7th token iterating forwards through the string, and then pull the 22nd token by iterating through the string in reverse.

#include <string>
#include <boost/tokenizer.hpp>
#include <iostream>

int main()
{
    std::string str = "ab, c, ++++2++++, defg, hi, jkmn, o, p, qrs, tu, v, ----2----, wx, yz" ;
    boost::char_separator<char> fn( ", \t" ) ;

    // get the token two places from the beginning
    boost::tokenizer< boost::char_separator<char> > toker( str, fn) ;
    auto iter = toker.begin() ;
    std::cout << *++++iter << '\n' ; // ++++2++++

    // get the token two places before the end
    boost::tokenizer< boost::char_separator<char>,
                      std::string::const_reverse_iterator > rev_toker( str.rbegin(), str.rend(), fn ) ;
    auto rev_iter = rev_toker.begin() ;
    std::cout << *++++rev_iter << '\n' ; // ----2----
}
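One caveat: as far as I can tell, a tokenizer built over reverse iterators assembles each token from the characters in the order it visits them, so a token comes out spelled backwards. The example above does not show this because both printed tokens are palindromes. A non-palindromic token would need reversing, for instance with std::reverse.

The same idea can be sketched with the standard library alone. This is an illustration, not the Boost approach: field_from_end is a hypothetical helper, and it assumes a single separator character with no quoting or escaping, so it is only safe on data known to be free of quoted commas.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the field that lies `k` places before the
// last one (k == 0 is the last field), scanning backwards from the end.
// Assumes plain separators with no quoting or escaping. Returns an empty
// string if the line has too few fields.
std::string field_from_end(const std::string& line, std::size_t k, char sep = ',')
{
    std::size_t end = line.size();   // one past the end of the current field
    for (;;)
    {
        // Find the separator that precedes the current field, if any.
        const std::size_t pos = (end == 0) ? std::string::npos
                                           : line.rfind(sep, end - 1);
        const std::size_t begin = (pos == std::string::npos) ? 0 : pos + 1;

        if (k == 0) return line.substr(begin, end - begin);
        if (pos == std::string::npos) return std::string();  // ran out of fields

        end = pos;   // continue with the field to the left of the separator
        --k;
    }
}
```

With 30 fields per line, the 22nd field is 8 places before the last, so it would be field_from_end(line, 8). The characters come back in their normal order because substr copies forwards.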
Thank you both for your replies.

hamsterman: I agree with you that all the values need to be parsed in order to know the position. If the parsing doesn't happen until you iterate, which appears to be the case, I am stuck iterating.

JLBorges: That is actually a great idea; I don't know why I didn't think of it myself.