I've been working on learning more about handling CSV files. There is unfortunately not a whole lot of
good stuff on handling the (deceptively simple) beasts properly.
I've played with a number of non-commercial-grade solutions, even Boost Tokenizer and Boost Spirit, but my own DFA parser beats Boost's stuff hands down.
(I've loaded a 1,000,000-record file with 10 fields per record into a
deque<deque<string>> in 19 seconds, which I think is still an awfully long time... though a significant portion of that time is due to the STL shuffling the vector/deque/whatever around.)
The issue with handling any CSV is that it isn't really line-based, and certain variations are not compatible with one another. The best CSV reference is found here:
http://creativyst.com/Doc/Articles/CSV/CSV01.htm
And, of course, there's the completely worthless RFC 4180:
http://tools.ietf.org/html/rfc4180
Q1: What tools do you use, professionally?
Which ones are of sufficient quality and use to recommend?
http://www.csvreader.com/
https://code.google.com/p/fast-cpp-csv-parser/
OCX controls to manipulate Excel
StrTk's DSV Filter library (http://www.partow.net/programming/dsvfilter/index.html)
Others?
I'm not really looking for the usual rinky-dink "this works iff you can do without X" answers, like the "use getline()" responses.
I'm looking for a good, full-featured CSV reader that can, at the minimum, handle:
- quoted fields with embedded newlines
- optionally handle c-escapes like \"
- not barf on mixed fields, like the Excel ="00123" hack
It would be nice if it could also handle I/O in parts (for huge files [>2MB], which seems to be a common problem, judging by Google results).
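To make the requirements above concrete, here is a minimal sketch of a DFA-style reader in C++17. It is an illustration only, not the reader described in this post: the names (parse_csv, Record, the c_escapes flag) are made up, it parses a whole in-memory string rather than reading in parts, and the way it treats a quote in the middle of an unquoted field (the ="00123" case) is just one possible choice.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; a sketch, not a production reader.
using Record = std::vector<std::string>;

std::vector<Record> parse_csv(const std::string& text, bool c_escapes = false)
{
    enum class State { FieldStart, Unquoted, Quoted, QuoteInQuoted };

    std::vector<Record> records;
    Record record;
    std::string field;
    State state = State::FieldStart;

    // Close the current field and get ready for the next one.
    auto end_field = [&] {
        record.push_back(std::move(field));
        field.clear();
        state = State::FieldStart;
    };
    // Close the current field and the record that contains it.
    auto end_record = [&] {
        end_field();
        records.push_back(std::move(record));
        record.clear();
    };

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        switch (state) {
        case State::FieldStart:
            if (c == '"') { state = State::Quoted; break; }
            state = State::Unquoted;
            [[fallthrough]];
        case State::Unquoted:
            if (c == ',')       end_field();
            else if (c == '\n') end_record();
            else if (c == '"')  state = State::Quoted;   // lenient: ="00123" yields =00123
            else if (c != '\r') field += c;              // simplification: drop CR outside quotes
            break;
        case State::Quoted:
            // Optional C-style escape: take the next character literally.
            // A fuller version would also translate \n, \t and so on.
            if (c_escapes && c == '\\' && i + 1 < text.size()) field += text[++i];
            else if (c == '"') state = State::QuoteInQuoted;
            else field += c;                             // commas and newlines are data here
            break;
        case State::QuoteInQuoted:
            if (c == '"')       { field += '"'; state = State::Quoted; }  // "" is a literal quote
            else if (c == ',')  end_field();
            else if (c == '\n') end_record();
            else if (c != '\r') { field += c; state = State::Unquoted; }  // lenient: junk after quote
            break;
        }
    }
    // Flush a final record that is not terminated by a newline.
    if (state != State::FieldStart || !record.empty()) end_record();
    return records;
}
```

Calling parse_csv on a buffer whose first record contains a quoted field with an embedded newline returns that newline as part of the field instead of splitting the record. Passing true as the second argument turns on the backslash handling.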
My own reader does all this, but I can't believe I've written a superior solution to what professionals use... (Not at 19 seconds. If it took less than a couple of seconds, maybe.) But I can't find anything that is both reasonably fast and complete.
Or do you professionals only use a subset of these features at work?
Q2: How often do you actually need to modify CSV data?
In my own (exceedingly limited) experience, CSV files are generated by tail processes, and the need to actually modify a record is less common than the need to simply access a record.
Is this correct? How often do you simply need to read a CSV as part of some data transformation instead of actually having to update the CSV file itself?
It seems to me that some significant speedups could be made by simply referencing the fields instead of making full string copies of them in memory.
Thank you for reading and for your time. What you teach me here will go into the FAQ.
http://www.cplusplus.com/faq/sequences/strings/csv/
PS Q3: Should I add Arash Partow's StrTk to the split FAQ?
Does anyone use it?
http://www.partow.net/programming/strtk/index.html
http://www.cplusplus.com/faq/sequences/strings/split/