Accessing CSV files in practice -- read only?

I've been working on learning more about handling CSV files. There is unfortunately not a whole lot of good material out there on handling these (deceptively simple) beasts properly.

I've played with a number of non-commercial-grade solutions, including Boost Tokenizer and Boost Spirit, but my own DFA parser beats Boost's stuff hands down.

(I've loaded a 1,000,000-record, 10-fields-per-record file into a deque<deque<string>> in 19 seconds, which I think is still an awfully long time... though a significant portion of that time is due to the STL playing with the vector/deque/whatever.)

The issue with handling any CSV is that it isn't really line-based, and the various dialects are not all compatible with one another. The best CSV reference is found here:
http://creativyst.com/Doc/Articles/CSV/CSV01.htm

And, of course, there's the completely worthless RFC 4180:
http://tools.ietf.org/html/rfc4180
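
To make the "not line-based" problem concrete, here is one record of three fields that spans two physical lines, because the newline sits inside a quoted field:

1,"Doe, Jane","first line
second line"

Any reader built around getline() splits that record in half.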


Q1: What tools do you use, professionally?
Which ones are of sufficient quality and usefulness to recommend?

http://www.csvreader.com/
https://code.google.com/p/fast-cpp-csv-parser/
OCX controls to manipulate Excel
StrTk's DSV Filter library (http://www.partow.net/programming/dsvfilter/index.html)
Others?

I'm not really looking for the usual rinky-dink "this works if you can live without X" solutions, like the "just use getline()" responses.

I'm looking for a good, full-featured CSV reader that can, at the minimum, handle:
  - quoted fields with embedded newlines
  - optionally handle C-escapes like \"
  - not barf on mixed fields, like the Excel ="00123" hack

It would be nice if it could also handle I/O incrementally (for huge files [>2MB], which seems to be a common complaint on Google).
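
For reference, here is a minimal sketch of the kind of DFA-driven reader I mean (hypothetical, not my actual code): it handles quoted fields, embedded newlines, and the standard "" doubling, but none of the C-escape or ="..." extensions above. The point is that the machine consumes characters, not lines:

#include <istream>
#include <string>
#include <vector>

// Minimal DFA sketch: quoted fields, embedded newlines, "" -> ".
// Deliberately omits C-escapes and the ="00123" form.
std::vector<std::string> read_record(std::istream& in, bool& ok)
{
    enum State { START_FIELD, UNQUOTED, QUOTED, QUOTE_IN_QUOTED };
    std::vector<std::string> record;
    std::string field;
    State state = START_FIELD;
    char c;
    ok = false;
    while (in.get(c))
    {
        if (c == '\r') continue;  // crude CRLF handling; a real reader does better

        if (state == QUOTED)
        {
            if (c == '"') state = QUOTE_IN_QUOTED;
            else          field += c;           // embedded newlines land here
            continue;
        }
        if (state == QUOTE_IN_QUOTED && c == '"')
        {
            field += '"';                       // "" inside quotes -> literal "
            state = QUOTED;
            continue;
        }
        // START_FIELD, UNQUOTED, or just fell out of a quoted field:
        if (c == ',')
        {
            record.push_back(field);
            field.clear();
            state = START_FIELD;
        }
        else if (c == '\n')
        {
            record.push_back(field);
            ok = true;
            return record;                      // end of record, not merely end of line
        }
        else if (c == '"' && state == START_FIELD)
            state = QUOTED;
        else
        {
            field += c;                         // lenient with sloppy input
            state = UNQUOTED;
        }
    }
    if (!field.empty() || !record.empty())      // last record lacked a trailing newline
    {
        record.push_back(field);
        ok = true;
    }
    return record;
}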

My own reader does all this, but I can't believe I've written a superior solution to what professionals use... (not at 19 seconds, anyway; if it were under a couple of seconds, maybe). Yet I can't find anything that is both reasonably fast and complete.

Or do you professionals only use a subset of these features at work?


Q2: How often do you actually need to modify CSV data?
In my own (exceedingly limited) experience, CSV files are generated at the tail end of some process, and the need to actually modify a record is less common than the need to simply access one.

Is this correct? How often do you simply need to read a CSV as part of some data transformation instead of actually having to update the CSV file itself?

It seems to me that some significant speedups could be made by simply referencing the fields instead of making full string copies of them in memory.
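
Something along these lines is what I have in mind (a hypothetical sketch; the names are made up): keep the raw text in one buffer and store (offset, length) pairs per field, copying only on access. Fields containing "" or escapes would still need a fix-up pass when materialized.

#include <cstddef>
#include <string>
#include <vector>

struct field_ref
{
    std::size_t offset;
    std::size_t length;
};

struct csv_index
{
    std::string buffer;                           // entire file contents
    std::vector<std::vector<field_ref>> records;  // field spans per record

    std::string field(std::size_t r, std::size_t f) const
    {
        const field_ref& fr = records[r][f];
        return buffer.substr(fr.offset, fr.length); // copy only on demand
    }
};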


Thank you for reading and for your time. What you teach me here will go into the FAQ. http://www.cplusplus.com/faq/sequences/strings/csv/


PS Q3: Should I add Arash Partow's StrTk to the split FAQ?
Does anyone use it?
http://www.partow.net/programming/strtk/index.html
http://www.cplusplus.com/faq/sequences/strings/split/
I don't have professional answers for Q1 and Q3, but for Q2 (how often do you actually need to modify CSV data):

I run environments where we use CSV as data files. The data is generated once and distributed along with the binaries; runtime access is strictly read-only. Writing is done only once, by the developer or integrator, so there is no constraint on write times. An optimization that reduces your load time from 19 s to 16 s in write mode would have no significant effect on our workflow and no effect on our runtime. If you can improve times in read mode only, that would affect our runtime, and that would be quite valuable.
Thank you for that! That is what I was thinking. I am curious as to whether that holds true in general for other professionals as well.

If you don't mind my asking, what tool do you use to access the CSV data? Is it an extant library or an in-house tool?

Are there any constraints on your CSV data that your tool exploits? (Like, can your tools assume that no record contains an embedded newline?)

Thanks again!
I would still like more feedback from people who actually use this in professional practice, if possible.


I figured out what was killing my time. For some reason (I'm still not sure why), the field-accumulator string's capacity is being reset to zero every time I add a field to a record. I've got to study it more, but that is what was killing performance. If I switch to straight-up character arrays with indices into them, performance improves to about 5 seconds for the 1,000,000-line CSV file, which is more in line with what I was looking for (and easily comparable to libcsv's time). I'll just have to tweak a couple of things once I figure out what I'm going to do with the field strings.
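
For the curious, here is a boiled-down illustration (hypothetical, not my actual parser) of one common way this happens, plus the flat-buffer-with-indices shape that fixed it for me:

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Moving the accumulator into the record hands its buffer over, so the
// next field has to re-grow the accumulator from scratch, one
// reallocation at a time.
void add_field(std::vector<std::string>& record, std::string& field)
{
    record.push_back(std::move(field)); // buffer moves out of 'field'
    field.clear();                      // 'field' is valid but its old buffer is gone
}

// The flat-buffer alternative: one character buffer that grows once and
// stays grown, plus an index marking where each field ends.
struct flat_record
{
    std::string text;                // all field bytes, back to back
    std::vector<std::size_t> ends;   // one-past-the-end offset of each field

    void append_field(const char* s, std::size_t n)
    {
        text.append(s, n);
        ends.push_back(text.size());
    }
};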


I plan to recommend the libcsv package in the FAQ, but, excellent as it is, it still cannot handle a couple of important CSV caveats, notably C-escape sequences (like \n and \EOL).
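
For anyone tempted to bolt that on afterwards, here is a sketch of a per-field post-pass (standalone and hypothetical; nothing to do with libcsv's own API). Note that an escaped end-of-line cannot be fixed up this way: by the time you have the field, the record boundary has already been decided, which is why escape handling really belongs inside the parser's DFA.

#include <string>

// Rewrite the common C-escapes in a field: \n, \t, \", \\.
// Unknown escapes are left verbatim.
std::string unescape_c(const std::string& s)
{
    std::string out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        if (s[i] == '\\' && i + 1 < s.size())
        {
            switch (s[++i])
            {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            default:   out += '\\'; out += s[i]; break; // keep unknown escapes as-is
            }
        }
        else
            out += s[i];
    }
    return out;
}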

I'll also point out Partow's package, but his CSV/DSV stuff is line-based, which is even more restrictive.

I'm shocked that a general-purpose CSV parser in C/C++ isn't readily available. I hacked up mine in a day (most of which was spent drawing a DFA) and it correctly handles all the cases in the Creativyst document (linked above). Is this normal?