Can't detect newline while parsing CSV

I am using a CSV file as input. The file lists features of seals, with 23 values (columns) per entry. When I output rawdata[22], I expect to see the last entry of the first set of data. Instead, I see the last entry (Petitioned) followed by the first entry (2055) of the next seal. When I open the file in a hex editor, the two words are separated by a "." and the hex value is 0a. I have tried \r, \n, and \r\n as delimiters, but none of them work. I cannot use "." as a delimiter because it appears inside strings; I tried it anyway to see if it would help, and it didn't. How do I separate these values?

OUTPUT:
Petitioned
2055

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string line;
    vector<string> rawdata;
    ifstream file ( "/Users/darla/Desktop/Programs/seals.csv" );
    if ( file.good() )
    {
        while(getline(file, line, '"')) {
            stringstream ss(line);
            while (getline(ss, line, ',')) {
                rawdata.push_back(line);
            }
            if (getline(file, line, '"')) {
                rawdata.push_back(line);
            }
        }
    }
    cout << rawdata[22] << endl;
    
    
    return 0;
}
Can you show us some lines of the input file?
It would be helpful to see a small sample of your input file.

Also you may want to read an entire line at a time then use the stringstream to process that line.
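
To illustrate the suggestion: in the posted code, `getline(file, line, '"')` uses the quotation mark as the delimiter, so the newline (0a) is treated as an ordinary character, and the inner loop splits only on commas. That is how "Petitioned" and "2055" end up in the same element. A minimal sketch of reading line by line first is below; `readRows` is a hypothetical helper name, and quoted fields are deliberately not handled here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split the input into rows first, then split each row on commas.
// Quoted fields containing commas are NOT handled; this only shows the newline handling.
std::vector<std::vector<std::string>> readRows(std::istream& in)
{
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line))          // default delimiter is '\n', so rows stay separate
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                // tolerate Windows-style \r\n line endings
        std::vector<std::string> fields;
        std::istringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ','))
            fields.push_back(field);
        rows.push_back(fields);
    }
    return rows;
}
```

With this layout the last header value is `rows[0][22]` and the first ID is `rows[1][0]`, rather than both landing in one flat element.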


SpeciesID,Kingdom,Phylum,Class,Order,Family,Genus,Species,Authority,Infraspecific rank,Infraspecific name,Infraspecific authority,Stock/subpopulation,Synonyms,Common names (Eng),Common names (Fre),Common names (Spa),Red List status,Red List criteria,Red List criteria version,Year assessed,Population trend,Petitioned
2055,ANIMALIA,CHORDATA,MAMMALIA,CARNIVORA,OTARIIDAE,Arctocephalus,australis,"(Zimmermann, 1783)",,,,,Arctophoca australis,South American Fur Seal,Otarie  fourrure Australe,Oso Marino Austral,LC,,3.1,2016,increasing,N
41664,ANIMALIA,CHORDATA,MAMMALIA,CARNIVORA,OTARIIDAE,Arctocephalus,forsteri,"(Lesson, 1828)",,,,,Arctocephalus australis subspecies forsteri|Arctophoca australis subspecies forsteri,"New Zealand Fur Seal, Antipodean Fur Seal, Australasian Fur Seal, Black Fur Seal, Long-nosed Fur Seal, South Australian Fur Seal",,,LC,,3.1,2015,increasing,N


That is the first ~69 entries
Please edit your "file" and place the data in code tags so that any whitespace is retained.

Thanks for the code tags.

What do you mean by ~69 entries?

This looks like a single header line followed by two separate species (two lines). I really suggest you create a class or structure that holds the individual fields of a specific species, possibly using the header names as the structure field names, e.g.:
struct Specie
{
   long ID;
   std::string Kingdom;
   std::string Phylum;
   std::string Class;
   std::string Order;
...
};

Now to read the lines you could have something like:

   std::vector<Specie> species;
   std::string line;
   getline(yourInputStream, line);  // Read and discard header line.
   while(getline(yourInputStream, line))
   {
      Specie temp;
      std::stringstream sin(line);
      sin >> temp.ID;
      sin.ignore();                    // Skip the comma that follows the ID.
      getline(sin, temp.Kingdom, ',');
      getline(sin, temp.Phylum, ',');
...


Now to the fields that contain special characters like quotation marks and parentheses. The quotation marks may signify (there is not enough file content to be absolutely certain) that the quoted text holds a vector of some sort. The parentheses seem to hint that the item has multiple parts, i.e. a name and a year. So for "Authority" you may need a vector of a structure (named Authority) that contains the name and the year.
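
As a rough sketch of that idea, assuming the field always looks like `(Name, Year)` as in the two sample rows: the `Authority` layout and the `parseAuthority` helper below are hypothetical, not something the file format dictates.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical structure for the "Authority" column, e.g. "(Zimmermann, 1783)".
struct Authority
{
    std::string name;
    int year = 0;   // 0 means "no year found"
};

// Assumption: the year follows the LAST comma in the field.
Authority parseAuthority(std::string text)
{
    Authority result;

    // Strip surrounding quotation marks, parentheses and spaces.
    const std::string strip = "\"() ";
    const auto first = text.find_first_not_of(strip);
    if (first == std::string::npos)
        return result;                       // blank field
    const auto last = text.find_last_not_of(strip);
    text = text.substr(first, last - first + 1);

    const auto comma = text.rfind(',');
    if (comma == std::string::npos)
    {
        result.name = text;                  // no year part
        return result;
    }

    result.name = text.substr(0, comma);
    try
    {
        result.year = std::stoi(text.substr(comma + 1));
    }
    catch (const std::exception&)
    {
        result.name = text;                  // not a number: keep the whole text as the name
    }
    return result;
}
```

If a field turns out to hold several authorities, the same helper could be applied to each part and the results stored in a vector.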

Later in the line it appears that you have a vector of different names for the Specie.

And don't forget that you have "blank" fields, signified by a comma followed by another comma.
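
One detail worth knowing about blank fields: a `getline(ss, field, ',')` loop returns empty strings for `,,` in the middle of a line, but it silently drops an empty field at the very end of a line (a trailing comma). The sample rows end in "N", so they may not be affected, but other rows could be. A minimal sketch that keeps every field, using a hypothetical helper named `splitKeepEmpty` (quoted fields are not handled here):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split on a delimiter and keep ALL empty fields,
// including one after a trailing delimiter. Quotes are not interpreted.
std::vector<std::string> splitKeepEmpty(const std::string& line, char delim = ',')
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true)
    {
        const auto pos = line.find(delim, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start));   // last field, possibly empty
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

The same helper could split the Synonyms column, which in the sample uses `|` between names: `splitKeepEmpty(synonyms, '|')`.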

I was planning on using structures eventually, but first I have to get the input parsed by comma delimiters while preserving quoted entries that may contain commas. I have made changes, and the newline issue is now resolved. The only remaining problem is that my program does not preserve the quoted material.

int main() {
    string line;
    vector<string> rawdata;
    ifstream file ( "/Users/mewtwo/Desktop/Programs/seals.csv" );
    while(getline(file, line)) {
        stringstream ss(line);
        while (getline(ss, line, ',')) {
            rawdata.push_back(line);
        }

    }
    cout << rawdata[55] << endl;
    
    
    return 0;
}
OK, I was given two options to solve this problem. I am posting them in case anyone else stumbles upon this with the same question.
https://stackoverflow.com/questions/48085842/cant-detect-newline-while-parsing-csv/48086181?noredirect=1#comment83148149_48086181

1.
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string line;
    vector<string> rawdata;
    ifstream file ( "/Users/darla/Desktop/Programs/seals.csv" );
    while(getline(file, line)) {
        stringstream ss(line);
        char c;
        bool insideQuotes = false;
        std::string currentField;
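        while(ss.get(c)) {
            if(c == '"') {
                insideQuotes = !insideQuotes;  // toggle state; the quote itself is kept in the field
            }
            if(!insideQuotes && c == ',') {
                rawdata.push_back(currentField);
                currentField.clear();
            }
            else {
                currentField += c;
            }
        }
        rawdata.push_back(currentField);  // the last field on a line has no trailing comma
    }
    return 0;
}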


2.
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>

int main()
{
    std::string line;
    std::vector<std::vector<std::string>> lines;
    std::ifstream file("/Users/darla/Desktop/Programs/seals.csv");
    
    if (file)
    {
        while (std::getline(file, line))
        {
            size_t n = lines.size();
            lines.resize(n + 1);
            
            std::istringstream ss(line);
            std::string field, push_field("");
            bool no_quotes = true;
            
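            while (std::getline(ss, field, ','))
            {
                // An odd number of quotes means a quoted field opens or closes in this piece.
                if (static_cast<size_t>(std::count(field.begin(), field.end(), '"')) % 2 != 0)
                    no_quotes = !no_quotes;
                
                push_field += field;
                
                if (no_quotes)
                {
                    lines[n].push_back(push_field);
                    push_field.resize(0);
                }
                else
                {
                    // Restore the comma that getline consumed inside the quoted field.
                    push_field += ',';
                }
            }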
        }
    }
    
    for (auto line : lines)
    {
        for (auto field : line)
        {
            std::cout << "| " << field << " |";
        }
        
        std::cout << std::endl << std::endl;
    }
    
    return 0;
}


Both of these options work well; the second one works slightly better for my purposes.
In both cases you have a rudimentary state machine. Full-state CSV parsers are not your friend, because the grammar is not your friend. That is why.

Those both look fine. Use the one that works.
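
For anyone who wants the state machine spelled out a bit more, here is a minimal sketch that also strips the surrounding quotes and treats a doubled quote (`""`) inside a quoted field as a literal quote. The doubled-quote rule is an assumption based on common CSV conventions; nothing in the sample file shows it. `parseCsvLine` is a hypothetical name, and quoted fields that span several lines are not handled.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one CSV line into fields with a two-state machine.
// Assumptions: ',' separates fields, '"' quotes them, "" inside quotes is a literal quote.
std::vector<std::string> parseCsvLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;
    bool insideQuotes = false;

    for (std::size_t i = 0; i < line.size(); ++i)
    {
        const char c = line[i];
        if (insideQuotes)
        {
            if (c == '"')
            {
                if (i + 1 < line.size() && line[i + 1] == '"')
                {
                    current += '"';        // doubled quote: keep one literal quote
                    ++i;
                }
                else
                {
                    insideQuotes = false;  // closing quote, not stored
                }
            }
            else
            {
                current += c;              // commas are ordinary characters here
            }
        }
        else if (c == '"')
        {
            insideQuotes = true;           // opening quote, not stored
        }
        else if (c == ',')
        {
            fields.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    fields.push_back(current);             // last field has no trailing comma
    return fields;
}
```

On a piece of the sample data such as `australis,"(Zimmermann, 1783)",,N` this gives four fields, with the second one being `(Zimmermann, 1783)` without the quotation marks.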
boost::tokenizer with boost::escaped_list_separator
http://www.boost.org/doc/libs/1_66_0/libs/tokenizer/escaped_list_separator.htm

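#include <boost/tokenizer.hpp>
#include <iterator>
#include <string>
#include <vector>

std::vector<std::string> split_csv( const std::string& line )
{
    // escaped_list_separator defaults: ',' separates, '"' quotes, '\' escapes
    const boost::tokenizer< boost::escaped_list_separator<char> > toker(line) ;
    return { std::begin(toker), std::end(toker) } ;
}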

http://coliru.stacked-crooked.com/a/2d97109f28ca51fb