Splitting a big CSV file into multiple CSV files

I have a big CSV file (1 million lines) in this format:
Product Name	Order Date 	Ship Date
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/10/2011	1/17/2011
AAA	1/10/2011	1/17/2011
AAA	1/10/2011	1/17/2011
AAA	1/10/2011	1/17/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/10/2011	1/17/2011
BBB	1/10/2011	1/17/2011
BBB	1/10/2011	1/17/2011
BBB	1/10/2011	1/17/2011


I'd like to read in this file and create multiple output files:
AAA.CSV
BBB.CSV

where each CSV file has only the information for that product.

1) I am not sure how to read in the CSV format easily
Someone asked about this on the forum a while ago.

http://www.cplusplus.com/forum/general/13087/
Your example file is not in csv format!

CSV = comma separated values, and I don't see any commas.

While you may validly argue that a CSV file must only have comma-separated fields, in practice people use CSV as a blanket term for any field separator -- most commonly commas, colons, semicolons, and (as the OP has used) tabs. So it would be valid, if inexact, to call a tab-separated values file a CSV.
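If you don't know up front which of those separators a file uses, one rough heuristic is to count each candidate in the header line and take the most frequent. This is only a sketch: guess_separator and its candidate list are made up for illustration, and it will be fooled by headers that contain the separator character inside a field.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): pick the candidate separator
// that occurs most often in the header line. Ties go to the earlier
// candidate; if no candidate occurs at all, fall back to a tab.
char guess_separator( const std::string& header_line )
  {
  const std::string candidates = "\t,;:";
  char        best       = '\t';
  std::size_t best_count = 0;
  for (char c : candidates)
    {
    std::size_t n = static_cast <std::size_t> (
      std::count( header_line.begin(), header_line.end(), c ) );
    if (n > best_count)
      {
      best       = c;
      best_count = n;
      }
    }
  return best;
  }
```

For the OP's header line (fields separated by tabs) this should pick the tab.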

Since the OP's file structure is very straightforward, he doesn't actually have to parse much, just check against changes in the first value on the line.

I'm interested, so I'll post back by tomorrow.
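In the meantime, here is a minimal sketch of the "watch for changes in the first value" idea as a dry run: it only reports which runs of records it finds, without writing any files. The names first_field and list_runs are hypothetical, and it assumes the first line is a header and that blank lines can be skipped.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on a string
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: everything before the first separator on the line
// (the whole line if it contains no separator).
std::string first_field( const std::string& line, char separator )
  {
  return line.substr( 0, line.find( separator ) );
  }

// Hypothetical dry run: skip the header line, then return one
// (name, record count) pair per run of consecutive lines that share
// the same first field.
std::vector <std::pair <std::string, std::size_t> >
list_runs( std::istream& in, char separator )
  {
  std::vector <std::pair <std::string, std::size_t> > runs;
  std::string line;
  std::getline( in, line );  // header, ignored here
  while (std::getline( in, line ))
    {
    if (line.empty()) continue;  // skip blank lines
    std::string name = first_field( line, separator );
    if (runs.empty() || runs.back().first != name)
      runs.push_back( std::make_pair( name, std::size_t( 0 ) ) );
    ++runs.back().second;
    }
  return runs;
  }
```

Each pair corresponds to one output file that a splitter would start. If a name shows up in more than one pair, the input is not grouped by product.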
Tabs? I guess my browser must be lying to me, as I can't see any spaces between the dates on lines 12-15. Which is why I thought it was a fixed-width column format.

(Do you see spaces between the dates?)
Yes. What browser/OS are you using, BTW? (I'm using FF 3.6/Win XP.)

In any case, if you are ever unsure, you can always copy and paste the text, and any tabs will come with it.
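If you would rather check from code than trust what a browser shows, a small helper can make the tabs visible. This is just a sketch; show_tabs is a made-up name, and it only handles tabs, not other invisible characters.

```cpp
#include <string>

// Hypothetical helper: return a copy of line in which every tab character
// is replaced by the two visible characters backslash and 't'.
std::string show_tabs( const std::string& line )
  {
  std::string result;
  for (char c : line)
    {
    if (c == '\t') result += "\\t";
    else           result += c;
    }
  return result;
  }
```

Printing show_tabs( line ) for a pasted record shows exactly where the field boundaries are.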
OK, so here you go. The trick is that you only need to read the first field of each record. Everything else is just a string. :-)

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

//----------------------------------------------------------------------------
// Return an uppercase copy of s (used for case-insensitive comparisons)
string toupper( const string& s )
  {
  string result( s.length(), '\0' );
  // Go through unsigned char: passing a negative char to std::toupper is undefined
  transform( s.begin(), s.end(), result.begin(),
             []( unsigned char c ) { return static_cast <char> ( std::toupper( c ) ); } );
  return result;
  }

//----------------------------------------------------------------------------
// Parses the command line and opens the input file
struct arguments
  {
  string   exename;
  string   filename;
  ifstream f;
  string   filepath;
  string   fileext;
  char     separator;
  bool     case_sensitive;
  bool     ok;
  
  arguments( char* exename, char** arg1, char** argN ):
    exename        ( exename ),
    separator      ( '\t'    ),
    case_sensitive ( false   ),
    ok             ( false   )
    {
    if (argN - arg1 <= 3)
      while (arg1 != argN)
        {
        string s( *arg1++ );
        string S = toupper( s );

        if (S.find( "/S=" ) == 0)
          switch (S.length())
            {
            // Each case needs its own break; without it every /s= option
            // fell through to the error case
            case 3:  separator = ' ';    break;
            case 4:  separator = s[ 3 ]; break;
            default: ok = false; return;  //error
            }

        else if (S == "/C")
          case_sensitive = true;

        else
          ok = init_filename( s          )
            || init_filename( s + ".txt" )
            || init_filename( s + ".csv" );
        }
    }

  // Try to open s; remember its directory and extension for the output files
  bool init_filename( string s )
    {
    filename   = s;
    size_t n   = s.find_last_of( "/\\" );
    if (n != string::npos)
      {
      filepath = s.substr( 0, n + 1 );
      s        = s.substr(    n + 1 );
      }
    n          = s.find_last_of( "." );
    if (n != string::npos)
      fileext  = s.substr( n );

    f.clear();
    f.open( filename.c_str() );

    return f.is_open();
    }
  };

//----------------------------------------------------------------------------
int main( int argc, char** argv )
  {
  arguments args( argv[ 0 ], argv + 1, argv + argc );

  if (!args.ok)
    {
    if (!args.filename.empty() && !args.f.is_open())
      cerr << "I could not open the file: " << args.filename << "\n\n";

    cerr << "usage:\n  " << args.exename << " [OPTIONS] FILENAME\n\n"
            "Splits the given tab-delimited FILENAME into tab-delimited\n"
            "files named after the first field, where each new file has\n"
            "the same value in all the first fields.\n"
            "Any header is repeated for each file.\n\n"

            "OPTIONS\n"
            "  /s=,  Specifies the separator to be a comma instead of a tab.\n"
            "        Any printable character can be used, like : and ;\n\n"

            "  /c    Specifies that the first fields need to be compared with\n"
            "        case-sensitivity. The default is case-INsensitive matching.\n\n";
    return 1;
    }

  cout << "filename = " << args.filename << endl;
  cout << "separator = '" << args.separator << "'\n";

  string   header;
  string   outfile_basename;
  string   record;
  ofstream f;

  getline( args.f, header );

  // Records for one product are assumed to be contiguous: a product that
  // shows up again later reopens (and so overwrites) its file.
  while (getline( args.f, record ))
    {
    // Skip blank lines (for example, trailing ones at the end of the file)
    if (record.empty()) continue;

    // Only the first field matters; the rest of the record is just a string
    string first_field = record.substr( 0, record.find( args.separator ) );

    bool same_file = f.is_open() && (args.case_sensitive
      ? (first_field == outfile_basename)
      : (toupper( first_field ) == toupper( outfile_basename )));

    // If we need to start a new file
    if (!same_file)
      {
      f.close();
      f.clear();
      outfile_basename = first_field;
      string outname = args.filepath + outfile_basename + args.fileext;
      f.open( outname.c_str() );
      if (!f)
        {
        cerr << "I could not create the file: " << outname << "\n";
        return 1;
        }
      f << header << "\n";
      }

    // Copy the whole record over
    f << record << "\n";
    }

  cout << "done.\n";
  return 0;
  }

That should do it. (Minimally tested.)

Hope this helps.
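One caveat: the program above starts a new output file whenever the first field changes, so it relies on each product's records being contiguous, as they are in the OP's sample. If that is not guaranteed, a possible variant is to collect the records per product first. The sketch below is only an illustration: group_by_first_field is a hypothetical name, it matches names case-sensitively, and it holds everything in memory, which may or may not be acceptable for a file with a million lines.

```cpp
#include <istream>
#include <map>
#include <sstream>   // std::istringstream, handy for trying this on a string
#include <string>

// Hypothetical variant: returns product name -> complete text of that
// product's output file (header line first), grouping records even when
// they are not contiguous in the input.
std::map <std::string, std::string>
group_by_first_field( std::istream& in, char separator )
  {
  std::map <std::string, std::string> files;
  std::string header, line;
  std::getline( in, header );
  while (std::getline( in, line ))
    {
    if (line.empty()) continue;  // skip blank lines
    std::string  name = line.substr( 0, line.find( separator ) );
    std::string& text = files[ name ];
    if (text.empty()) text = header + "\n";  // first record for this product
    text += line + "\n";
    }
  return files;
  }
```

Each map entry can then be written out to a file named after its key, in the same way the program above names its output files.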
Duaos - I was using IE8 (on XP) when I had the problem. It looks much clearer in Firefox and Opera... which I use interchangeably with IE. Well, now I know which one to avoid when visiting cplusplus.com!
Ah, I see what you mean, it looks weird like that. IIRC (now that I think about it) IE has always been a little off when handling TABs in the display. Works nicely in Safari too...

I don't have IE9. I wonder if that works better? (Supposedly the underlying engine is changed somewhere in there, but I don't remember where, whether it was before 8 or after...)
IE9 is just as bad :-(

(this time I'm using my laptop with Vista and IE9)