Splitting Big CSV file into multiple CSV files

I have a big CSV file (about 1 million lines) in this format:

Product Name	Order Date 	Ship Date
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/3/2011	1/10/2011
AAA	1/10/2011	1/17/2011
AAA	1/10/2011	1/17/2011
AAA	1/10/2011	1/17/2011
AAA	1/10/2011	1/17/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/3/2011	1/10/2011
BBB	1/10/2011	1/17/2011
BBB	1/10/2011	1/17/2011
BBB	1/10/2011	1/17/2011
BBB	1/10/2011	1/17/2011

I'd like to read in this file and create multiple output files:
AAA.CSV
BBB.CSV

Where each CSV file has only the information for that product.

1) I am not sure how to read in the CSV format easily

h9uest (157)

Someone asked about this on the forum a while ago.

http://www.cplusplus.com/forum/general/13087/

andywestken (4094)

Your example file is not in csv format!

CSV = comma separated values, and I don't see any commas.

Duthomhas (13310)

While you may validly argue that a CSV file must only have comma-separated fields, in practice people will say CSV as a blanket term for any field separator -- most commonly commas, colons, semicolons, and (as the OP has used) tabs. So it would be valid, if inexact, to call a tab-separated values file a CSV.
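To make the "any field separator" point concrete, here is a minimal sketch of splitting one record on a caller-chosen separator. The name split_fields is made up for illustration, and the sketch assumes no quoting or escaping inside fields.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one record on a caller-chosen separator.
// Assumption: fields never contain the separator (no quoting/escaping).
std::vector<std::string> split_fields(const std::string& line, char separator)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = line.find(separator, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));  // last (or only) field
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;                           // skip past the separator
    }
    return fields;
}
```

The same function handles tabs, commas, colons, or semicolons just by changing the second argument, e.g. split_fields(line, '\t') for the OP's file.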

Since the OP's file structure is very straightforward, he doesn't actually have to parse much, just check against changes in the first value on the line.
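As a rough sketch of that idea (the names first_field and consecutive_runs are made up for illustration), the following walks a stream and records each run of consecutive lines that share the same first value. It assumes the caller has already skipped any header line and that matching is case-sensitive.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: everything before the first separator
// (the whole line if there is no separator at all).
std::string first_field(const std::string& line, char separator)
{
    return line.substr(0, line.find(separator));
}

// Hypothetical helper: list each run of consecutive lines sharing a first
// field, with the number of lines in that run. Blank lines are skipped.
std::vector<std::pair<std::string, int>>
consecutive_runs(std::istream& in, char separator)
{
    std::vector<std::pair<std::string, int>> runs;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::string key = first_field(line, separator);
        if (runs.empty() || runs.back().first != key)
            runs.emplace_back(key, 0);   // first value changed: new run starts
        ++runs.back().second;
    }
    return runs;
}
```

Each point where a new run starts is where a splitter would close the current output file and open the next one.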

I'm interested, so I'll post back by tomorrow.

andywestken (4094)

Tabs? I guess my browser must be lying to me, as I can't see any spaces between the dates on lines 12-15. Which is why I thought it was a fixed-width column format.

(Do you see spaces between the dates?)

Duthomhas (13310)

Yes. What browser/OS are you using, BTW? (I'm using FF 3.6/Win XP.)

In any case, if you are ever unsure, you can always copy and paste the text, and any tabs will come with it.
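If you'd rather check programmatically than by eye, a rough sketch is to count candidate separators in the header line and pick the most frequent one. The name guess_separator and the candidate list are assumptions for illustration; a heuristic like this can be fooled by data that happens to contain many colons or commas.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: return the candidate character that occurs most often
// in the line, or '\0' if none of the candidates occurs at all.
// Ties go to whichever candidate is listed first.
char guess_separator(const std::string& line,
                     const std::string& candidates = "\t,;:")
{
    char best = '\0';
    std::string::difference_type best_count = 0;
    for (char c : candidates) {
        std::string::difference_type n = std::count(line.begin(), line.end(), c);
        if (n > best_count) {
            best = c;
            best_count = n;
        }
    }
    return best;
}
```

Running it on the first line of the file (the header) is usually safer than running it on a data line, since dates and times add their own punctuation.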

Duthomhas (13310)

OK, so here you go. The trick is that you only need to read the first field of each record. Everything else is just a string. :-)

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

//----------------------------------------------------------------------------
// Return an upper-case copy of s (used for case-insensitive comparisons)
string toupper( const string& s )
  {
  string result( s.length(), '\0' );
  transform( s.begin(), s.end(), result.begin(),
             []( unsigned char c ) { return static_cast <char> ( std::toupper( c ) ); } );
  return result;
  }

//----------------------------------------------------------------------------
// Parses the command line and opens the input file
struct arguments
  {
  string   exename;
  string   filename;
  ifstream f;
  string   filepath;
  string   fileext;
  char     separator;
  bool     case_sensitive;
  bool     ok;
  
  arguments( char* exename, char** arg1, char** argN ):
    exename        ( exename ),
    separator      ( '\t'    ),
    case_sensitive ( false   ),
    ok             ( false   )
    {
    if (argN - arg1 <= 3)
      while (arg1 != argN)
        {
        string s( *arg1++ );
        string S = toupper( s );

        if (S.find( "/S=" ) == 0)
          switch (S.length())
            {
            case 3:  separator = ' ';    break;  // "/s=" alone means a space
            case 4:  separator = s[ 3 ]; break;
            default: ok = false; return;         // error: not a single character
            }

        else if (S == "/C")
          case_sensitive = true;

        else
          ok = init_filename( s          )
            || init_filename( s + ".txt" )
            || init_filename( s + ".csv" );
        }
    }

  // Remember the path and extension of s, then try to open it
  bool init_filename( string s )
    {
    filename   = s;
    filepath.clear();
    fileext.clear();
    size_t n   = s.find_last_of( "/\\" );
    if (n != string::npos)
      {
      filepath = s.substr( 0, n + 1 );
      s        = s.substr(    n + 1 );
      }
    n          = s.find_last_of( "." );
    if (n != string::npos)
      fileext  = s.substr( n );

    f.clear();
    f.open( filename.c_str() );

    return f.is_open();
    }
  };

//----------------------------------------------------------------------------
int main( int argc, char** argv )
  {
  arguments args( argv[ 0 ], argv + 1, argv + argc );

  if (!args.ok)
    {
    if (!args.filename.empty() && !args.f.is_open())
      cerr << "I could not open the file: " << args.filename << "\n\n";

    cerr << "usage:\n  " << args.exename << " [OPTIONS] FILENAME\n\n"
            "Splits the given tab-delimited FILENAME into tab-delimited\n"
            "files named after the first field, where each new file has\n"
            "the same value in all the first fields.\n"
            "Any header is repeated for each file.\n\n"

            "OPTIONS\n"
            "  /s=,  Specifies the separator to be a comma instead of a tab.\n"
            "        Any printable character can be used, like : and ;\n\n"

            "  /c    Specifies that the first fields need to be compared with\n"
            "        case-sensitivity. The default is case-INsensitive matching.\n\n";
    return 1;
    }

  cout << "filename = " << args.filename << endl;
  cout << "separator = '" << args.separator << "'\n";

  string   header;
  string   outfile_basename;
  string   first_field;
  string   s;
  ofstream f;

  // The first line is the header; it is repeated at the top of every output file
  getline( args.f, header );

  // Assumes that records with the same first field are contiguous
  // and that every record contains the separator.
  while (getline( args.f, first_field, args.separator ))
    {
    bool same = args.case_sensitive
              ? (first_field == outfile_basename)
              : (toupper( first_field ) == toupper( outfile_basename ));

    // If we need to start a new file
    if (!f.is_open() || !same)
      {
      f.close();
      f.clear();
      outfile_basename = first_field;
      f.open( (args.filepath + outfile_basename + args.fileext).c_str() );
      f << header << "\n";
      }

    // Get the remainder of the record and copy it over
    getline( args.f, s );
    f << first_field << args.separator << s << "\n";
    }

  cout << "done.\n";
  return 0;
  }

That should do it. (Minimally tested.)
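One thing the program relies on is that all the records for a product sit together, as they do in the OP's sample. If that can't be guaranteed, a possible variant (a sketch only, with the made-up name split_ungrouped) keeps one open output stream per first-field value. It assumes case-sensitive matching and a small enough number of distinct products that the OS limit on open files is not hit.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical variant for input that is NOT grouped by first field.
// Writes prefix + <first field> + ext for each distinct first field and
// returns how many distinct first fields were seen.
std::size_t split_ungrouped(std::istream& in, char separator,
                            const std::string& prefix, const std::string& ext)
{
    std::string header, line;
    if (!std::getline(in, header)) return 0;    // empty input: nothing to do

    std::map<std::string, std::ofstream> outputs;
    while (std::getline(in, line)) {
        if (line.empty()) continue;             // skip blank lines
        std::string key = line.substr(0, line.find(separator));
        std::ofstream& out = outputs[key];      // default-constructed if new
        if (!out.is_open()) {
            out.open((prefix + key + ext).c_str());
            out << header << '\n';              // header goes first in each file
        }
        out << line << '\n';
    }
    return outputs.size();                      // streams close on return
}
```

For example, split_ungrouped(input, '\t', "", ".csv") would write AAA.csv and BBB.csv into the current directory, overwriting any existing files with those names.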

Hope this helps.

andywestken (4094)

Duaos - I was using IE8 (on XP) when I had the problem. It looks much clearer in Firefox and Opera... which I use interchangeably with IE. Well, now I know which one to avoid when visiting cplusplus.com!

Duthomhas (13310)

Ah, I see what you mean, it looks weird like that. IIRC (now that I think about it) IE has always been a little off when handling TABs in the display. Works nicely in Safari too...

I don't have IE9. I wonder if that works better? (Supposedly the underlying engine is changed somewhere in there, but I don't remember where, whether it was before 8 or after...)

andywestken (4094)

IE9 is just as bad :-(

(this time I'm using my laptop with Vista and IE9)

Topic archived. No new replies allowed.