Unsolved problem, don't let it sink: Read huge txt's into memory efficiently?


Thanks for showing copy - I was wondering if that might work.

Given this case where the header is in the first 5 rows, can we use getline to read those, then use copy to pull in the rest?


Hi, I think we more or less agree that the istringstream class is really *SLOW*. So yes, to make it fast we need to avoid it and switch to an alternative like the one you suggest, getline from iostream, I believe?

I did more testing myself yesterday, and it *seems* that C++ iostream does "lose out" to C stdio in raw performance when the data grows to, say, 1 million or more. I have no stats to back me up, so I'll wait for rocketboy9000 to come out with another test program that measures C++ iostream vs C stdio :P

In the meantime, a simple strategy would be to use C stdio for the massive read and write I/O, load the data into C++ containers, and then use the C++ algorithms, iterators etc. for our in-memory operations.
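
Something like this is what I have in mind: a rough sketch only, assuming the input is whitespace-separated integers. The helper name read_longs is made up for illustration.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: the reading is done by C stdio (fscanf),
// the storage is a C++ container.
std::vector<long> read_longs(std::FILE* fp) {
    std::vector<long> values;
    long v;
    while (std::fscanf(fp, "%ld", &v) == 1)  // stops at EOF or a bad token
        values.push_back(v);
    return values;
}
```

Once the data sits in the vector, std::sort, std::accumulate and the rest of the algorithms work on it as usual, e.g. read_longs(stdin) followed by std::sort(v.begin(), v.end()).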

The code so far is probably not a very good test. As far as I know, istringstream makes a copy of the string it is given, which takes a non-trivial amount of time given the size of the string. I don't have time to test this right now.

I generally don't have any use for istringstream. Most of the time, such data is going to be in a file, where you should read it directly (stream >> var;) rather than making multiple copies. If it isn't in a file, then it is often a const char* (arguments passed in and such), where it just makes more sense to use C-style functions taking character pointers rather than useless conversions to strings.
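
As an illustration of the const char* case, here is a small sketch that parses integers straight from a C string with strtol, with no std::string or istringstream in between. The function name parse_longs is made up for this example.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: walks a C string and converts each
// whitespace-separated integer in place.
std::vector<long> parse_longs(const char* text) {
    std::vector<long> values;
    char* end = nullptr;
    for (;;) {
        long v = std::strtol(text, &end, 10);
        if (end == text) break;  // nothing was converted: end of input or bad token
        values.push_back(v);
        text = end;              // continue right after the parsed number
    }
    return values;
}
```

For example, parse_longs(argv[1]) handles a command-line argument directly. This sketch does no overflow checking; strtol reports that through errno.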

I have run into a lot of places where people copy a char* into a string just to pass it into another function as a char* with c_str().

"I'll wait for rocketboy9000 to come out with another test program that measures C++ iostream vs C stdio :P"

I'm doing that, actually, but I need an equivalent to the following call:
scanf("%100[^\n]\n",s);
Is there a way to do this using cin >>, or do I have to use getline?
EDIT: I need it to match the last element of the output lines from this Perl script:
do{
	print join " ",(int rand(2**16)-2**15-1,int rand(2**32)-2**31-1,sprintf("%x",(int rand 2**32)),rand(),rand(1000)/rand(1000),chr rand(95)+32,"butter","The rest of this line!!!"),"\n";
}until($x++==100000)

Getline works perfectly, but I'm wondering if there is a more scanf-like way.
So, here are preliminary results:
Given a file data.txt containing 100000 lines like this:
2478 2108891519 1f545fa8 0.980926676044824 0.135189630563477 b butter The rest of this line!!! 

We want to read them into an array of struct record, and then output the records again. The command will be:
time ./a.out <data.txt >out.txt

The cstdio way to do this:
#include <stdio.h>
struct record{
	short i2; long i4; unsigned long x;
	double f2; float f1;
	char c; char s1[10]; char s2[100];
} a[100000];
int main(){
	struct record *p=a, *end;
	/* Field widths leave room for the terminating '\0':
	   s1 takes at most 9 characters, s2 at most 99. */
	while(p-a<100000&&scanf("%hd %ld %lx %f %lf %c %9s %99[^\n]\n",
		&p->i2, &p->i4, &p->x,
		&p->f1,&p->f2,
		&p->c, p->s1, p->s2
	)==8)p++;	/* count only fully parsed records */
	end=p;
	/* write back only the records that were actually read */
	for(p=a;p<end;p++){
		printf("%hd %ld %lx %f %lf %c %s %s\n",
			p->i2, p->i4, p->x,
			p->f1, p->f2,
			p->c, p->s1, p->s2
		);
	}
	return 0;
}

with no optimization:
./a.out < data.txt > out.txt  1.85s user 0.11s system 98% cpu 1.979 total

with -O3:
./a.out < data.txt > out.txt  1.11s user 0.08s system 99% cpu 1.202 total

The iostream way to do this, AFAIK:
#include <iostream>
#include <iomanip>
using namespace std;
struct record{
	short i2; long i4; unsigned long x;
	double f2; float f1;
	char c; char s1[10]; char s2[100];
} a[100000];
int main(){
	record *p=a;
	// Stop when the array is full or a record fails to parse.
	// setw(10) limits s1 to 9 characters plus the terminating '\0'.
	while(p!=a+100000
		&& cin >> p->i2 >> p->i4
			>> hex >> p->x >> dec
			>> p->f1 >> p->f2
			>> p->c
			>> setw(10) >> p->s1
		&& cin.getline(p->s2,100)){
		p++;
	}
	record *end=p;
	// progress marker, sent to stderr so out.txt holds only records
	cerr << "HEY!\n";
	// write back only the records that were actually read
	for(p=a;p!=end;p++){
		cout << " " << p->i2 
			<< " " << p->i4 
			<< " " << hex << p->x << dec
			<< " " << p->f1
			<< " " << p->f2
			<< " " << p->c
			<< " " << p->s1 
			<< " " << p->s2 << "\n";
	}
	return 0;
}

Without optimization:
./a.out < data.txt > out.txt  2.60s user 0.07s system 95% cpu 2.803 total

with -O3:
./a.out < data.txt > out.txt  2.52s user 0.08s system 99% cpu 2.603 total

It is interesting to note how little improvement there is. It may be that method chaining is inherently slow compared to variable-argument functions.

Now your latest test seems to indicate that the Standard C++ istringstream implementation needs some investigation. If istream performance is comparable to C stdio, why isn't istringstream exhibiting the same performance characteristics?

Maybe Bjarne Stroustrup, aka the C++ creator, knows the answer?

About the first part of the file, I just wrote a little code that should be of use.
Maybe it will also be useful to other people who work with ArcGIS files...

#include <cstdlib>
#include <fstream>
#include <string>
#include <iostream>
using namespace std;
int ncols, nrows, nodataValue;
// Named cellsize rather than size: with "using namespace std" a global
// called size can clash with std::size in C++17.
double xllcorner, yllcorner, cellsize;
// scrivi ("write"): copies the 6 header lines to the output file
void scrivi ()
{
	string s;
	cout << "georeferenced data" << "\n";
	ifstream ifs ("D:\\projects\\tesaf\\programmazione\\prova.txt");
	ofstream ofs ("D:\\projects\\tesaf\\programmazione\\demout.txt", ios_base::trunc);
	// stop early if the file has fewer than 6 lines
	for (int i=0; i<6 && getline (ifs, s); i++)
	{
		ofs << s << endl;
	}
	ifs.close ();
	ofs.close ();
}
// ricon: reads each header line as a label followed by its value
void ricon()
{
	string s1;
	ifstream ifs1 ("D:\\projects\\tesaf\\programmazione\\prova.txt");
	ifs1 >> s1 >> ncols;
	ifs1 >> s1 >> nrows;
	ifs1 >> s1 >> xllcorner;
	ifs1 >> s1 >> yllcorner;
	ifs1 >> s1 >> cellsize;
	ifs1 >> s1 >> nodataValue;
	cout << "ncols " << ncols << "\n";
	cout << "nrows " << nrows << "\n";
	cout << "xllcorner " << xllcorner << "\n";
	cout << "yllcorner " << yllcorner << "\n";
	cout << "cellsize " << cellsize << "\n";
	cout << "nodataValue " << nodataValue << "\n";
	ifs1.close ();
}
int main ()
{
	cout << "\astart\n";
	scrivi();
	ricon();
	cout << "processing finished";
	cout << endl << endl;
	system ("pause");	// Windows only: waits for a key press
	return 0;
}