Stream Read Buffer Content

Hi, I'm new to the forum, but I've been referencing this website's forum for help with my program. I have a large data file with 2 million lines, each containing a read/write operation and a hex address, formatted like this: r abcdef123456 . To improve performance, I tried to read blocks of the file into a buffer and then read and process each line from the buffer. However, I haven't had much luck, since the buffer is a character pointer and I would like to parse each line into 2 variables at the same time. My first approach used fread + fscanf, but I was concerned that processing the data while reading the file would slow things down. My second approach uses ifstream (posted below), but getline never seems to enter the while-loop. Here's the code I have (sorry if the formatting is incorrect, I don't know how to use the code brackets option in the forum):

	// Setup variables
	char* fbuffer;
	ifstream ifs("test.txt");
	long int length;
	string op, addr, line;
	clock_t start, end;

	// Start timer + get file length
	start = clock();
	ifs.seekg(0, ifs.end);
	length = ifs.tellg();
	ifs.seekg(0, ifs.beg);

	// Setup buffer to read & store file data
	fbuffer = new char[length];
	ifs.read(fbuffer, length);
	ifs.close();

	// Setup stream buffer
	const int maxline = 20;
	char* lbuffer;
	stringstream ss;

	// Parse buffer data line-by-line
	while(ss.getline(lbuffer, length))
	{
		while(getline(ss, line))
		{
			ss >> op >> addr;
		}
		ss.ignore( strlen(lbuffer));
	}
	end = clock();

	float diff((float)end - (float)start);
	float seconds = diff / CLOCKS_PER_SEC;

	cout << "Run time: " << seconds << " seconds" << endl;

	delete[] fbuffer;
	delete[] lbuffer;

	cout << "Press any key to exit..." << endl;
	cin.get();
	exit(0);


Is there a way to read lines of text from a buffer and parse each one into 2 variables at the same time, without hurting performance? Sorry if this is so long, but I've been looking for a solution for weeks now.
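A note on the snippet above: `ss` is never given the contents of `fbuffer`, and `lbuffer` is never allocated, so the outer `ss.getline(...)` reads from an empty stream (and writes through an unallocated pointer, which is undefined behavior). That would explain why the loop body does not appear to run. A minimal sketch of the in-memory approach, assuming two whitespace-separated fields per line, might look like the following. The names `Access` and `load_accesses` are made up for illustration and are not part of the original code.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: the operation ("r" or "w") and the hex address as text.
struct Access {
    std::string op;
    std::string addr;
};

// Reads everything from `in` into memory, then parses the in-memory copy.
std::vector<Access> load_accesses(std::istream& in)
{
    // Slurp the whole stream into one string; this string is the "buffer".
    std::string buffer((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());

    // Hand the buffer's contents to the string stream. A default-constructed
    // stringstream is empty, so extracting from it fails immediately.
    std::istringstream ss(buffer);

    std::vector<Access> result;
    Access a;
    while (ss >> a.op >> a.addr)   // two whitespace-separated fields per record
        result.push_back(a);
    return result;
}
```

With a file, this could be called as `load_accesses(ifs)` on an open `std::ifstream`. Whether it is actually faster than extracting straight from the file stream is something to measure rather than assume.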
Please edit your post and make sure your code is [code]between code tags[/code] so that it has line numbers and syntax highlighting, as well as proper indentation.

I'm not sure why you think using a buffer will improve performance - the bottleneck will always be the speed of the user's hard drive. The only time the buffer would be useful is if you read the entire file into memory, which isn't a good idea for large files.
Is there an alternative method for reading a large file with 2 million lines in seconds rather than minutes, without the buffer technique? Whenever I searched for suggestions for resolving this issue, fread for block reading or ifstream into buffered memory were frequently mentioned.
The file has plaintext in human-readable format, so you should treat it as such. You can't really buffer it. An alternative method to buffering is: not buffering.
With all due respect, your reply is not really offering any suggestions for resolving this issue, just flat statements about what not to do rather than an alternative programming approach. I need to read a file with millions of lines quickly for processing (i.e. in seconds). So if buffering is not an advisable method, then what other technique should I implement to handle this massive file? If your response is simply to reiterate avoiding buffers, and nothing else, please do not reply. I'm reaching out for ideas on how to tackle this issue effectively, not to be lambasted for employing a method one personally feels is inefficient.
There is no way to read a file faster than the hard drive is capable of. You could switch to a solid state drive, but I don't think this is the answer you want.
When I started developing this program, using the following method, I was able to get the file data in a fraction of a second:

http://www.cplusplus.com/reference/istream/istream/read/

Once the data has been captured, how does one retrieve the contents for processing? The processing benchmark time is under 2 mins. (so processing 2 million operations on 2 million addresses in under 2 mins.) Ignoring the machine that's being used to execute this program, what is the fastest way to read in a large file?
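One possible way to pull records out of a block that is already in memory is to walk the characters directly and convert the address with `std::strtoull`. This is only a sketch under some assumptions: the block is held in a `std::string` (so it is NUL-terminated, which `std::strtoull` needs), every record is well formed as `<op> <hex address>`, addresses fit in 64 bits, and no record is split across two blocks. `Record` and `parse_block` are hypothetical names.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical record type: a one-character operation and a numeric address.
struct Record {
    char op;
    std::uint64_t addr;
};

// Parses records of the form "<op> <hex address>" from an in-memory block.
std::vector<Record> parse_block(const std::string& block)
{
    std::vector<Record> out;
    const char* p = block.c_str();            // c_str() guarantees a trailing '\0'
    const char* const end = p + block.size();

    while (p < end) {
        // Skip blanks and line endings before the operation character.
        while (p < end && (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n'))
            ++p;
        if (p >= end)
            break;

        const char op = *p++;

        // strtoull skips leading whitespace and stops at the first non-hex character.
        char* stop = nullptr;
        const unsigned long long value = std::strtoull(p, &stop, 16);
        if (stop == p)
            break;                            // no digits found: stop on malformed input

        out.push_back(Record{op, static_cast<std::uint64_t>(value)});
        p = stop;
    }
    return out;
}
```

A buffer filled with `istream::read` is not NUL-terminated on its own, so it would need to be copied into a `std::string` (or given one extra byte set to `'\0'`) before a function like this is used on it.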
The fastest way is the only way; I'm not sure what other options you think exist.

From what I understand, you want to eventually process the information in the file - so why not process the information as you read it?

See also JLBorges' post below.
To read formatted data, use formatted input.
(Avoid reading each line with std::getline() and then parsing it through a string stream; that is probably what is taking a lot of the time.)

std::clock() measures processor time; to time i/o bound operations, use wall clock time.

Flush the system cache buffer before taking actual measurements (the contents of the file may already be in memory). Bomb proof: reboot, wait for the system to reach a quiescent state, measure.

Something like this:

#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
#include <ctime>

int main()
{
    // formatted input
    std::ifstream file( "test.txt" ) ;

    namespace chrono = std::chrono ;
    const auto start = chrono::steady_clock::now() ;
    const auto startp = std::clock() ;

    std::string op, addr ;
    while( file >> op >> addr ) { /* do stuff */ }

    const auto end = chrono::steady_clock::now() ;
    const auto endp = std::clock() ;

    std::cout << "  elapsed: " << chrono::duration_cast<chrono::milliseconds>( end - start ).count() << " msecs.\n"
              << "processor: " << double( endp - startp ) * 1000 / CLOCKS_PER_SEC << " msecs.\n" ;
}
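If the address is needed as a number rather than as text, formatted input can do that conversion too. A small sketch, assuming the operation is a single character and the address fits in 64 bits (`Op` and `read_ops` are illustrative names, not part of the program above):

```cpp
#include <cstdint>
#include <ios>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <vector>

// Hypothetical record type: operation character plus numeric address.
struct Op {
    char kind;
    std::uint64_t addr;
};

// Reads "<kind> <hex address>" pairs until extraction fails or input ends.
std::vector<Op> read_ops(std::istream& in)
{
    std::vector<Op> ops;
    char kind;
    std::uint64_t addr;

    in >> std::hex;                  // sticky: later integer reads on `in` use base 16
    while (in >> kind >> addr)       // operator>> skips the whitespace between fields
        ops.push_back(Op{kind, addr});
    return ops;
}
```

The function takes any `std::istream`, so it can be fed the `std::ifstream` from the program above or a `std::istringstream` for a quick check. Note that it leaves the stream in hexadecimal mode.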
@JLBorges: That actually worked out perfectly!!! I was so afraid that using ifstream without a '.read' call was going to slow down my program's performance. But I tested it on the largest file, the one with 2 million lines, and it read and captured both variables in 15 secs!!! Thank you so much for your help, I REALLY appreciate it. For some reason std::clock() worked, but I don't know why chrono isn't being detected. I'm using Eclipse CDT 3.8 (which I'm new to, compared with VS2013) to program in C++11. So now I just need to resolve that, and then I can initialize start/end along with startp/endp. :-D
> but I don't know why chrono isn't being detected

What is the text of the diagnostic from the compiler?
It says "Symbol 'chrono' could not be resolved" for std::chrono. I added the #include files that you posted, but that didn't seem to resolve it. I checked through the 'std' options to see if 'chrono' was one of the options, but I couldn't find it.
Try:
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
#include <ctime>

int main()
{
    // formatted input
    std::ifstream file( "test.txt" ) ;

    // namespace chrono = std::chrono ;
    const auto start = std::chrono::steady_clock::now() ;
    const auto startp = std::clock() ;

    std::string op, addr ;
    while( file >> op >> addr ) { /* do stuff */ }

    // const auto end = chrono::steady_clock::now() ;
    const auto end = std::chrono::steady_clock::now() ;
    const auto endp = std::clock() ;

    // std::cout << "  elapsed: " << chrono::duration_cast<chrono::milliseconds>( end - start ).count() << " msecs.\n"
    std::cout << "  elapsed: " << std::chrono::duration_cast<std::chrono::milliseconds>( end - start ).count() << " msecs.\n"
              << "processor: " << double( endp - startp ) * 1000 / CLOCKS_PER_SEC << " msecs.\n" ;
}


If that too does not work, fall back to legacy C++:
#include <iostream>
#include <fstream>
#include <string>
#include <ctime>

int main()
{
    // formatted input
    std::ifstream file( "test.txt" ) ;

    const std::time_t start = std::time(0) ;
    const std::clock_t startp = std::clock() ;

    std::string op, addr ;
    while( file >> op >> addr ) { /* do stuff */ }

    const std::time_t end = std::time(0) ;
    const std::clock_t endp = std::clock() ;

    std::cout << "  elapsed: approximately " << std::difftime( end, start ) << " seconds.\n"
              << "processor: " << double( endp - startp ) * 1000 / CLOCKS_PER_SEC << " msecs.\n" ;
}