Fast file reading in C++

Hello! In my country there is a pretty popular file reading routine used in the olympiad, which I'm going to present here, but I don't really understand it. I would be really thankful if someone could explain how it works, and secondly, what BUF_SIZE is and why it is 1 << 17 (maybe it's the amount of chars in the file, I don't know). I want to understand how it works and why BUF_SIZE is 1 << 17; this file reading function is super efficient. I tested it today on maximal sum sequence, and with this special reading it went from 1152 ms to 157 ms:
#define BUF_SIZE 1 << 17
 
int pos = BUF_SIZE;
char buf[BUF_SIZE];
 
inline char nextch() {
    if (pos == BUF_SIZE) fread(buf, BUF_SIZE, 1, fin), pos = 0;
    return buf[pos++];
}
 
inline int read() {
    char ch;
    while (!isdigit(ch = nextch()));
    int x = ch - '0';
    while (isdigit(ch = nextch())) x = 10 * x + ch - '0';
    return x;
}
There's nothing special about it. It just reads the file in blocks of 131072 (2^17) bytes at a time.

I would be really thankful if someone could explain how it works
It reads the file one big chunk at a time into a buffer, then until the buffer runs out it successively returns every byte in that buffer. Once it runs out it reads another chunk.
One problem I see with this implementation is that there's no checking to see if the size of the input file is a multiple of 2^17.
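For what it's worth, here's a minimal sketch of how this kind of reader is typically wired up in a full solution. The functions are repeated verbatim from your post so the example is self-contained; the file name and the summing loop are just placeholders, and the end-of-file concern above still applies:

#include <cstdio>
#include <cctype>

#define BUF_SIZE 1 << 17

FILE *fin;                 // the snippet assumes an already-open FILE* named fin
int pos = BUF_SIZE;
char buf[BUF_SIZE];

inline char nextch() {
    if (pos == BUF_SIZE) fread(buf, BUF_SIZE, 1, fin), pos = 0;
    return buf[pos++];
}

inline int read() {
    char ch;
    while (!isdigit(ch = nextch()));
    int x = ch - '0';
    while (isdigit(ch = nextch())) x = 10 * x + ch - '0';
    return x;
}

int main() {
    fin = fopen("input.txt", "r");   // placeholder file name
    int n = read();                  // e.g. first number says how many values follow
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += read();
    printf("%lld\n", sum);
    fclose(fin);
    return 0;
}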

secondly, what BUF_SIZE is and why it is 1 << 17
2^17 was chosen arbitrarily. It's probably overkill, to be honest. 2^12 is probably enough.

I want to understand how it works and why BUF_SIZE is 1 << 17; this file reading function is super efficient
Imagine I hire you to fetch water, and I want to get a million liters of water. Is it going to be faster a) if I give you a 1 liter bottle and have you bring 1 liter a million times, or b) if I give you a 100,000 liter tanker truck and have you bring 100,000 liters 10 times?
Thanks, but you can test it and you will see it's super efficient; as I said, my max sum sequence got faster by 1 s.
So is BUF_SIZE always the same? Or does it depend on what data you read?
Also, as I said, I copied this algorithm from someone's source for a problem, and that problem had its limits specified, so maybe that's why it's 2^17. That's why I asked what to put in BUF_SIZE.
Thanks, but you can test it and you will see it's super efficient; as I said, my max sum sequence got faster by 1 s.
Reading a file block by block is the standard way of doing it. If your code was previously doing it byte by byte, it would certainly have been much, much slower. It's not that this code is super fast; it's that your previous code was awfully slow.
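For comparison, the slow version was presumably reading one character per library call, roughly like the sketch below (or using scanf/cin >>, which tends to be slower still); the per-call overhead on every single byte is where most of that original 1152 ms likely went:

#include <cstdio>
#include <cctype>

// One library call per character. fgetc is buffered internally, but the call
// overhead in a tight parsing loop is still much higher than indexing into a
// local buffer the way the block-at-a-time reader above does.
inline int read_slow(FILE *fin) {
    int ch;
    while (!isdigit(ch = fgetc(fin)));
    int x = ch - '0';
    while (isdigit(ch = fgetc(fin))) x = 10 * x + ch - '0';
    return x;
}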

Actually, there are even faster ways than this to read files, but they involve using threads or asynchronous I/O. It can't be done using standard C library functions, AFAIK.
Basically what you do is, you have two buffers, one for reading and one for processing. While one buffer is being filled with data from the file, the other is being processed by the program. Once both the reading and the processing complete, the buffers are swapped and the next batch is read in and processed (again simultaneously).
Double buffering lets you process data as fast as the storage can provide it, as long as your processing function is faster than the storage device.
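A rough sketch of that double-buffering idea, assuming a process() placeholder that consumes each chunk of bytes and using std::async from <future> for the background read (so it's C++ rather than plain C):

#include <cstdio>
#include <future>
#include <utility>
#include <vector>

// Placeholder: whatever parsing/work the program does on a chunk of bytes.
void process(const char *data, size_t n) { /* real code would parse here */ }

void read_double_buffered(FILE *f) {
    const size_t BUF = 1 << 17;
    std::vector<char> a(BUF), b(BUF);
    char *reading = a.data(), *processing = b.data();

    // Prime the first buffer synchronously.
    size_t got = fread(reading, 1, BUF, f);
    while (got > 0) {
        std::swap(reading, processing);   // the freshly read chunk becomes the one to process
        // Start filling the other buffer in the background...
        auto next = std::async(std::launch::async,
                               [dst = reading, f, BUF] { return fread(dst, 1, BUF, f); });
        process(processing, got);         // ...while this chunk is processed
        got = next.get();                 // wait for the background read to finish
    }
}

The key point is that the read of chunk k+1 overlaps with the processing of chunk k, so whichever of the two is slower sets the overall pace.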

So is BUF_SIZE always the same? Or does it depend on what data you read?
Like I said, BUF_SIZE was chosen arbitrarily. Any size larger than a few thousand bytes (let's say 10,000) will be enough.
When testing, be sure not to run it back to back. The OS and the disk may both cache some of the disk data, making subsequent runs faster than the initial run. Or, discount the first couple of runs, then run it a bunch of times and check those.

Lots of little things affect disk reads: where the file is on the disk, the number of fragments, the size of sectors, the RPM of the disk, buffers, whether it's an SSD, and other factors (what else is going on at the disk level is a big one).

/shrug We used to be told that reading one sector at a time into a buffer was the 'most efficient'. How this relates to 2^17 is a mystery to me.

I almost always read the whole file into memory at once. I just find it easier to deal with that way, and it reads and writes fairly efficiently. For very, very large files that you want to display, reading some and displaying part while processing the rest is nice; good word processors and video players do this sort of thing. Most of what I do is throwaway processing that is file in -> file out, and the all-at-once approach works for me, but it depends very much on what you are doing.

I don't have a lot more to offer; just be wary of the rabbit hole of improvement. You improved your effort by an order of magnitude. It is very difficult to do that twice; you could spend a month playing with this code and only end up saving another few ms. At some point, call it good enough. My all-at-once approach handles multi-GB files in a few tens of seconds to read, process, and write. Even if I doubled its speed, it would not be a great use of my time... hours spent to save seconds? Just something to think about.

As I said, it's for the olympiad and I'm participating in it (actually, it's tomorrow), so you need a really good execution time.
I saw. I think what you have is fine, you should probably be working on other areas now.
If you want a better overall execution time, read the file in a thread before you need it while you initialize and do other tasks.
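Something along these lines, for example; std::async and the init_other_stuff() placeholder are only there to illustrate the idea:

#include <cstdio>
#include <future>
#include <vector>

// Placeholder for whatever setup the program needs before touching the input.
void init_other_stuff() { /* build tables, allocate, etc. */ }

int main() {
    // Kick off reading the whole file in the background...
    auto pending = std::async(std::launch::async, [] {
        std::vector<char> bytes;
        if (FILE *f = std::fopen("input.txt", "rb")) {   // placeholder file name
            char chunk[1 << 16];
            size_t n;
            while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0)
                bytes.insert(bytes.end(), chunk, chunk + n);
            std::fclose(f);
        }
        return bytes;
    });

    init_other_stuff();                       // ...while other initialization runs here

    std::vector<char> input = pending.get();  // blocks only if the read isn't done yet
    // parse `input` from here on
    return 0;
}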

And, good luck!
I get that you are trying to squeeze every last drop of performance out of your code so that you do well in the olympiad. I guess what others are saying is that in the real world, sometimes a few msecs of performance is not an issue. But the olympiad is not the real world.

Another thing you will learn in the real world is you need to use a style that is easier for others to read and understand. The following rewrites make your code a little bit easier to understand without hurting performance.

original:
if (pos == BUF_SIZE) fread(buf, BUF_SIZE, 1, fin), pos = 0;

rewrite:
if (pos == BUF_SIZE) 
{
    fread(buf, BUF_SIZE, 1, fin);
    pos = 0;
}


original:
while (!isdigit(ch = nextch()));

rewrite:
while (!isdigit(ch = nextch()))
{
  /*no-op - just consume the character */
  ;
}


This one (semi-colon at the end of a while statement) is actually an example of a common coding mistake and should be flagged in a reasonably thorough code inspection. Expressly commenting the no-op explains to the casual reader that the behavior is intentional.


However, @helios stated the following:

One problem I see with this implementation is that there's no checking to see if the size of the input file is a multiple of 2^17.


Regardless of whether the file is a multiple of 2^17, I don't see how you know that the file is exhausted. If you have read the entire file and continue to call read(), you will:

- continue to return bytes past the last byte delivered by the final fread, until pos == 2^17
- then call fread again (which will fail), reset pos to 0, and keep re-reading the stale bytes in the array.

There is nothing in your code to determine if your fread was successful and how many bytes were read (checking the return value from fread tells you that). And there is no way for read() to indicate that it did not generate another integer value.
Actually, a file smaller than 2^17 bytes gives the program undefined behavior, and any larger file consisting exclusively of digit characters causes it to hang.
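To make that concrete, here's one way the reader could be made end-of-file aware by checking what fread actually returns; swapping the fread arguments to (1, BUF_SIZE) so it reports a byte count, the EOF sentinel, and the bool-returning read_int() are just one possible convention, not the original interface:

#include <cstdio>
#include <cctype>

const int BUF_SIZE = 1 << 17;
FILE *fin;                    // assumed to be opened elsewhere
char buf[BUF_SIZE];
int len = 0, pos = 0;         // len = how many bytes the last fread actually produced

inline int nextch() {         // returns EOF once the file is exhausted
    if (pos == len) {
        len = (int)fread(buf, 1, BUF_SIZE, fin);  // size 1, count BUF_SIZE: a short
        pos = 0;                                  // final block still reports its length
        if (len == 0) return EOF;
    }
    return (unsigned char)buf[pos++];
}

// Returns false when there is no further integer in the input.
inline bool read_int(int &x) {
    int ch;
    do {
        ch = nextch();
        if (ch == EOF) return false;
    } while (!isdigit(ch));
    x = ch - '0';
    while (isdigit(ch = nextch())) x = 10 * x + ch - '0';
    return true;
}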
fread() itself is buffered, so you're actually making two copies. You might find that the low-level read() system call is faster.

The fastest I/O that I'm aware of is achieved by memory mapping the file into the program's address space.
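For the curious, a bare-bones POSIX sketch of that memory-mapping approach (mmap is not standard C or C++; on Windows the equivalent is CreateFileMapping/MapViewOfFile, and error handling here is kept minimal):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an entire file read-only into the process's address space.
// Returns nullptr on failure; unmap with munmap(ptr, size) when finished.
const char *map_file(const char *path, size_t &size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return nullptr; }
    size_out = (size_t)st.st_size;
    void *p = mmap(nullptr, size_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        // the mapping stays valid after the fd is closed
    return p == MAP_FAILED ? nullptr : (const char *)p;
}

The whole file then looks like one contiguous char array, so parsing integers becomes plain pointer arithmetic with no explicit read calls at all.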

I try to make any sequential file-processing program I/O bound. Since accessing a mechanical disk is around a million times slower than accessing RAM, if your program isn't I/O bound, it's a red flag that you have a performance problem.

Good luck at the olympiad!