access large text file with index

How can I read a large text file containing numbers by index, without first reading the whole data? Is there any method where I can directly save the data to memory and then access it by index like an array? If not, how can I at least read the data very fast? The size of the data can be 300 MB to 1 GB, so reading the whole file first is not suitable for performance.
What do you know about the format of the file?
You said "contains numbers", but what is between them?
You said "index". Does the file contain N numbers and you want to read the 42nd number?
Actually I don't know the format in much detail, but it seems my problem is a little different. What I want is to first write a multidimensional array to a file somehow, and then read a value from that array or map by using the index of each dimension. For example, to write:
// xo (a std_msgs message) and zw (the 10x10 int array) are declared elsewhere
rosbag::Bag bagr("test.bag", rosbag::bagmode::Write);
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        xo.data = zw[i][j];
        bagr.write("numbers", ros::Time::now(), xo);
    }
}
bagr.close();


So now there are 10*10 int values in the file test.bag. Now I want to read the value at any index i and j instead of reading the whole file first. Here the size of each block is fixed and the data is stored in blocks.
Use read() and write() to read/write the numbers in their native format. This way they will be fixed size and also you'll save the time of converting to/from text.

Since they are fixed size, you can calculate the exact position of each number in the file. Seek to that position and read it.
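For example, something along these lines (just a sketch, not your rosbag code; the file name, dimensions, and the int32 element type are assumptions):

#include <cstdint>
#include <fstream>
#include <iostream>

int main()
{
    const int rows = 10, cols = 10;

    // Write the grid in native binary form, one int32 per cell.
    {
        std::ofstream out("grid.bin", std::ios::binary);
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                std::int32_t v = i * cols + j;   // placeholder data
                out.write(reinterpret_cast<const char*>(&v), sizeof v);
            }
    }

    // Read one element directly: offset = (i * cols + j) * sizeof(int32_t).
    int i = 7, j = 3;
    std::ifstream in("grid.bin", std::ios::binary);
    in.seekg(static_cast<std::streamoff>(i * cols + j) * sizeof(std::int32_t));
    std::int32_t value;
    in.read(reinterpret_cast<char*>(&value), sizeof value);
    std::cout << value << '\n';
}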

An even better alternative is to memory-map the file. See the mmap() system call for details.
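A rough sketch of the mmap() route on a POSIX system (error handling omitted; same assumed file name and layout as above):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main()
{
    int fd = open("grid.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only into our address space.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    const std::int32_t* data = static_cast<const std::int32_t*>(p);

    const int cols = 10;                        // must match how the file was written
    int i = 7, j = 3;
    std::printf("%d\n", data[i * cols + j]);    // pages are faulted in on demand

    munmap(p, st.st_size);
    close(fd);
}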
Assuming you need to get more than a tiny number of records from the file (say, an ongoing, interactive query/response session?)

That file is not really that big in today's world. You can read the whole file into one big char array if you want, then come up with an efficient way to parse it. strstr is very fast. If you are in a hurry, you can split the buffer (carefully, based off your knowledge of the file format) and put 1 buffer on 1 thread for each core/cpu in your system.

If you want to get even faster/fancier, you can do something like hash part of each record into a SHA-1 or similar value and boil that down into an O(1)* lookup table. This will cost you, say, 10 seconds or less on the front end to load the file and prep it, and then you will have nanosecond response for the duration of the program after that. It sounds like you already have the hash key, if you convert it back from string to number format, but you will still have to do the initial load of the table.

* It might be wise to allow 10 or so records to share 1 hash key, just in case, so you might need to iterate over 0-10 or something small to find a record.
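Something like this captures the one-time-load / O(1)-lookup idea, using std::unordered_map's built-in hashing instead of SHA-1 (the "one key and value per line" file format is an assumption):

#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <iostream>

int main()
{
    std::unordered_map<long long, std::string> table;

    // Pay the load cost once up front...
    std::ifstream file("data.txt");
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream rec(line);
        long long key;
        if (rec >> key)
            table[key] = line;           // or store just the fields you need
    }

    // ...then every query afterwards is a hash lookup.
    auto it = table.find(42);
    if (it != table.end())
        std::cout << it->second << '\n';
}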


If I have many large objects (say a hundred plus variables per object) I sometimes make a key-file where I have an identifier and the tellg/seekg position of where it was initially written. When you need a specific object look it up in the key-file then jump to that position in the large file. The bigger the objects themselves, the more benefit you'll get from this method.

How experienced are you with binary read/write? It could speed up this process.

The downside here is more complexity on your side, and another point where the accuracy of your formatting is important. Always remember to do your error checking.
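A minimal sketch of the key-file idea (file names and record format are made up):

#include <fstream>

int main()
{
    std::ofstream big("objects.dat");
    std::ofstream keys("objects.key");

    for (int id = 0; id < 1000; id++) {
        keys << id << ' ' << big.tellp() << '\n';          // identifier + offset in the big file
        big << "field1 field2 field3 ... for object " << id << '\n';
    }
}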

EDIT:
Haha, original post was in December, maybe Dinesh999lama could tell us what they actually used...
I think the overall design idea of what I describe above is comparable to a hash map or a hash table although the hash function is tellg/seekg...

Is your file sorted in any kind of manner? If alphabetical for instance, make a key for where aa ab ac ad af etc. are all first encountered in the file. Or every time a number increases by 100 in value... that kind of thing.
I don't know if this is the fastest method for reading the whole file (if you're stuck with that option) but it should be pretty efficient.
#include <fstream>
#include <sstream>
using namespace std;
int main()
{
    ifstream file("fileName.txt");
    stringstream stream;

    if(file.good())
    {
        stream << file.rdbuf();
    }
    // then read from this stream object like it was a cin stream.
}

Does anyone have an opinion of faster ways to read a single large file?
I'm pretty sure seekg(pos) can still be used with this method if you know the data can't be before some point in the file...


Edit:

Thanks, dhayden is right, I haven't dealt with files large enough to fill RAM and swap before. Not a good situation. Full reboot at least, and I was probably lucky not to corrupt my filesystem.

Technically I was trying the opposite of what I said: pushing text into a stringstream to then push into a file... Same idea in the issues department. I overflowed RAM and my console got pushed into swap, so I couldn't interact with it while the program kept running the loop... bad.

I thought that the kernel would have a maximum allowed memory allocation limit. Nope.
If you need to read it many times, you could parse it once and create an index file. Then you could read that to determine the seek position for the record you actually want in the text file.
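For example, a sketch of the read side, assuming an index file of "identifier offset" pairs like the key-file sketch earlier in the thread:

#include <fstream>
#include <map>
#include <string>
#include <iostream>

int main()
{
    // Load the small index file once: identifier -> seek position.
    std::map<int, std::streamoff> index;
    std::ifstream keys("objects.key");
    int id;
    std::streamoff pos;
    while (keys >> id >> pos)
        index[id] = pos;

    // Jump straight to the record you want in the big text file.
    std::ifstream big("objects.dat");
    big.seekg(index[42]);
    std::string record;
    std::getline(big, record);
    std::cout << record << '\n';
}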
// then read from this stream object like it was a cin stream

But if you're going to read from the stringstream then why go to the overhead of creating it in the first place? Why not just read directly from the ifstream?

When you read the file into a stringstream, it has to read the entire file into RAM. If the file is large, then this could be a problem. Also, the string has to grow dynamically. I believe this typically means that when it runs out of space, it allocates space for a string about 50% larger, then copies the old string to the new position. If the string is large then these copy operations become slow.

Does anyone have an opinion of faster ways to read a single large file?

If you can read the file in large chunks, use read() and write() to avoid the overhead of buffering. If your OS supports it, you can use mmap() to memory-map the file into your address space.
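For example, a sketch of chunked reading with read() (the 4 MB chunk size is an arbitrary choice):

#include <fstream>
#include <vector>

int main()
{
    std::ifstream file("fileName.txt", std::ios::binary);
    std::vector<char> buf(4 * 1024 * 1024);   // 4 MB chunks

    while (file) {
        file.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = file.gcount();   // bytes actually read this pass
        if (got > 0) {
            // hand buf[0 .. got) to the parser here
        }
    }
}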
OK, I'm jumping on the mmap bandwagon, especially if your data is not static or requires sorting:

http://stackoverflow.com/questions/9076049/adding-data-at-beginning-of-file

It's either that or chop the file up in a meaningful manner. Since I don't know what's in the file I can only make suggestions like chronological (when user data was created), by first two-three letters of client's name, by area code/geographical location, etc. If there's no need for there to be a giant file then don't make it one.
Topic archived. No new replies allowed.