Binary file - eof() stops too early

Hi,

I use a binary file in my program:
ifstream trieFile;
trieFile.open("doc3.trie", ios::binary);


Then I send it to a function using the following command:
inWildcard(trieFile, "a*bc");

This is the declaration of the function:
int inWildcard(ifstream& trieFile, string word, bool oc = 0);

In this function there is a while loop:
int i =0;
while (!trieFile.eof())
{
  readBlock(trieFile, i);
  //...
  i++;
}


The binary file is written and read block by block (the code is pretty long, if you need more information - tell me).
The writing looks good from what I saw using "Hex Editor".
For some reason, after reading the 20th block from the file, eof() becomes true (and trieFile.tellg() returns -1), which breaks the loop.
There are more than 100 blocks in the file.

Why is that???
Something is going wrong either inside your readBlock() function or in the code you omitted at line 5 (the //... placeholder in the loop). I suspect readBlock rather than the omitted code, but I can't tell without seeing it.
Actually, I used the same eof() loop in two other functions, and the same problem showed up, so I really doubt that the problem is in the omitted code at line 5.
In this specific loop, only two lines use trieFile, and both of them are calls to readBlock.

Here's the readBlock function:
Trienode* Triedoc::readBlock(ifstream& trieFile, int sn)
{
	trieFile.seekg(sizeof(triebuff)*(sn / 10));		// move read position to the next letter's block
	trieFile.read(triebuff, sizeof(triebuff));			// read a block
	return (Trienode*)&triebuff[sizeof(Trienode)*(sn % 10)];	// return the next letter's node
}


Trienode represents a node in a retrieval tree.
If you need anything else...
>trieFile.seekg(sizeof(triebuff)*(sn / 10));
- integer division returns an integer.
- Why do you need to seek every time?

> return (Trienode*)&triebuff[sizeof(Trienode)*(sn % 10)];
So triebuff can hold at least 10 Trienodes, and you then pick one of the 10.
If you want a Trienode, why don't you read a Trienode?


Also, what do you do with the return value? On the next read you fill the buffer again, so the pointer still points to the same place but the content has changed.
I don't seek every time, but I wanted to simplify the function. This is the full function:
Trienode* Triedoc::readBlock(ifstream& trieFile, int sn)
{
	if (sn < ((Trienode*)&triebuff[0])->nodeserialnr || sn > ((Trienode*)&triebuff[0])->nodeserialnr + 9)
	{
		trieFile.seekg(sizeof(triebuff)*(sn / 10));		// move read position to the next letter's block
		trieFile.read(triebuff, sizeof(triebuff));			// read a block
	}
	return (Trienode*)&triebuff[sizeof(Trienode)*(sn % 10)];	// return the next letter's node
}


The problem exists anyway (with 'if' or without it).

> Also, what do you do with the return value?


I assign it to a Trienode* variable.
This is the line in the while loop that uses it:
	int j = 0;
Trienode* pnode = readBlock(trieFile, 0);	// points to the root
while (!trieFile.eof())
{
	pnode = readBlock(trieFile, j);
	//...
	j++;
}

In the loop, I read node by node from the *.trie file. The loop just stops after reading the 20th node.
By the way, I don't know why, but from what I checked, trieFile.tellg() is -1 throughout the 3rd buffer: while I read nodes 20-29 it is -1. But when I read the 30th node onward, everything works just fine.

> in the next reading you fill the buffer again, the pointer still points to the same place but the content changed.


That's correct. But the function returns the node I need... (sn % 10).
Looping on eof() is almost always the wrong thing to do. This is because eof() does not become true until after a read has tried to go past the end of the file, so the loop body runs one extra time on a read that failed.

The idiomatic way to read from a file is to put the read operation within the test condition of the while() loop like this:

while(std::getline(file, line))
{
    // line is good here because the read succeeded
}


Your code fails to check that the read was a success before using the value returned.
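
To make the difference concrete, here is a minimal sketch that counts loop iterations over the same binary file with both styles. The file layout (a run of raw ints) and the helper names writeInts, countWithEofLoop and countWithReadLoop are made up for this illustration; they are not from the original program.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: write `count` raw ints to a binary file.
void writeInts(const std::string& path, int count)
{
    assert(count >= 0);
    std::ofstream out(path, std::ios::binary);
    for (int i = 0; i < count; ++i)
        out.write(reinterpret_cast<const char*>(&i), sizeof i);
}

// Loop on eof(): eofbit is only set by the read that fails, so the body
// runs once more than there are records. Returns -1 if the file cannot be opened.
int countWithEofLoop(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return -1;
    int iterations = 0;
    int value = 0;
    while (!in.eof()) {
        in.read(reinterpret_cast<char*>(&value), sizeof value); // the final call fails
        ++iterations;                                           // but is still counted
    }
    return iterations;
}

// Read in the loop condition: the body runs only after a successful read.
int countWithReadLoop(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return -1;
    int iterations = 0;
    int value = 0;
    while (in.read(reinterpret_cast<char*>(&value), sizeof value))
        ++iterations;
    return iterations;
}
```

For a file holding three ints, the eof() version runs its body four times and the second version three times. On the extra pass, value still holds whatever the previous read left there.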

Your read function does not lend itself to the idiomatic usage because it does not return the input stream like the standard library read functions do. Therefore, in your case, I would suggest trying something like this:

while(trieFile) // test the stream not eof()
{
    pnode = readBlock(trieFile, j);
    
    if(!trieFile) // re-test the stream to make sure it succeeded
        break;
        
    // pnode should be good to go here
    // however even if the read succeeded
    // if there was a problem parsing the 
    // input you still should check for that failure
}
Trienode buffer[10];
while (trieFile.read(reinterpret_cast<char*>(buffer), sizeof(buffer))) {
   // operate on the nodes buffer[0] to buffer[9]
}
You said that the writing of the data is fine. When you read it, are you getting what you expect? Can you post the declaration of class Trienode?

What is the initial value of triebuff? The first time you call readBlock(), the code that checks sn appears to assume that triebuff already holds valid data.

Try adding some code after calling trieFile.read() to see if trieFile.good() is still true.
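
One possible way to address both points is sketched below, with the caveat that it is only an illustration. Node is a simplified stand-in for Trienode, and BlockCache, kNodesPerBlock and this readNode are hypothetical names. The cache records which block it holds, using -1 for "nothing loaded yet", instead of inspecting the buffer contents. It also reports a failed read rather than returning a pointer into stale data.

```cpp
#include <cassert>
#include <fstream>
#include <istream>

// Hypothetical, simplified stand-in for Trienode (plain data only).
struct Node {
    long serial;
    char letter;
};

const int kNodesPerBlock = 10; // assumption: 10 nodes per block, as in the thread

struct BlockCache {
    Node nodes[kNodesPerBlock];
    long loadedBlock = -1; // -1 means "nothing loaded yet"
};

// Return a pointer to node `sn`, reloading the block only when needed.
// Returns nullptr if the seek or read fails, leaving the cache marked empty.
const Node* readNode(std::istream& in, BlockCache& cache, long sn)
{
    assert(sn >= 0);
    const long block = sn / kNodesPerBlock;
    if (block != cache.loadedBlock) {
        cache.loadedBlock = -1; // invalidate until the read succeeds
        in.clear();             // drop any eof/fail state from an earlier read
        const std::streamoff offset =
            static_cast<std::streamoff>(block) *
            static_cast<std::streamoff>(sizeof cache.nodes);
        in.seekg(offset);
        if (!in.read(reinterpret_cast<char*>(cache.nodes), sizeof cache.nodes))
            return nullptr;
        cache.loadedBlock = block;
    }
    return &cache.nodes[sn % kNodesPerBlock];
}
```

A caller would then test the returned pointer before dereferencing it. Note that the pointer is still only valid until the next call that reloads the cache.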
> Try adding some code
Try using a debugger
> Can you post the declaration of class Trienode?


struct Trienode
{
	// DATA:
	long int nodeserialnr;		// serial number of the vertex
	long int firstoffset;		// position of the first occurrence of the word in the text file
	long int nrofoccurrences;	// number of word's occurrences
	unsigned char letter;		// letter in ASCII code
	bool wordend;			// true if this letter is the last letter of a word

	// LINKS:
	long parentserialnr;		// serial number of the parent node
	long links[256];		// array to point the next vertex by its serial number

	Trienode();
	~Trienode();
};


I just noticed that the problem now occurs only with one specific document I indexed, and I don't know why.

Anyway, while trying to fix it, I noticed that after reading the last block/buffer of the *.trie file, tellg() returns -1, and if I then try to read a previous block (using readBlock), the buffer just stays at the last one.

How can I seekg() back to a previous position after the stream has reached the end of the file?
I think clear() fixed the problem I mentioned in my last post.
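
For anyone landing here with the same symptom: once a read runs past the end of the file, the stream has failbit set, tellg() returns -1, and further seekg() and read() calls do nothing until the state is cleared. A minimal sketch of the fix follows; readAt is a hypothetical helper name, not something from my program.

```cpp
#include <cassert>
#include <fstream>
#include <istream>

// Seek to `pos` and read `size` bytes into `dest`.
// The state is cleared first, because a stream with failbit set
// ignores seekg() and read(), and tellg() reports -1.
bool readAt(std::istream& in, std::streamoff pos, char* dest, std::streamsize size)
{
    assert(dest != nullptr && size >= 0);
    in.clear();                 // reset eofbit/failbit left by an earlier failed read
    in.seekg(pos);
    in.read(dest, size);
    return in.gcount() == size; // true only if the whole range was read
}
```

Calling clear() unconditionally is a simplification. It would also hide a genuine I/O error, so real code may want to check bad() first.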

About your suggestions to replace eof() with another method, I have another question:
Since the program reads block by block, eof() (or any of the other conditions you suggested) becomes true as soon as the last block is read. As a result, the loop breaks after processing only the first node of that block.

I need a different condition, so that the loop breaks only after the last node in the file has been read. Something like:
if (trieFile.eof() && last node in block)
or a way to find out the serial number of the last node saved in the *.trie file.

Please note that the last block is not always full of relevant nodes.
If the number of nodes is not divisible by 10, the last block will contain x relevant nodes and 10-x irrelevant nodes (leftovers from the previous buffer contents).

Thanks a lot!
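
One possible approach, sketched under some assumptions: derive the number of blocks from the file size so the loop bound no longer depends on eof(), then treat a slot as a leftover when its stored serial number does not match its position in the file. This only works if every real node stores its own serial number (as nodeserialnr does) and nodes are written in serial order. Node, kNodesPerBlock, countBlocks and countValidNodes are hypothetical names for this illustration. Writing the node count into the file when it is created would be more robust, if the format can be changed.

```cpp
#include <cassert>
#include <fstream>
#include <istream>

// Hypothetical, simplified stand-in for Trienode (plain data only).
struct Node {
    long serial; // plays the role of nodeserialnr
    char letter;
};

const int kNodesPerBlock = 10; // assumption: 10 nodes per block, as in the thread

// Number of whole blocks in the stream, derived from its size.
// Leaves the read position at the beginning.
long countBlocks(std::istream& in)
{
    in.clear();
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (size < 0)
        return 0;
    const std::streamoff blockSize =
        static_cast<std::streamoff>(sizeof(Node) * kNodesPerBlock);
    return static_cast<long>(size / blockSize);
}

// Count nodes whose stored serial number matches their slot position.
// Stops at the first mismatch, which is treated as stale padding.
long countValidNodes(std::istream& in)
{
    const long blocks = countBlocks(in);
    Node block[kNodesPerBlock];
    long valid = 0;
    for (long b = 0; b < blocks; ++b) {
        if (!in.read(reinterpret_cast<char*>(block), sizeof block))
            break;
        for (int i = 0; i < kNodesPerBlock; ++i) {
            if (block[i].serial != b * kNodesPerBlock + i)
                return valid;
            ++valid;
        }
    }
    assert(valid <= blocks * kNodesPerBlock);
    return valid;
}
```

With the count known up front, the main loop can run over serial numbers 0 to count-1 and never needs to test eof() at all.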
> Since the program reads block by block

I was going to mention this after you got the code working. Why are you reading blocks? You're doing a lot of buffering that the OS and library are perfectly willing to do for you.
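
As a rough sketch of what that could look like: seek straight to the node's offset and read a single node, leaving the buffering to the stream. This assumes nodes are stored back to back at sn * sizeof(node) and that the node type is plain data that is safe to read as raw bytes. Node here is a simplified stand-in for Trienode, and this readNode is a hypothetical replacement for readBlock.

```cpp
#include <cassert>
#include <fstream>
#include <istream>

// Hypothetical, simplified stand-in for Trienode (plain data only).
struct Node {
    long serial;
    char letter;
};

// Read node `sn` directly from its offset in the stream.
// Returns false (and leaves `out` untouched) if the node cannot be read in full.
bool readNode(std::istream& in, long sn, Node& out)
{
    assert(sn >= 0);
    Node tmp;
    in.clear(); // drop any eof/fail state from an earlier read
    in.seekg(static_cast<std::streamoff>(sn) *
             static_cast<std::streamoff>(sizeof(Node)));
    if (!in.read(reinterpret_cast<char*>(&tmp), sizeof tmp))
        return false;
    out = tmp;  // the caller gets its own copy, unaffected by later reads
    return true;
}
```

Because the caller receives a copy, a later read cannot change a node it is still using, which also sidesteps the pointer-into-buffer issue raised earlier in the thread.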