text file too big (EDIT: fail)

I wanted to make a "word of the day" program. I found a public domain version of Webster's dictionary at gutenberg.org, and it is arranged in a simple-to-manage format: each main word is fully capitalized, so that is an easy marker to start with.

The problem, if I'm right, is that the book is too long (28 MB), so the stream's state flags stop working. tellg() stops returning useful values a little past 28 million characters, so something seems broken in the stream class. I end up with an infinite loop that file.eof(), file.good(), and file.bad() won't stop.

Is there a way to copy a file in blocks of data rather than character by character? If so, do you have a suggestion for how to turn those blocks back into characters within the program?
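
(Edit: for anyone curious, reading in blocks is possible through std::istream::read and gcount(); below is a minimal sketch, where the 64 KB block size and the appending into a std::string are arbitrary choices, not something from my program.)

#include <fstream>
#include <iostream>
#include <string>

int main()
{
	std::ifstream file("dictionary.txt", std::ios::binary);
	if(!file)
		return 1;

	const std::streamsize BLOCK = 64 * 1024;  // arbitrary block size
	std::string buffer(BLOCK, '\0');
	std::string text;                         // the blocks, back as characters

	// read() grabs up to BLOCK characters at once; gcount() reports
	// how many the last read actually produced (fewer on the final block)
	while(file.read(&buffer[0], BLOCK) || file.gcount() > 0)
	{
		text.append(buffer, 0, static_cast<std::string::size_type>(file.gcount()));
	}

	std::cout << text.size() << " characters copied\n";
}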

Link to the dictionary: http://www.gutenberg.org/ebooks/29765

The code is simple enough, but it's on a Raspberry Pi, so I'll post it in a little bit.

#include <string>
#include <iostream>
#include <fstream>

using namespace std;

// returns true if the first word of longline is all capital letters
bool allcaps(string longline)
{
	
	int i = 0;
	if(longline[i] >= 'A' && longline[i] <= 'Z')
	{
		while(longline[i] != ' ' && longline[i] != '\n')
		{
			if(longline[i] <= 'z' && longline[i] >= 'a')
			{
				return false;
			}
			i ++;
		}
	}
	if(i >= 4)
	{
		return true;
	}
	return false;
}

int main()
{
	ifstream file("dictionary.txt");
	ofstream file2("output.txt");
	string lines;
	bool run = true;
	unsigned long int oldg = 0;
	long newg = 0;
	while(file.good())
	{
		if(allcaps(lines))
		{
			file2 << lines << endl;
			getline(file, lines);
			while(!allcaps(lines))
			{
				file2 << lines << endl;
				getline(file, lines);
			}
		}
		else
		{
			getline(file, lines);
		}
		newg = file.tellg();
//		cout << newg << ' ';

		// to break the loop before tellg() breaks
		if(newg > 28000000)
		{
			break;
		}
	}
	file.close();
	file2.close();
}
Have you considered breaking that large file up into a few smaller files?
I did, but I decided it would be more fun to try to work out the riddle...

I started this program out of boredom, but it's more of a challenge now; part of the game should be that I can't break the file up ahead of program run time.
Breaking it into pieces at run time is proving more difficult than I expected...

I haven't worked with files of such large size before, but I think it's something that I should know how to do.
My first suggestion would be to use the proper types for function return values. For example, what type does tellg() return?

Next, if your program is crashing, be sure to run it with your debugger. The debugger will tell you exactly where it detects the problem and will let you view the variables at the time of the crash.
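
For illustration, tellg() actually returns std::streampos; a minimal sketch of storing it without narrowing (using your dictionary.txt filename):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
	std::ifstream file("dictionary.txt");
	std::string line;
	std::getline(file, line);

	// streampos is what tellg() actually returns; converting through
	// streamoff gives a plain integer offset without narrowing
	std::streampos pos = file.tellg();
	std::streamoff offset = pos;

	std::cout << "offset after one line: " << offset << '\n';
}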



By the way, I don't seem to have any problems reading the file.

My test program:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{
	ifstream file("Dictionary.txt");
	if(!file)
		return 1;

	string lines;
	long long newg = 0;

	// count the lines; getline is checked as the loop condition
	while(getline(file, lines))
	{
		newg++;
	}

	cout << newg << endl;

	return 0;
}


closed account (48T7M4Gy)
FWIW I tried using the Gutenberg text file with a small line-reader program I wrote to tokenize text. It had no problem processing the whole file line by line, start to finish.

I didn't try your program on the file, but I think you need to examine what purpose, if any, newg, tellg, etc. serve.

I suspect you have an indexing problem in allcaps. In short, I doubt that the streaming functionality is broken. What you need to do is check whether lines holds what you expect, first in main and then in allcaps.



While I know there is a limit, I suspect it comes from the file system (and, of course, the disk itself), not from ofstream. It should be in the terabytes on most current OSes.

Just to test your theory, I modified your program (a lot) and it ran fine, finishing in a few seconds. If you want to see for yourself, it's below.

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{
	int count = 0;
	ifstream file("DICT.txt");
	ofstream file2("output.txt");
	string lines;

	while(file.good())
	{
		getline(file, lines);
		file2 << lines << endl;
		count++;
	}

	cout << count;

	file.close();
	file2.close();
}
closed account (48T7M4Gy)
That's the way, Samuel.

Since it's only reading a line at a time, the only limitation will be disk storage, which would have shown up separately as a disk-full error or some such. If each line is read into a string, there is no problem with memory.

What purpose do newg and tellg serve?

This construct is fundamentally wrong:
while( stm.good() ) 
{
   std::getline( stm, line ) ;
   
   // do something with the line that was read
}

It fails to check for input failure after the actual attempt to read something.


#include <iostream>
#include <string>
#include <sstream>

int main()
{
   std::string test_str ;
   for( int i = 0 ; i < 3 ; ++i )
   {
       auto n = std::to_string(i+1) ;
       test_str += n + ". This is line #" + n + '\n' ;
   }

   {
       std::istringstream stm(test_str) ;

       std::string line ;
       int line_count = 0 ;

       std::cout << "-----  bad  -----\n" << stm.str() << "----------------\n" ;
       // check for input failure *before* an attempted input
       while( stm.good() )  // ****
       {
           std::getline( stm, line ) ;
           ++line_count ;
       }
       std::cout << line_count << " lines were read\n" ;
   }

   {
       std::istringstream stm(test_str) ;

       std::string line ;
       int line_count = 0 ;

       std::cout << "\n-----  good  -----\n" << stm.str() << "----------------\n" ;
       // check for input failure *after* an attempted input
       while( std::getline( stm, line ) ) // ****
       {
           ++line_count ;
       }
       std::cout << line_count << " lines were read\n" ;
   }
}

http://coliru.stacked-crooked.com/a/968c8029be79703b
Thank you, everyone.

I am writing the program for a Raspberry Pi 2, which has a 32-bit CPU instead of the 64-bit one most modern computers have, so I thought streampos would have a limit small enough for the dictionary's character count to hit.

NOPE.
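
A quick check backs that up: std::streamoff is the integer type behind streampos, and printing its size shows it has far more range than a 28 MB file needs.

#include <iostream>
#include <ios>

int main()
{
	// streamoff is the integer type behind streampos; even on a
	// 32-bit target it is typically 8 bytes, far more than 28 million
	std::cout << "sizeof(std::streamoff) = " << sizeof(std::streamoff) << '\n';
}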

The problem is that my nested while loop in main has no exit: it keeps reading until it finds the next keyword, and of course the dictionary isn't going to end on a keyword, so that is where the infinite loop was occurring.

(Smacking forehead)

Thanks for the suggestions, I'll put them into effect.
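
For anyone who finds this later, here is a sketch of the corrected loop: std::getline is checked after every read, so hitting the end of the file exits the nested loop instead of spinning forever. The allcaps test is a bounds-checked variant of my original.

#include <fstream>
#include <string>

// bounds-checked variant of the original test: true if the first
// word is at least four characters long and contains no lowercase
bool allcaps(const std::string& s)
{
	if(s.empty() || s[0] < 'A' || s[0] > 'Z')
		return false;
	std::string::size_type i = 0;
	while(i < s.size() && s[i] != ' ')
	{
		if(s[i] >= 'a' && s[i] <= 'z')
			return false;
		i++;
	}
	return i >= 4;
}

int main()
{
	std::ifstream file("dictionary.txt");
	std::ofstream file2("output.txt");
	std::string lines;

	// check getline after every read, so hitting end of file
	// ends the inner loop instead of spinning forever
	bool have_line = static_cast<bool>(std::getline(file, lines));
	while(have_line)
	{
		if(allcaps(lines))
		{
			do
			{
				file2 << lines << '\n';
				have_line = static_cast<bool>(std::getline(file, lines));
			} while(have_line && !allcaps(lines));
		}
		else
		{
			have_line = static_cast<bool>(std::getline(file, lines));
		}
	}
}

With that change the tellg() guard at the bottom of the old loop is no longer needed.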