Large file and float arrays - Stack overflow problem

Hi folks,

I am quite new to C++.

I need to read a large file (a 5-column float array with ~500,000
lines) and then use this array throughout my code to do some
arithmetic operations. Because the array is so large, I cannot even read the file: I get a stack overflow. Could anyone help me solve this
problem, please?


This is what I was doing:
#include <iostream>
#include <fstream>
#include <string>
#include <stdio.h>
#include <math.h>
#include <vector>
#include <stdlib.h>

using namespace std;

/*Called Functions*/
float getFileNLines();

/************************************************************/
int main()
{

  int i, j;

  /*Opening files and defining columns**********************/
  int n, ncol;
  n=getFileNLines();
  ncol=5;

  ifstream file ("myfile");
  double matrix[n][ncol];

  if(file.is_open()){
    for(i=0; i<n; i++){
      for (j=0; j<ncol; j++){	
	file >> matrix[i][j];
      }
    }
  }
  else
    cout << "Unable to open the  myfile! \n";
  
  return 0;
}

/*************************************************************
Called functions
************************************************************/

float getFileNLines()
{

   ifstream file ("myfile");
   string line;
   vector<string> lines;
   while( getline( file, line ) ) lines.push_back( line );
   float num;
   num=lines.size();

   return num;
}

I think the problem is that you're using a vector to store all those strings, when really all you need to do is keep a counter.

int counter;
for (counter = 0; !file.eof(); counter++)
	getline(file, line);
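
For what it's worth, here is a minimal counting sketch that uses getline() itself as the loop condition, so it stops exactly when a read fails (the filename is just the OP's "myfile"):

#include <fstream>
#include <string>

// Count lines by reading until getline() fails, so nothing past the
// end of the file is counted and a read error stops the loop.
long countLines(const char* filename)
{
    std::ifstream file(filename);
    std::string line;
    long count = 0;
    while (std::getline(file, line))
        ++count;
    return count;
}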
The problem is that you're not using a vector for the matrix itself.
This line is not even valid C++, as array dimensions must be compile-time constants:
double matrix[n][ncol];
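
As a minimal sketch of the usual fix (reusing the OP's n and ncol), std::vector puts the storage on the heap, so the stack limit no longer matters:

#include <vector>

// Heap-allocated replacement for "double matrix[n][ncol];" --
// a vector's elements live on the heap, not the stack.
std::vector<std::vector<double>> makeMatrix(int n, int ncol)
{
    return std::vector<std::vector<double>>(n, std::vector<double>(ncol, 0.0));
}

// usage:  std::vector<std::vector<double>> matrix = makeMatrix(n, ncol);
//         then matrix[i][j] works as before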
If you have that much data and store it in an array or vector, even if your program works, it will take forever to do anything because those data structures are too simple for what you need.

For large amounts of data you need a tree structure that works in O(log n) time.

I am not sure what data structure multimap uses internally, but the documentation lists the big-O times for each function, such as insert, find, etc.

To use the map object you need to have a key or identifier to work with. If your input file is, say, coordinates with no point numbers, then that might not work very well directly - you would have to use code to provide the key.

Good luck
Hi people,

Many thanks for sharing your thoughts, but since I am new to this I believe I need to study more to understand what you are actually suggesting.

:-/

I still did not manage to make my code run. If (when) I find the solution, I will come back and post it here. Thanks anyway.
I finally solved the problem like this:

#include<iostream>
#include<vector>
#include<string>
#include<fstream>

using namespace std;

/************************************************************/
int main(){

  ifstream file;
  file.open("myfile");  //myfile contains ~500,000 lines x 5
                        //columns. My data are basically float numbers

  vector<double> myAvector;
  vector<double> myBvector;
  vector<double> myCvector;
  vector<double> myDvector;
  vector<double> myEvector;

  double a, b, c, d, e; // variables used to read the lines

  while(! file.eof()){ 
    file >> a >> b >> c >> d >> e;  
    myAvector.push_back( a ); 
    myBvector.push_back( b ); 
    myCvector.push_back( c ); 
    myDvector.push_back( d );
    myEvector.push_back( e ); // push_back adds a new element at the
			      // end of the vector			
  }

  //for accessing elements
  for (long int i=0; i < myAvector.size(); i++){
    cout <<  myAvector[i] << " " <<  myEvector[i] << endl;
  }

  file.close();

  return 0;

}


I don't know if it is clever enough, but as I said, I am just starting to learn C++.
coolgalaxy wrote:
while(! file.eof()){ 
    file >> a >> b >> c >> d >> e; 

Whatever source taught you this, throw it away before you learn more like this.

The correct loop is
while(file >> a >> b >> c >> d >> e){ 
     

although any other loop that checks the I/O errors *after* the attempted I/O would also be correct.

(your first post had a good loop, though: while( getline( file, line ) ) )
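
Applied to the code above, a minimal sketch of the whole reading loop would be (same vector names as the OP's post):

#include <fstream>
#include <vector>

int main()
{
    std::ifstream file("myfile");
    std::vector<double> myAvector, myBvector, myCvector, myDvector, myEvector;

    double a, b, c, d, e;
    // the extraction itself is the loop condition, so the body only runs
    // when all five values were actually read
    while (file >> a >> b >> c >> d >> e) {
        myAvector.push_back(a);
        myBvector.push_back(b);
        myCvector.push_back(c);
        myDvector.push_back(d);
        myEvector.push_back(e);
    }

    return 0;
}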
Is your data spatial data, i.e. coordinates?

If it is, then find out about R-trees, which are a rectangles-within-rectangles solution.

Because of your volume of data, you are going to need some sort of tree ADT, otherwise your program might sit there for an hour and then run out of memory.

I am thinking about doing a project that deals with LIDAR info, with data sets that could be 5 million 3D points.

I am interested to hear about your data.

With your code above, why do you have 5 vectors? What you are doing is creating a vector for each column of data.

You could have 1 vector with 5 doubles in it, which would be 1 row of info. This vector could then go into whatever data structure you decide to use.
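
For instance, just as a sketch (the struct and field names are only placeholders), one element per row could look like:

#include <vector>

// one element of this vector holds all five values of one line
struct Row {
    double x, y, brightness, mass, size;
};

std::vector<Row> rows;  // rows[i] is line i of the file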

Of course you might have a valid reason for doing it this way.

Cheers

Hey Cubbi...

Do you mind telling me why the loop

while(! file.eof()){ 
    file >> a >> b >> c >> d >> e; 


is bad?

Thank you.
Hi TheIdeasMan!

TheIdeasMan wrote:

I am interested to hear about your data.


I am studying astronomy and I have a catalog which contains information on several thousand galaxies. Each row contains data for one specific galaxy. For instance, coordinates (x, y), the brightness, the mass, the size, etc. Basically, only columns of float numbers.

TheIdeasMan wrote:

With your code above, why do you have 5 vectors? What you are doing is creating a vector for each column of data.


Yes. Is this bad? Sorry, I am still learning...

The program I wrote here is simplified as much as possible; this is not the actual code I will use. Here I focused on the problem of opening the file and storing my numbers.

TheIdeasMan wrote:

You could have 1 vector with 5 doubles in it, which would be 1 row of info. This vector could then go into whatever data structure you decide to use.


I am studying this now. When I manage to create my new code with that, I will come back to post it here. It might take some days though...
Do you mind telling me why the loop
while(! file.eof()){ 
file >> a >> b >> c >> d >> e;  
    myAvector.push_back( a ); 
is bad?

It erroneously checks the end-of-file condition before attempting the input, which means, in your case, that it will read past the end of file and push_back five bogus values. It will also loop forever if the file format is incorrect, e.g. if it contains a letter. More importantly, it's a sure sign of a bad textbook/tutorial.

And yes, it's better to have a single container holding objects of some type that holds all your data, rather than managing five parallel arrays. Also, think about how the container will be used. If you're going to look up values based on coordinates, perhaps a map that uses the set of coordinates as the key and the rest of values as the value is best.
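
As a sketch of that last idea (the type and field names here are only placeholders):

#include <map>
#include <utility>

// the values that are not part of the key
struct Properties {
    double brightness, mass, size;
};

// keyed on the (x, y) coordinate pair; std::map keeps keys ordered
// and gives O(log n) lookup by coordinates
std::map<std::pair<double, double>, Properties> galaxies;

// insertion: galaxies[{x, y}] = {brightness, mass, size};
// lookup:    auto it = galaxies.find({x, y});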
OK, I would recommend you use double rather than float, because it has about 15-16 significant digits of precision, whereas float only has about 7, which probably won't be enough for your app.

Also, when testing use a small dataset, like 1,000 lines say. Get everything working with that, then test on the full set. Remember that with the full data you are going to have to have some sort of tree structure to store it all in, otherwise (as I said) your computer might sit there for an hour then run out of memory.

With that much data, it is not a trivial task that you are doing.

For now (just to have a go), you could have several scenarios. (This is just for the small dataset; the full one will be completely different.)

The first one is this: (it's a C approach)

Put the 5 values into a struct:

struct Galaxy {
    double dXOrdinate = 0.0;   //put description here, units etc
    double dYOrdinate = 0.0;   //put description here, units etc
    double dBrightness = 0.0;  //put description here, units etc
    double dMass = 0.0;        //put description here, units etc
    double dSize = 0.0;        //put description here, units etc
};

typedef struct Galaxy TheGalaxy;  //now we can use TheGalaxy as a type instead of struct Galaxy everywhere

int main()
{
    TheGalaxy AGalaxy;  //creates a galaxy object named AGalaxy

    //this shows how to deal with one struct object
    AGalaxy.dXOrdinate = 100.0;  //initialisation
    AGalaxy.dYOrdinate = 500.0;
    AGalaxy.dBrightness = 2.0;
    AGalaxy.dMass = 600.0;
    AGalaxy.dSize = 200.0;

    TheGalaxy GalaxyArray[1000];  //an array of galaxy structs, starts at index 0

    //Now some pseudo code
    double Field1 = 0.0;
    double Field2 = 0.0;
    double Field3 = 0.0;
    double Field4 = 0.0;
    double Field5 = 0.0;

    // open the file with galaxy info in it
    int counter = 0;
    //while not end of file {
    for (counter = 0; counter < 1000; counter++) {  //does 1000 lines worth
        //read 1 line worth of info
        //extract the 5 fields and put them into the struct
        GalaxyArray[counter].dXOrdinate = Field1;
        GalaxyArray[counter].dYOrdinate = Field2;  //similar for the other 3
    }

    return 0;
}


Benefits of the C approach:

Performance & memory management will be a big thing here; there is less overhead with these C data structures compared with C++ containers such as vector and list.



Now the C++ approach

Have a vector of the 5 doubles, then make a <list> of these.
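
Taken literally, that would be something like this sketch:

#include <list>
#include <vector>

// each inner vector is one row of five values; the list holds all rows
std::list<std::vector<double>> rows;

// e.g. rows.push_back({x, y, brightness, mass, size});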


Now some other observations:

Galaxies are in 3D, so do we need a ZOrdinate? Do we need a name for each one? I think so.
It could also be worth having a distance to the galaxy; see below as to why.

Now to the tree data structure:

Binary search trees provide quick performance, because you can halve the data you need to search in each iteration. Google binary search trees (BSTs).

In a BST the left subtree holds values smaller than the node and the right subtree holds larger ones, so each comparison cuts the remaining data to search through in half. This gives O(log n) efficiency.

To do searching, you need something to search by, so this is why I think we need the distance field. The name is no good and neither are any of the others.

We can still keep our Galaxy struct, we just need to decide what Abstract Data Type to put it in.

We could program our own BST in C, but that would be reasonably involved.

In C++, I am not sure what data structure the STL containers like map and multimap use internally (typically a balanced tree), but they are supposed to be efficient. Multimaps allow non-unique keys, which could be handy if two distances happened to be the same.

The trouble is, there might still be too much data even if it is in a BST.
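
Just to show the shape of it, a sketch keyed on distance (the extra name and distance fields are assumptions on my part):

#include <map>
#include <string>

struct GalaxyRec {
    std::string name;
    double x, y, brightness, mass, size;
    double distance;   // the field we search by
};

// multimap tolerates two galaxies at exactly the same distance
std::multimap<double, GalaxyRec> byDistance;

// insert:  byDistance.insert({g.distance, g});
// lookup:  auto it = byDistance.lower_bound(someDistance);   // O(log n)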

Let me explain what I might do with my LIDAR data.

I have 3.4 million 3D points which come from an airborne scanner. They are spread over 70 sq km (7 km by 10 km).

My proposal is to have map sheets which are 16km by 16km. These are made up of blocks which are 4km by 4km.

The blocks are made up of grid squares 1km by 1km. These are made up of tiles which are 250m sq.

All these are stored in a tree structure; it's not a BST because there are 4 child objects at each level (essentially a quadtree).

The tile objects will store their points in a BST. The idea is that each tile will have its points in an OS file; the files are only loaded if needed and are removed from the tree when they aren't.

So maybe you could do something similar with your galaxy info. It's fairly involved, but as I said your problem is not a trivial one with that much data. The difference between your problem and mine is that yours is truly 3D, whereas mine has 2D rectangles that hold 3D info.

Good luck again

You don't want to store the data of large files. You want to store locations in the file that hold the data you need.
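
In other words, something like this sketch, which remembers stream offsets instead of the data itself:

#include <fstream>
#include <string>
#include <vector>

// remember where each line starts; seekg() can jump straight back to it later
std::vector<std::streampos> lineStarts;

void indexFile(std::ifstream& file)
{
    std::string line;
    while (true) {
        std::streampos pos = file.tellg();
        if (!std::getline(file, line))
            break;
        lineStarts.push_back(pos);
    }
}

// later:  file.clear();  file.seekg(lineStarts[i]);  // then re-read line i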
I have had some more thoughts on this.

The easiest way would be to use a relational database with GIS extensions, such as PostgreSQL with PostGIS. This software uses R-trees & all kinds of other fancy stuff internally, so all one needs to do is load it all in, & Bob's your Uncle!!

However that's perhaps not so good for your assignment - so I was thinking it could go something like this design summary:

Divide your spatial data into tiles of a specific size. You have 500,000 points, so you could have a 10 x 10 grid of tiles, giving about 5,000 points per tile - assuming fairly constant density. From my vague knowledge of astronomy, I am guessing that's true. Btw, a good thing about R-trees is that they are better for discrete populations.

Have a <vector> or a struct that stores info about each tile: a TileID (unsigned short, 0 to 99), the bounding coords, the file where the points are, and any other stuff like the number of points in the tile. Deciding between a vector and a struct depends on whether you need the functions provided by <vector>. The tile info is just a record, so you could just use a struct.

Create a <list> of the structs, populate with TileId, bounding coords, and any other info that is known before dealing with the points.

If you are just going to store the galaxy name and TileId, you could get away with storing all 500,000 in a binary search tree. I am hoping a <map> will be efficient enough. This is going to act as an index.

Next is some logic to decide which tile a point goes in - easy, because you know the origin coords of the bounding square that the tiles are in.
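
That tile record and the which-tile logic might look roughly like this (a sketch; the 10 x 10 grid and the field names come from the assumptions above):

#include <string>

// one record per tile (placeholder field names)
struct Tile {
    unsigned short id;              // 0..99 for a 10 x 10 grid
    double minX, minY, maxX, maxY;  // bounding coords
    std::string fileName;           // file holding this tile's points
    int pointCount;
};

// which tile does point (x, y) fall in, given the origin and the overall
// width of the bounding square that the tiles divide up?
unsigned short tileIdFor(double x, double y,
                         double originX, double originY, double extent)
{
    const int tilesPerSide = 10;
    int col = static_cast<int>((x - originX) / extent * tilesPerSide);
    int row = static_cast<int>((y - originY) / extent * tilesPerSide);
    // points exactly on the far edge would need clamping to 9
    return static_cast<unsigned short>(row * tilesPerSide + col);
}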

Now read in the points, write them to the appropriate file, update the Index BST, update the info in the TileList.

For functions like find, you need another BST (or a <map>, which will be easier) to hold the 5,000 points from the file (FindBST or FindMap).
Search the index BST for the name of the galaxy; this gives you the TileId & FileName. Load the file into your FindMap, then use find to return the point in question.

You could have several temporary <map> objects if required, but you should manage the memory (get rid of them if they aren't necessary.)

Hope this all helps.

LowestOne wrote:

You don't want to store the data of large files. You want to store locations in the file that hold the data you need.


Yes exactly. Thanks for helping me with the vocabulary.
Hey TheIdeasMan! Thank you! You provided me with great inputs. I will come back soon to comment on them... 'cause now I am quite busy with final exams.
No worries coolgalaxy - Good Luck !!
There are two considerations, I think. One is searching by point name, the other is displaying the points graphically. For the display function it is a very good idea to keep the Tile model, because it is better to display a number of small chunks, rather than big blocks of data.

I was thinking if there is going to be a memory / performance issue to do with the size of a data structure like a BST, then maybe it's worth considering the following:

Imagine you had to store the names & phone numbers for 5 million people !!

One way would be to have a BST for each letter of the alphabet. Then you have code to determine which letter the person's surname starts with, and store their name & number in that particular BST. You could call this a master index.

This way you have divided the problem by 26 straight up.
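
A sketch of that idea (using std::map as the BST, with names and numbers as plain strings):

#include <map>
#include <string>

// 26 separate ordered maps, one per starting letter of the surname;
// each lookup only searches roughly 1/26th of the data
std::map<std::string, std::string> phoneIndex[26];

void addEntry(const std::string& surname, const std::string& number)
{
    int bucket = surname[0] - 'A';   // assumes surnames start with A..Z
    phoneIndex[bucket][surname] = number;
}

// lookup: phoneIndex[surname[0] - 'A'].find(surname)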

I was thinking you could do something similar with the galaxy names index. There could be a problem with distribution of names, so you split the sorted names list up arbitrarily and keep track of the begin & end name for each index.

You could store this info in a master index BST that stored, say, 100 name ranges. The name ranges are in a BST because this is more efficient than a list. This would allow reading from a file (5,000 points, say) into a temporary BST that stored ALL the info about the point.

This method gets around the maximum size limit for a data structure stored in memory. However you still have to manage the memory a bit by not having too many of the 5,000-point BSTs in memory at once.

So, there is a need to have two lots of 5,000-point files (200 files total): one for sorted point names, the other for points in a tile. The files could be made smaller by storing them in binary format; a double is 64 bits (with a 53-bit significand).

From the Tile info file point of view, you could have many more tiles than 100, because the number of Tile files is independent of the number of name ranges. This is because the TileId is stored with each point.

I was thinking I would have to do something similar if I were to do the LiDAR project.

Anyway, I wish you more good luck with your exams !! :D
Topic archived. No new replies allowed.