Address boundary error while reading a file

Hi everyone, I hit a bug while reading a file.
I want to read a file that has more than (1 << 17) lines.
The file looks something like this:
1|2|3|4|....
1|2|3|4|....
1|2|3|4|....
1|2|3|4|....
From each line I only want the first 4 columns, so I read SIZE lines and save the first 4 columns into 4 vectors, which are contained in a vector of vectors.

The problem is that when I use SIZE = 1 << 16, it works.
But when SIZE becomes 1 << 17 or greater, I get an Address boundary error, which I am 100% sure is caused by a vector: Segmentation fault (core dumped)
Can you find the error?



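#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

const char * const TPCH_LINEITEM_PATH = "Desktop/tpch-dbgen/lineitem.tbl";
constexpr uint64_t SIZE = 1 << 17;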

// I checked this function; it is not buggy
std::vector<std::string> split(std::string strToSplit, char delimeter) {
  std::stringstream ss(strToSplit);
  std::string item;
  std::vector<std::string> splittedStrings;
  while (std::getline(ss, item, delimeter)) {
    splittedStrings.push_back(item);
  }
  return splittedStrings;
}

int main(int argc, char **argv) {
  std::vector<std::vector<uint64_t>> table_portion(4);
  auto &table0 = table_portion[0];
  table0.reserve(SIZE);
  auto &table1 = table_portion[1];
  table1.reserve(SIZE);
  auto &table2 = table_portion[2];
  table2.reserve(SIZE);
  auto &table3 = table_portion[3];
  table3.reserve(SIZE);
  std::ifstream in(TPCH_LINEITEM_PATH);
  if (in.is_open()) {
    std::string lineFromText;
    uint64_t i = 0;
    while (i < SIZE && std::getline(in, lineFromText)) {
      std::cout << "i " << i << "  ";
      auto spilte_vec = split(lineFromText, '|');
      table_portion[0].push_back((uint64_t)std::stol(spilte_vec[0])); std::cout << spilte_vec[0] << "  ";
      table_portion[1].push_back((uint64_t)std::stol(spilte_vec[1])); std::cout << spilte_vec[1] << "  ";
      table_portion[2].push_back((uint64_t)std::stol(spilte_vec[2])); std::cout << spilte_vec[2] << "  ";
      table_portion[3].push_back((uint64_t)std::stol(spilte_vec[3])); std::cout << spilte_vec[3] << "  " << std::endl;
      i++;
    }
    in.close();
  } else {
    std::cout << "open error" << std::endl;
    exit(1);
  }
//.... 
This file has more than 1 << 17 lines, even more than 1 << 20. So the SIZE I chose is within the range of the file.
Do you have enough free, contiguous RAM to store that many push_backs?
I suspect it's not the file, but that your vector has run out of room.
To test this, eliminate the file entirely and push back that many records (just push back the same hard-coded test record that many times) to see if the vector is actually the issue.

Remember that vectors require their memory to be one solid block, like arrays.
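
A minimal sketch of that test, assuming the same four-column layout and SIZE as the original post. The helper name fill_table and the record values are made up for illustration:

#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical helper: fills a 4-column table with `rows` copies of one
// hard-coded record, with no file I/O involved.
std::vector<std::vector<std::uint64_t>> fill_table(std::uint64_t rows) {
    std::vector<std::vector<std::uint64_t>> table(4);
    for (auto &column : table) {
        column.reserve(rows);           // same up-front reserve as the original
    }
    for (std::uint64_t i = 0; i < rows; ++i) {
        table[0].push_back(1);          // arbitrary test values
        table[1].push_back(2);
        table[2].push_back(3);
        table[3].push_back(4);
    }
    return table;
}

int main() {
    constexpr std::uint64_t SIZE = 1 << 17;
    const auto table = fill_table(SIZE);
    assert(table[0].size() == SIZE);    // every column received SIZE entries
    std::cout << "pushed " << table[0].size() << " rows per column\n";
}

If this runs cleanly, the vectors themselves are probably not the problem, and the parsing of the file lines is the next place to look.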
Hmm. REBOOT
Thanks.
This file has more than 1 << 17 lines, even more than 1 << 20. So the SIZE I chose is within the range of the file.

That statement makes no sense. If the file has more than 1 << 20 (1,048,576) lines, that's not within the range of SIZE. That really doesn't make any difference, though, since std::vector will keep growing each vector if you push more than 1 << 17 (131,072) rows.

I hope you realize you're trying to reserve about 4MB of memory (131,072 * 4 * 8 bytes).
And if you're trying to read 1,000,000 rows, you're going to need about 32MB of memory.

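EDIT: Keep in mind that each std::vector object also carries a fixed overhead (typically three pointers, i.e. 24 bytes on a 64-bit platform), which I didn't include in my calculations above. With only four inner vectors that is negligible. It would only become significant with one inner vector per row, where 1,000,000 rows would add roughly 24MB of bookkeeping on top of the data.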
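
A quick way to check figures like these is to let the compiler do the arithmetic. This is only a sketch, assuming four uint64_t columns. The helper name payload_bytes is hypothetical, and the sizeof of a vector object is implementation-dependent:

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical helper: bytes of element storage for `rows` rows of `columns`
// uint64_t values. Allocator and vector bookkeeping are not counted.
constexpr std::uint64_t payload_bytes(std::uint64_t rows, std::uint64_t columns) {
    return rows * columns * sizeof(std::uint64_t);
}

int main() {
    constexpr std::uint64_t SIZE = 1 << 17;   // 131,072 rows
    std::cout << "SIZE rows:      " << payload_bytes(SIZE, 4) << " bytes\n";
    std::cout << "1,000,000 rows: " << payload_bytes(1000000, 4) << " bytes\n";
    // Fixed per-object overhead of one vector on this implementation.
    std::cout << "sizeof(vector): " << sizeof(std::vector<std::uint64_t>) << " bytes\n";
}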
Hmm, yes. I should run as few programs as possible to save RAM.
Please post a small sample of your input file.

Here's a cleaned-up version of your program that uses one struct per row instead of a nested vector, so the whole table lives in a single allocation. It also avoids parsing more columns than necessary when reading each row.

struct Row
{   int64_t num[4];
};

Row get_row (const std::string & line)
{   std::stringstream ss(line);
    std::string item;
    Row row{};                          // zero-initialize all four columns
    for (int i = 0; i < 4; i++)         // parse only the first four columns
    {   std::getline(ss, item, '|');
        row.num[i] = std::stoll(item);  // throws if the field is not a number
    }
    return row;
}

int main(int argc, char **argv) {
    std::vector<Row> table;
    table.reserve(SIZE);                // reserve capacity only; push_back fills it
    std::ifstream in(TPCH_LINEITEM_PATH);
    if (in.is_open()) {
        std::string line;
        uint64_t i = 0;
        while (i < SIZE && std::getline(in, line))
        {
            Row row = get_row(line);
            std::cout << "i " << i << "  ";
            table.push_back(row);
            std::cout << row.num[0] << "  ";
            std::cout << row.num[1] << "  ";
            std::cout << row.num[2] << "  ";
            std::cout << row.num[3] << "  " << std::endl;
            i++;
        }
        in.close();
    }
    else {
        std::cout << "open error" << std::endl;
        exit(1);
    }
}


Assuming 1 << 20 (1,048,576) rows, this will still require about 33.5 MB (1,048,576 * 32 bytes).
Sorry for being late
Data: TPC-H Lineitem
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|

I take all of the columns except the 3 strings at the end, and convert all the numbers to uint64_t.
Hope this helps!
Then why bother creating strings in your subprogram? Just read the entire line with getline(), then use a stringstream to parse the line directly into the proper type of variable.

Something like the following (not tested):
std::vector<uint64_t> get_values(const std::string &strToSplit, int num_items)
{
  std::stringstream ss(strToSplit);
  std::vector<uint64_t> items;
  uint64_t value;
  char delimiter;
  for(int counter = 0; counter < num_items; ++counter)
  {
       if(ss >> value >> delimiter)  // Make sure conversion successful.
          items.push_back(value);  // Using push_back() so it is easy to tell if all values were successfully read.
  }
  return items;
}


Note the calling function should check if the returned vector has the proper number of values, if not then the line had some bad data.
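
A sketch of what that check could look like in the caller. get_values() is repeated here only so the example compiles on its own. The wrapper parse_line and the constant kColumns are hypothetical names:

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Same idea as get_values() above, repeated so this example is self-contained.
std::vector<std::uint64_t> get_values(const std::string &line, int num_items) {
    std::stringstream ss(line);
    std::vector<std::uint64_t> items;
    std::uint64_t value;
    char delimiter;
    for (int counter = 0; counter < num_items; ++counter) {
        if (ss >> value >> delimiter)   // stops succeeding after the first bad field
            items.push_back(value);
    }
    return items;
}

// Hypothetical wrapper: true only if the line yielded all expected columns.
bool parse_line(const std::string &line, std::vector<std::uint64_t> &out) {
    constexpr int kColumns = 4;
    out = get_values(line, kColumns);
    return out.size() == static_cast<std::size_t>(kColumns);
}

int main() {
    std::vector<std::uint64_t> row;
    const bool good = parse_line("1|155190|7706|1|17|21168.23|", row);
    assert(good && row.size() == 4);
    std::cout << "second column: " << row[1] << '\n';
    if (!parse_line("1|oops|7706|1|", row))   // non-numeric second field
        std::cerr << "bad line skipped\n";
}

With this in place a short or malformed line is rejected instead of being indexed past its end.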

Also, unless you're happy with the program "crashing" when a call to stoX() (in your original code) fails, you should consider using a try/catch block to handle the failure.
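
One possible shape for that, as a sketch only. The helper to_u64 is a hypothetical name, and it uses std::stoull rather than the std::stol from the original code because the target type is uint64_t:

#include <cassert>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts one field and reports failure to the caller
// instead of letting the exception escape and terminate the program.
bool to_u64(const std::string &field, std::uint64_t &out) {
    try {
        out = std::stoull(field);
        return true;
    } catch (const std::invalid_argument &) {
        return false;   // no conversion possible, e.g. "" or "abc"
    } catch (const std::out_of_range &) {
        return false;   // value does not fit in unsigned long long
    }
}

int main() {
    std::uint64_t value = 0;
    const bool ok = to_u64("155190", value);
    assert(ok && value == 155190);
    if (!to_u64("abc", value))
        std::cerr << "could not convert field\n";
}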

Thanks all.
I'm not sure; it may not be a bug after all.
But it is not good systems-programming practice to have so many vector copies.
What I do now is keep only one vector, resize (or reserve) it at the very beginning, and fill it through references.


  vector<vector<vector<uint64_t>>> table_portions(NUM_DB, vector<vector<uint64_t>>(COLUMNS));
  for (auto &vec : table_portions) {
    for (auto &v : vec) {
      v.resize(imlab::TUPLES_PER_DATABLOCK);
    }
  }
  getTablePortions(table_portions);

void getTablePortions(vector<vector<vector<uint64_t>>> &table_portions) {
  std::ifstream in(TPCH_LINEITEM_PATH);
  if (in.is_open()) {
    std::string lineFromText;
    uint64_t i = 0;  // lines read
    size_t current_num_db = 0;
    size_t vec_offset = 0; 
    while (i < SIZE && std::getline(in, lineFromText)) {
      auto spilte_vec = split(lineFromText, '|');
      auto &table_part = table_portions[current_num_db];
      auto &table_part_0 = table_part[0];
      auto &table_part_1 = table_part[1];
      auto &table_part_2 = table_part[2];
      auto &table_part_3 = table_part[3];
      auto &table_part_4 = table_part[4];
      auto &table_part_5 = table_part[5];
      auto &table_part_6 = table_part[6];
      auto &table_part_7 = table_part[7];
      auto &table_part_8 = table_part[8];
      auto &table_part_9 = table_part[9];
      auto &table_part_10 = table_part[10];
      auto &table_part_11 = table_part[11];
      auto &table_part_12 = table_part[12];

      table_part_0[vec_offset] = ((uint64_t)std::stol(spilte_vec[0]));  // int
      table_part_1[vec_offset] = ((uint64_t)std::stol(spilte_vec[1]));  // int
      table_part_2[vec_offset] = ((uint64_t)std::stol(spilte_vec[2]));  // int
      table_part_3[vec_offset] = ((uint64_t)std::stol(spilte_vec[3]));  // int
      table_part_4[vec_offset] = ((uint64_t)(std::stof(spilte_vec[4]) * 100));  // float with 2 point number
      table_part_5[vec_offset] = ((uint64_t)(std::stof(spilte_vec[5]) * 100));  // float with 2 point number
      table_part_6[vec_offset] = ((uint64_t)(std::stof(spilte_vec[6]) * 100));  // float with 2 point number
      table_part_7[vec_offset] = ((uint64_t)(std::stof(spilte_vec[7]) * 100));  // float with 2 point number
      table_part_8[vec_offset] = ((uint64_t)spilte_vec[8].at(0));  // char
      table_part_9[vec_offset] = ((uint64_t)spilte_vec[9].at(0));  // char
      table_part_10[vec_offset] = ((uint64_t)parseDate(spilte_vec[10]));  // date
      table_part_11[vec_offset] = ((uint64_t)parseDate(spilte_vec[11]));  // date
      table_part_12[vec_offset] = ((uint64_t)parseDate(spilte_vec[12]));  // date
//...
}