Overflow-Error

I've a text of about 5000 words and want to count words and word lengths looking only at alphabetic characters.
I tried to look at each char per string deciding if its ascii-code was in the alphabetic range, which worked on a small example but threw me an "Segmentation fault: 11" - error when running the compiled program. It seems to be a stack-overflow. Now I tried it using the following code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
  
#include <iostream>
#include <cmath>
#include <string>
#include <fstream>
#include <vector>

vector <int> ascii;
for (int i = 65; i < 91; i++) ascii.push_back(i);
for (int i = 97; i < 123; i++) ascii.push_back(i);

while(istr >> word)
{
//Eliminieren von Zeichen anders, als Buchstaben				
  for (int i = 0; i < word.length(); i++)
	{
	 //Ist der ASCII-Code dieses Zeichens jener eines Buchstaben?
	 if(find(ascii.begin(), ascii.end(), (int) word[i]) == ascii.end()) word.erase(i,1);

	}
...


That results in the same overflow. Is there a more efficient way?
Thanks a lot!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#include <string>
#include <cctype>
#include <fstream>
#include <vector>

std::string strip_non_alpha( const std::string& str ) // reduce to only alphabets
{
    std::string result ;
    for( char c : str ) if( std::isalpha(c) ) result += c ;
    return result ;
}

std::vector<std::string> get_words_alpha( std::istream& stm )
{
    std::vector<std::string> result ;

    std::string str ;
    while( stm >> str ) result.push_back( strip_non_alpha(str) ) ;

    return result ;
}
worked on a small example but threw me an "Segmentation fault: 11" - error when running the compiled program

The code that you posted looks okay, but I see that it isn't the entire program. I suspect that the crash came somewhere else in the code. Can you post the whole thing?
Is there a more efficient way?

There are several things that could improve your code:
1
2
for (int i = 65; i < 91; i++) ascii.push_back(i);
for (int i = 97; i < 123; i++) ascii.push_back(i);

While this works, it's hard to understand. What are the magic numbers 65, 91, 97 and 123? The code would be much clearer if you wrote it like this:
1
2
for (int i = 'A'; i <= 'Z'; i++) ascii.push_back(i);
for (int i = 'a'; i < 'z'; i++) ascii.push_back(i);


Also consider this:
find(ascii.begin(), ascii.end(), (int) word[i])
To execute this code, the program does:
1
2
3
4
5
6
7
8
9
Is word[i] equal to 'A'?  No?  Hmm.
Okay, is word[i] equal to 'B' then?  No....
What about 'C'?  Is it 'C'?  Dang.
It's gotta be 'D'.  Is it 'D'?  Rats!
...
Sigh.... okay then 'Z'. It must be 'Z' right?  Oh boy.  This is getting old.....
Alright then, maybe it's 'a'.  Is it 'a'?
...
etc. etc. until it find the letter


A more efficient way to determine if a character is a letter is to check if it's in the range, sort of like the code that populates the ascii table:
if (word[i] >= 'A' and word[i] <= 'Z' or word[i] >= 'a' && word[i] <= 'z') ...

Since the set of all ascii characters is small, an even faster way would be to create a vector<bool> that is indexed by the ASCII value of a character, and returns whether the character is a letter. Then you could do something like this:
if (isletter[word[i]]) ....
It turns out that such a thing already exists in the C library. It's called isalpha():
1
2
3
#include <cctype>
...
if (isalpha(static_cast<unsigned char>(word[i])) ...

You are probably asking what's with that cast to unsigned char? Isalpha()'s parameter is an ASCII value expressed as an integer. But the type char can be signed. Meaning that values might be in the range -128 to +127. You want to be sure that you pass a value in the range 0 to 255, so you need to cast the char to an unsigned char.

There's also a bug in the way that you're using word[i].erase(). Consider when word contains the string "?!ABC". The first time through the loop you find that '?' isn't a letter, so you execute erase, changing word[i] to "!ABC". So far so good, but then you increment i to 1. So the next time though the loop you check word[1], which is now 'A'. As a result, you've mistakenly left in a non-alpha character.

There's also efficiency to consider. If word[i] is "!@#$%^ABCDE" then the that first call to erase() must copy "@#$%^ABCDE" (10 characters) over to the left. The next time (assuming that you first fix the above bug), it would copy "#$%^ABCDE" (9 characters) over. Then 7, then 6 and finally 5. That's a lot of copying. It might be faster to build up a second string from scratch using only the alpha characters. This method is easy to understand it avoids the above bug:
1
2
3
4
5
6
        string word;
        for (auto c : str) {
            if (isalpha(static_cast<unsigned char>(c))) {
                word += c;
            }
        }


This program puts it all together:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <iostream>
#include <string>
#include <cctype>

using std::cin;
using std::cout;
using std::string;

int
main()
{
    string str, word;
    while (cin >> str) {
        string word;
        for (auto c : str) {
            if (isalpha(static_cast<unsigned char>(c))) {
                word += c;
            }
        }
        cout << str << " shrinks to " << word << '\n';
    }
}

To be pedantically correct, use std::isspace in <locale>, with the locale associated with the input stream.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <iostream>
#include <string>
#include <locale>
#include <vector>

std::string strip_non_alpha( const std::string& str, const std::locale& loc = {} ) // reduce to only alphabets
{
    std::string result ;
    for( char c : str ) if( std::isalpha( c, loc ) ) result += c ;
    return result ;
}

std::vector<std::string> get_words_alpha( std::istream& stm )
{
    std::vector<std::string> result ;
    const std::locale& loc = stm.getloc() ;

    std::string str ;
    while( stm >> str ) result.push_back( strip_non_alpha( str, loc ) ) ;

    return result ;
}


If the classic locale ("C" locale) must be used, and we need to take care of characters that are not limited to the ones in the basic execution character set, the usual (and more readable) version is:
for( unsigned char c : str ) if( std::isalpha(c) ) result += c ;
I thank you all for the well educated answers.
Now that the problem is solved I wanted to present
the solution to you. First: The error "Segmentation fault: 11"
does not mean a stack overflow but the try by the program
to write where it is not allowed to.

Specifically, due to the non-alphabet-eliminator code,
word-strings with length 0 came to exist, then
the programm wanted to count the word by incrementing
the (length-1)th entry of my vector (line 48 in the code). -1th entry does not
exist et voila.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
#include <iostream>
#include <cmath>
#include <string>
#include <fstream>
#include <vector>
#include <cctype>

using namespace std;

int main()
{
string tmp;
vector <int> wdct;
int n = 0, z = 0;
char e = 'e';

ifstream istr("05_Englisch.txt");

//Alibi-Kontrolle
if (!istr)
{
	cout << "Konnte 05_Englisch.txt nicht öffnen!" << endl;
	return 99;
}

while(istr >> tmp)
{ string word = "";				
 	for (unsigned char i : tmp)
 {
		if (isalpha(i)) 
	{
     word += i;
    }	
 }
//string word = tmp;

	if(word.length() == 0) continue;
  if (word.length() > wdct.size())
	{
		for(unsigned int i = 0; i <= (word.length() - wdct.size()); i++)
		{
			wdct.push_back(0);
		}
	}


//here lies the mistake without the continue command in the previous loop
	n = word.length() - 1;
	wdct[n]++;

	//Zählen der e's	
 for (int i = 0; i < word.length(); i++)
	{
		if(word[i] == e) z++;			
	}
 word = "";
}

double sum = 0;

for (unsigned int i = 0; i < wdct.size(); i++) sum += wdct[i];

vector <double> rel(wdct.size(), 0);

for (unsigned int i = 0; i < wdct.size(); i++) rel[i] = wdct[i] / sum;
cout << "Es sind " << sum << " Wörter im Text." << endl;

ofstream outf("wortzahl.dat");

//Alibi-Kontrolle (again)
if(!outf)
{
 cout << "Konnte wortzahl.dat nicht öffnen!" << endl;
 return 99;
}

for (unsigned int i = 0; i < wdct.size(); i++)
{
	outf << i + 1 << "\t" << wdct[i] << "\t" << rel[i] << endl;
}
	outf << "Es kommen " << z << " 'e' vor." << endl;

	return 0;
}


Thanks to everyone who answered me. It was really helpful.
Last edited on
Topic archived. No new replies allowed.