Letter Occurrences

https://en.wikipedia.org/wiki/Letter_frequency

Hello. I am working on a simple program that counts letter occurrences in a text (see the link above). I use the STL to populate maps and to sort the data at the end. The last part scans the maps to find the most and least frequent letters. However, I would like to simplify this last part with simple templates (generic functions). I know how to compare data using templates, but I have no clue how to return the letter that goes with each integer, that is, the letters with the highest and lowest counts. Do you have an idea? Thank you for your help. I wish you the best ++

#include <iostream>
#include <map>
#include <fstream>

using namespace std;

ifstream inputFile("main.cpp");
string input((istreambuf_iterator<char>(inputFile)), istreambuf_iterator<char>());
// templates which could return biggest and smallest integer
template <typename T>
T& myMax(const T& x, const T& y)
{
    return const_cast<T&>(x > y ? x : y);
}

template <typename T>
T& myMin(const T& x, const T& y)
{
    return const_cast<T&>(x < y ? x : y);
}

int main()
{
    map<char, int> occurrences;
    map<int, char> flipped;

    for (auto& c : input) c = (char)toupper(c);
    input.erase(remove_if(input.begin(), input.end(), [](char c) { return !isalpha(c); }), input.end());
    
    for (string::iterator character = input.begin(); character != input.end(); character++)
    {
        occurrences[*character] += 1;
    }

    for (map<char, int>::iterator entry = occurrences.begin(); entry != occurrences.end(); entry++)
    {
        cout << entry->first << " = " << entry->second << endl;
    }

    cout << endl;

    for (auto i = occurrences.begin(); i != occurrences.end(); ++i)
        flipped[i->second] = i->first;

    for (map<int, char>::iterator entry = flipped.begin(); entry != flipped.end(); entry++)
    {
        cout << entry->first << " = " << entry->second << endl;
    }

    cout << endl;

    int max = 0;
    int min = INT_MAX;
    char cMin, cMax = 'A';
    // part which can be simplified by templates
    for (auto cc = flipped.begin(); cc != flipped.end(); ++cc)
    {
        if (cc->first > max) {
            max = cc->first;
            cMax = cc->second;
        }

        if (cc->first < min) {
            min = cc->first;
            cMin = cc->second;
        }
    }

    cout << "Predominant letter : " << cMax << " with " << max << " occurences" << endl;
    cout << "Marginal letter : " << cMin << " with " << min << " occurences" << endl;
}



A = 60
B = 8
C = 94
D = 29
E = 100
F = 28
G = 10
H = 16
I = 76
L = 24
M = 38
N = 90
O = 43
P = 38
R = 85
S = 41
T = 103
U = 37
V = 1
W = 2
X = 15
Y = 20

1 = V
2 = W
8 = B
10 = G
15 = X
16 = H
20 = Y
24 = L
28 = F
29 = D
37 = U
38 = P
41 = S
43 = O
60 = A
76 = I
85 = R
90 = N
94 = C
100 = E
103 = T

Predominant letter : T with 103 occurences
Marginal letter : V with 1 occurences
I'm not sure why you want templates here: std::minmax_element already exists.

Your frequency table will miss letters with no occurrences, and it gives limited output when more than one letter shares the minimum or maximum frequency.

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <iterator>
#include <cctype>
using namespace std;

int main()
{
   const int N = 26;
   int occurrences[N] = { 0 };
   
   ifstream inputFile( __FILE__ );
   string input = string( istreambuf_iterator<char>( inputFile ), istreambuf_iterator<char>() );
   
   for ( char c : input )
   {
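      // The cast avoids undefined behaviour when char is signed and c is negative
      int i = toupper( static_cast<unsigned char>( c ) ) - 'A';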
      if ( 0 <= i && i < N ) occurrences[i]++;
   }

   int minCount = input.size(), maxCount = 0;
   for ( int i = 0; i < N; i++ ) 
   {
      if ( occurrences[i] < minCount ) minCount = occurrences[i];
      if ( occurrences[i] > maxCount ) maxCount = occurrences[i];
   }
   
   for ( int i = 0; i < N; i++ ) cout << (char)( i + 'A' ) << ": " << occurrences[i] << '\n';
   
   cout << "\nMost frequent letters (count = " << maxCount << "): ";
   for ( int i = 0; i < N; i++ ) if ( occurrences[i] == maxCount ) cout << (char)( i + 'A' ) << " ";
   cout << "\nLeast frequent letters (count = " << minCount << "): ";
   for ( int i = 0; i < N; i++ ) if ( occurrences[i] == minCount ) cout << (char)( i + 'A' ) << " ";
}


A: 28
B: 2
C: 62
D: 7
E: 49
F: 19
G: 5
H: 7
I: 73
J: 0
K: 0
L: 13
M: 19
N: 65
O: 39
P: 9
Q: 2
R: 49
S: 26
T: 58
U: 43
V: 0
W: 0
X: 5
Y: 1
Z: 1

Most frequent letters (count = 73): I 
Least frequent letters (count = 0): J K V W 
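For anyone who still wants a template for this exercise, the std::minmax_element suggestion above can be wrapped in a small generic function. This is only a sketch: the name minMaxByCount is made up for illustration and does not appear anywhere in the thread. It assumes the map is not empty, and when several letters tie it reports just one of them (the first smallest and the last largest).

```cpp
#include <algorithm>
#include <map>
#include <utility>

// Hypothetical helper (not from the thread): returns the (key, count) entry
// with the smallest count and the entry with the largest count.
// Precondition: occurrences must not be empty.
template <typename Key, typename Count>
std::pair<std::pair<Key, Count>, std::pair<Key, Count>>
minMaxByCount(const std::map<Key, Count>& occurrences)
{
    using Entry = typename std::map<Key, Count>::value_type;

    // Compare entries by their count (second), not by their key (first).
    const auto result = std::minmax_element(
        occurrences.begin(), occurrences.end(),
        [](const Entry& a, const Entry& b) { return a.second < b.second; });

    return std::make_pair(
        std::make_pair(result.first->first, result.first->second),
        std::make_pair(result.second->first, result.second->second));
}
```

With a map<char, int> called occurrences, auto r = minMaxByCount(occurrences); gives the marginal letter in r.first.first and the predominant letter in r.second.first, with their counts in the matching .second members.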
Lines 52-67 of your code (the final loop that searches flipped for the minimum and maximum) can be vastly simplified, without any loops, templates, etc.

flipped is sorted by ascending occurrences, so the first element has the lowest count and the last element has the highest. The lowest is therefore at flipped.begin() and the highest at prev(flipped.end()) or flipped.rbegin().

#include <iostream>
#include <map>
#include <fstream>
#include <algorithm>
#include <string>
#include <cctype>
#include <iterator>

using namespace std;

int main()
{
	ifstream inputFile("main.cpp");

	if (!inputFile)
		return (cout << "Cannot open input file\n"), 1;

	string input((istreambuf_iterator<char>(inputFile)), istreambuf_iterator<char>());
	map<char, int> occurrences;
	multimap<int, char> flipped;

	for (auto& c : input)
		if (isalpha(static_cast<unsigned char>(c)))
			++occurrences[static_cast<char>(toupper(static_cast<unsigned char>(c)))];

	for (const auto& [ch, cnt] : occurrences) {
		cout << ch << " = " << cnt << '\n';
		flipped.emplace(cnt, ch);
	}

	cout << '\n';

	for (const auto& [cnt, ch] : flipped)
		cout << cnt << " = " << ch << '\n';

	cout << '\n';

	const auto cmax {flipped.crbegin()};
	const auto cmin {flipped.cbegin()};

	cout << "Predominant letter(s) with " << cmax->first << " occurrences: ";
	for (auto xit = cmax; xit != flipped.crend() && xit->first == cmax->first; ++xit)
		cout << xit->second << ' ';

	cout << "\nMarginal letter(s) with " << cmin->first << " occurrences: ";
	for (auto xit(cmin); xit != flipped.cend() && xit->first == cmin->first; ++xit)
		cout << xit->second << ' ';

	cout << '\n';
}


Based upon text from the first post:


A = 33
B = 4
C = 19
D = 13
E = 53
F = 4
G = 7
H = 23
I = 27
K = 6
L = 23
M = 10
N = 23
O = 35
P = 14
R = 27
S = 30
T = 51
U = 16
V = 3
W = 8
X = 1
Y = 5

1 = X
3 = V
4 = B
4 = F
5 = Y
6 = K
7 = G
8 = W
10 = M
13 = D
14 = P
16 = U
19 = C
23 = H
23 = L
23 = N
27 = I
27 = R
30 = S
33 = A
35 = O
51 = T
53 = E

Predominant letter(s) with 53 occurrences: E
Marginal letter(s) with 1 occurrences: X
Hello. Thank you for your help. I really appreciate it. I like to see how others approach the same problem and how they write the code differently. I would like to use templates as an exercise, nothing more. Templates are new to me. I understand that they can help developers simplify some operations, such as a single comparison. I tried to do that, but I could not find a good way to retrieve the letters that go with their integer counts. Thank you for your help and your comments. I guess that I found a good place here. See you later ++
Your flipped map was incorrect because it didn't account for different letters having the same frequency. To handle that, it would have to be something like a std::multimap, and you'd insert the pairs from occurrences rather than assigning through operator[].
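To make the difference concrete, here is a small sketch of both ways of flipping the table. The function names flipLosingTies and flipKeepingTies are invented for this illustration. The first mirrors the assignment used in the original post, the second uses the std::multimap insertion described above.

```cpp
#include <map>

// Hypothetical helper: flips letter -> count into count -> letter using
// operator[] assignment. A later letter with the same count overwrites
// the earlier one, so ties are lost.
template <typename Key, typename Count>
std::map<Count, Key> flipLosingTies(const std::map<Key, Count>& occurrences)
{
    std::map<Count, Key> flipped;
    for (const auto& entry : occurrences)
        flipped[entry.second] = entry.first;
    return flipped;
}

// Hypothetical helper: same flip, but a multimap accepts duplicate keys,
// so every letter survives even when counts are equal.
template <typename Key, typename Count>
std::multimap<Count, Key> flipKeepingTies(const std::map<Key, Count>& occurrences)
{
    std::multimap<Count, Key> flipped;
    for (const auto& entry : occurrences)
        flipped.emplace(entry.second, entry.first);
    return flipped;
}
```

For occurrences {B: 4, F: 4, X: 1}, the first version ends up with two entries (F has replaced B under the key 4), while the second keeps all three.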
If you wanted to template this, you could set up a version to support various character encodings (Unicode, some of the 16-bit schemes, whatever).

That would actually be mildly practical (they can all be treated like a 32-bit integer, so it's not THAT practical, but if you ignore that detail and work with the provided base type, it would be similar to simple real-world use cases).
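One possible reading of this suggestion, as a rough sketch: make the counting function generic over the character type of the string. The name countOccurrences is hypothetical, and the sketch simply counts each code unit as stored, with no case folding and no attempt to decode multi-unit sequences.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: counts how often each character value occurs in a
// string of any character type (char, wchar_t, char16_t, char32_t).
// Assumption: one element of the string is treated as one "letter".
template <typename CharT>
std::map<CharT, std::size_t> countOccurrences(const std::basic_string<CharT>& text)
{
    std::map<CharT, std::size_t> occurrences;
    for (CharT c : text)
        ++occurrences[c];
    return occurrences;
}
```

The same function then accepts std::string, std::wstring or std::u32string, and CharT is deduced from the argument.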
Thank you for all your kind explanations, which are really useful. In the end I changed my code so that it is optimized according to your advice. Thank you ++

PS : whoa! you got an impressive C++ shell tool. Good stuff ++
Not really. It doesn't support C++17 or C++20.
Oh. I did not notice it. It's a shame :/

In the end I developed my exercise using the code that lastchance wrote. It seems really clever to me, and I added a template to compute each letter's percentage in the text. I know that it is not really useful, but I wanted it as an exercise. I am not used to these semantics. Thank you for all your advice ++

#include <cctype>     // toupper
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>   // istreambuf_iterator
#include <string>
using namespace std;

template <typename T>
T percent(const T o, const T n)
{   // occurrence / input.length * 100
    return (o / n) * 100;
}

int main()
{
    const int N = 26;
    int occurrences[N] = { 0 };
    // here your text file
    ifstream inputFile(__FILE__);
    string input = string(istreambuf_iterator<char>(inputFile), istreambuf_iterator<char>());

    cout << "Number of characters in the text : " << input.length() << endl;
    cout << endl;

    for (char c : input)
    {
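        // The cast avoids undefined behaviour when char is signed and c is negative
        int i = toupper(static_cast<unsigned char>(c)) - 'A';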
        if (0 <= i && i < N) occurrences[i]++;
    }

    int minCount = input.size(), maxCount = 0;

    for (int i = 0; i < N; i++)
    {
        if (occurrences[i] < minCount) minCount = occurrences[i];
        if (occurrences[i] > maxCount) maxCount = occurrences[i];
    }

    for (int i = 0; i < N; i++) cout << " " << (char)(i + 'A') // letter
        << " : " << occurrences[i] // occurrence
        << "\t (" << fixed << setprecision(2) << percent<double>(occurrences[i], input.length()) << " %)" << endl; // percentage

    cout << "\nMost frequent letters (count = " << maxCount << "): ";
    for (int i = 0; i < N; i++) if (occurrences[i] == maxCount) cout << (char)(i + 'A') << " ";
    cout << "\nLeast frequent letters (count = " << minCount << "): ";
    for (int i = 0; i < N; i++) if (occurrences[i] == minCount) cout << (char)(i + 'A') << " ";
    cout << endl;
}


Output:

Number of characters in the text : 1469

 A : 32  (2.18 %)
 B : 4   (0.27 %)
 C : 77  (5.24 %)
 D : 10  (0.68 %)
 E : 83  (5.65 %)
 F : 22  (1.50 %)
 G : 7   (0.48 %)
 H : 12  (0.82 %)
 I : 78  (5.31 %)
 J : 0   (0.00 %)
 K : 0   (0.00 %)
 L : 20  (1.36 %)
 M : 22  (1.50 %)
 N : 86  (5.85 %)
 O : 52  (3.54 %)
 P : 18  (1.23 %)
 Q : 2   (0.14 %)
 R : 63  (4.29 %)
 S : 31  (2.11 %)
 T : 85  (5.79 %)
 U : 53  (3.61 %)
 V : 0   (0.00 %)
 W : 0   (0.00 %)
 X : 8   (0.54 %)
 Y : 2   (0.14 %)
 Z : 1   (0.07 %)

Most frequent letters (count = 86): N
Least frequent letters (count = 0): J K V W
Mmm, but your percentages are wrong, at least as percentages of the total letter count: they fall short of adding up to 100% by quite a large margin.

The problem is that the string input includes a lot of things other than letters. (Hopefully you have lots of spaces in your code, at least, as well as a few numbers and punctuation.)

It is inefficient to delete from a string. Just sum the contents of occurrences[] to get the total letter count.
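As a sketch of the "just sum the contents" idea: std::accumulate can total the counts, and each percentage is then taken against that total. The function name letterPercentages is hypothetical, and std::array is used here in place of the plain int array from the listing so that the size travels with the type. Note also that a template like percent above divides before multiplying, so if T is deduced as int, percent(3, 4) evaluates to 0. The sketch multiplies by 100.0 first to stay in floating point.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Hypothetical helper: converts 26 letter counts into percentages of the
// total number of counted letters (not of all characters in the file).
std::array<double, 26> letterPercentages(const std::array<int, 26>& occurrences)
{
    // Total letter count is simply the sum of the table.
    const int total = std::accumulate(occurrences.begin(), occurrences.end(), 0);

    std::array<double, 26> percentages{};   // all zeros
    if (total == 0)
        return percentages;                 // avoid dividing by zero

    for (std::size_t i = 0; i < occurrences.size(); ++i)
        percentages[i] = 100.0 * occurrences[i] / total;

    return percentages;
}
```

Computed this way, the percentages add up to 100 (up to floating-point rounding) whenever at least one letter was counted.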
Right. I have to fix this mistake ++

#include <cctype>     // toupper
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>   // istreambuf_iterator
#include <string>
using namespace std;

template <typename T>
T percent(const T o, const T n)
{   // occurrence / input.length * 100
    return (o / n) * 100;
}

int main()
{
    const int N = 26;
    int counter = 0;
    int occurrences[N] = { 0 };
    // here your text file
    ifstream inputFile(__FILE__);

    if (!inputFile)
        return (cout << "Cannot open input file\n"), 1;

    string input = string(istreambuf_iterator<char>(inputFile), istreambuf_iterator<char>());

    for (char c : input)
    {
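        // The cast avoids undefined behaviour when char is signed and c is negative
        int i = toupper(static_cast<unsigned char>(c)) - 'A';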
        if (0 <= i && i < N)
        {
            occurrences[i]++;
            counter++;
        }
    }

    int minCount = input.size(), maxCount = 0;

    for (int i = 0; i < N; i++)
    {
        if (occurrences[i] < minCount) minCount = occurrences[i];
        if (occurrences[i] > maxCount) maxCount = occurrences[i];
    }

    for (int i = 0; i < N; i++) cout << " " << (char)(i + 'A') // letter
        << " : " << occurrences[i] // occurrence
        << "\t (" << fixed << setprecision(2) << percent<double>(occurrences[i], counter) << " %)" << endl; // percentage

    cout << endl;
    cout << "Number of characters in the text : " << counter << endl;
    cout << "Most frequent letters (count = " << maxCount << "): ";
    for (int i = 0; i < N; i++) if (occurrences[i] == maxCount) cout << (char)(i + 'A') << " ";
    cout << "\nLeast frequent letters (count = " << minCount << "): ";
    for (int i = 0; i < N; i++) if (occurrences[i] == minCount) cout << (char)(i + 'A') << " ";
    cout << endl;
}



 A : 32  (4.12 %)
 B : 4   (0.52 %)
 C : 81  (10.44 %)
 D : 10  (1.29 %)
 E : 85  (10.95 %)
 F : 22  (2.84 %)
 G : 5   (0.64 %)
 H : 10  (1.29 %)
 I : 77  (9.92 %)
 J : 0   (0.00 %)
 K : 0   (0.00 %)
 L : 18  (2.32 %)
 M : 22  (2.84 %)
 N : 86  (11.08 %)
 O : 56  (7.22 %)
 P : 16  (2.06 %)
 Q : 2   (0.26 %)
 R : 67  (8.63 %)
 S : 31  (3.99 %)
 T : 86  (11.08 %)
 U : 55  (7.09 %)
 V : 0   (0.00 %)
 W : 0   (0.00 %)
 X : 8   (1.03 %)
 Y : 2   (0.26 %)
 Z : 1   (0.13 %)

Number of characters in the text : 776
Most frequent letters (count = 86): N T
Least frequent letters (count = 0): J K V W
@Geckoo, a suggestion to save you a lot of possible grief later when dealing with files....

ALWAYS check the status of opened files, input and output, before you begin processing them. Especially input files.

What would happen if you move your app's executable to a different location and not the input file? Or if you change the name of the input file and don't update the name in your source code?

Especially annoying is when your app prompts a user for a file name needed for input and they mistype it.

The code seeplus gave shows one way to deal with possible problems when opening files: the if (!inputFile) check on lines 15-16 of that listing.

Never assume a file's opened status, always check.
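One way to make that check hard to forget is to put the open, the test and the read into a single function, so the caller only ever sees success or failure. This is a minimal sketch and the name readWholeFile is made up for illustration. It only detects a failure to open, not errors that happen later during reading.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads an entire file into `contents`.
// Returns false, leaving `contents` untouched, if the file cannot be opened.
bool readWholeFile(const std::string& path, std::string& contents)
{
    std::ifstream file(path);
    if (!file)
        return false;   // wrong name, wrong directory, no permission, ...

    contents.assign(std::istreambuf_iterator<char>(file),
                    std::istreambuf_iterator<char>());
    return true;
}
```

A caller can then write if (!readWholeFile(name, input)) and report the problem, or ask the user for the file name again, before any counting starts.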
You are right, Furry Guy. It is very important to anticipate user mistakes. Done. Thank you ++