Need help coming up with approach

So I want to make a hangman game, where the computer tries to guess the word you're thinking of.

How would I find the most frequent letter that only about 50% of the words contain? (Eliminating possible words that way is all I could come up with for making an efficient hangman guesser.)

But that aside, is there a better way to do this? The hangman should not ask the user for the placement of letters (my requirement); it can only ask whether the word contains a certain letter. Only at the end, if all remaining words have the same letters, does it start asking about placement.


Here's a function I wrote to find occurrences of characters in a file, to give you an idea of how I want to deal with the letters (the char_occurance[] array, namely). By the way, this function blindly counts every occurrence of every character.

char find_most_occuring(string file_name, string exten = ".txt", string direc = "") {
	if (direc.empty())
		file_name += exten;
	else
		file_name = direc + '\\' + file_name + exten;

	long long int char_occurance[256] = { 0 };
	unsigned char highest = 'a';

	ifstream file(file_name);
	string line;

	while (getline(file, line)) {
		for (unsigned char i : line) { // unsigned: a plain char may be negative and index out of bounds
			char_occurance[i]++;
			if (char_occurance[i] > char_occurance[highest])
				highest = i;
		}
	}
	return highest;
}
You can look online for a breakdown of English letter frequency. Or watch Wheel of Fortune for like 10 min. RSTLNE are pretty high on the list. S may drop way off if you disallow plurals and such; I don't know.

Does the game know the length of the word? In traditional hangman you did know this, and as you got letters you saw where they were placed.

The 'best' approach is to do this:
get the word length
get a list of all the words of that length
pick the highest-frequency letter in that group that has not yet been picked (this is moderately expensive for the 4-5 letter words, which make up the bulk of the language, but you could pre-compute a list of these from your dictionary once and code it in)
eliminate words that <have, do not have> that letter depending on the answer to the guess
repeat until the word is known.
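The eliminate step above can be sketched in a few lines. This is a hedged illustration, not anyone's actual code from the thread; the name `eliminate` and the in-memory word list are my own choices:

```cpp
#include <string>
#include <vector>

// Keep only the candidate words consistent with the user's yes/no answer:
// if word_has_letter is true, keep words containing `letter`; else keep
// words that do not contain it.
std::vector<std::string> eliminate(const std::vector<std::string>& words,
                                   char letter, bool word_has_letter) {
    std::vector<std::string> remaining;
    for (const std::string& w : words) {
        bool has = w.find(letter) != std::string::npos;
        if (has == word_has_letter)
            remaining.push_back(w);
    }
    return remaining;
}
```

Calling this once per answered question shrinks the candidate list until one word remains.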
You have to know the letter order, IMHO. Consider just a simple three-letter word: eat, tea, and ate are three anagrams of the same letters, and it gets worse with longer words.


The practical approach is to just use English letter statistics (slightly randomized to prevent gaming the system) to pick your letters, and then the same logic: eliminate words from the list until you know it.

the user must pick a word in the game's dictionary.
Thanks a lot jonnin.

I will post the entire program once it's done on this forum so anybody can review it.

"get the word length"
Yes this is very necessary

"get a list of all words of that length"
I have a separate text document for each number of letters. So 2 letter words are in a separate file than 3 letter words.

"pick the highest frequency"
So basically, my previous function, but I must make sure I'm not counting the same letter more than once within a single word.

Eventually, however, we will be left with letters that only few words have like z. What then? This was my initial confusion.

We're trying to take as few guesses as possible.

I get that asking for letter order will be handy, but can we keep the order questions as a last resort, for when the distinguishing letters are too rare or there are none at all? This, I think, is the hardest part. That's the only LOGIC I'm not able to come up with.
Highest frequency isn't bad at all in ASCII:
unsigned int freq[256] = { 0 };
for (all the words)
    upper- or lower-case the word if necessary
    for (all the letters in each word)
        freq[word[letter]]++;
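Filling in that outline as runnable code (the name `letter_freq` is mine), and counting each letter at most once per word, which is what the word-elimination logic discussed here needs:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Per-word letter frequency table: freq[c] = number of words containing
// character c at least once, case-folded to lower case.
std::vector<unsigned> letter_freq(const std::vector<std::string>& words) {
    std::vector<unsigned> freq(256, 0);
    for (const std::string& w : words) {
        bool seen[256] = { false };        // letters already counted for this word
        for (unsigned char c : w) {        // unsigned char: safe array index
            unsigned char lc = (unsigned char)std::tolower(c);
            if (!seen[lc]) {
                freq[lc]++;
                seen[lc] = true;
            }
        }
    }
    return freq;
}
```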

Eventually, however, we will be left with letters that only few words have like z. What then? This was my initial confusion.
this is just probability.
if you have discovered that A and P are in a 3 letter word, you have stuff like
pad
cap
zap //unusual letter problem
map
gap
lap //order problem!
pal //order problem!
ape
and so on

You just have to keep going, if you INSIST on making the ordering of the letters the last thing you do.
Your likely-English letter test is going to check z as one of the last possible letters.
But some letters just don't make a word... there is no PA(x) combo. There is no PA(h) combo. Or whatever. You have to track what words are still possible, what letters those words have, and what the next most common English letter is that you have not tried that is in the remaining words, or which has the highest frequency among the remaining; your call on that. Say the next letter to check is c... if that is correct, it's cap, done. If not, eliminate cap and try the next letter. Eventually you get to z and the only option is zap, and you guessed it. If you pick L and it's not in the word, you eliminate 2 words from your possibles. If it is in the word, the program knows all 3 letters but not the order, so it has to ask about order at that point, however you want to handle this.




I wouldn't worry too much about performance. A dictionary of 100,000 words would take only about a megabyte.

I suggest something a little different from Jonnin: given a set of words, guess the letter that most nearly cuts the set in half. In other words, compute the frequency of each letter and then choose the letter whose frequency is closest to 1/2 the size of the list.
That works. If you go my route of trying to eliminate the most words each time, you actually leave a lot more words behind if the letter is a success. Cutting it in half gives you the most win whether the letter is in there or not.
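A hedged sketch of that half-split rule (identifiers are my own, not from the thread): count how many candidate words contain each letter, then pick the unguessed letter whose count is closest to half the list. Ties are broken alphabetically here for simplicity; as noted above, a random tie-break is better in practice.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Pick the unguessed letter whose per-word count is nearest to half the
// candidate list. `guessed` has 256 entries, indexed by character.
char half_split_letter(const std::vector<std::string>& words,
                       const std::vector<bool>& guessed) {
    long counts[256] = { 0 };
    for (const std::string& w : words)
        for (char c = 'a'; c <= 'z'; ++c)
            if (w.find(c) != std::string::npos)
                counts[(unsigned char)c]++;   // at most once per word

    long half = (long)words.size() / 2;
    char best = 'a';
    long best_dist = -1;
    for (char c = 'a'; c <= 'z'; ++c) {
        long n = counts[(unsigned char)c];
        if (n == 0 || guessed[(unsigned char)c])
            continue;                          // absent or already asked
        long dist = std::labs(n - half);
        if (best_dist < 0 || dist < best_dist) {
            best = c;
            best_dist = dist;
        }
    }
    return best;
}
```

For the four words abode, alift, acorn, prone discussed in this thread it picks 'e' (count 2, exactly half of 4).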
Suppose there are four words, each five letters long, and one of them is the word the user is thinking of.

The four words are:
-> abode, alift, acorn, prone

'a' is the letter that occurs in the most words.

-> Program asks user if 'a' is in his word.

If it's not then, 'prone' is the only word left and hence we're done.
But if the word does have an 'a', we have abode, alift and acorn left.

Now 'o' occurs in the most remaining words, so we ask the user if the word has an 'o'.

We're left with abode and acorn if the user said yes.
Now it asks whether 'b' is in the word.


Is this right? Is this the best way? Can there be an instance where this is inefficient, in that it takes a lot of tries to guess the user's word?
How would one implement dhayden's approach?

jonnin's approach seems simple enough; I don't know if it's the most efficient for a large number of words, though. I will try to make a sample program which shows the statistics later on when I have free time; seems like it's going to be a fun thing to do.
Also, in jonnin's approach, we need a way to foresee anagrams.
Would this be the best way: if the highest letter frequency is 1, check whether all the words are anagrams. If they're not all anagrams, continue with the same approach as before; otherwise ask for the placement of a letter whose position is not the same in the other words.

Also, we must not ask about letters that all the remaining words have.

Awesome. Thought it would be more complex.
Plus, I have to add a function to check if all the words are anagrams, and then that's all, right? Then just write the logic.

Any pointers or mistake corrections would be appreciated.
Here is the full program as of now:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>

using namespace std;

int take_num_letters(void) {
	int number_of_letters;
	cout << "How many letters does your word have?: ";
	cin >> number_of_letters;
	return number_of_letters;
} // Takes Input

void initialize_file(string file_name, string exten = ".txt", string direc = "") {
	if (direc.empty())
		file_name += exten;
	else
		file_name = direc + '\\' + file_name + exten;

	remove(file_name.c_str());
	ofstream initializing(file_name);
	initializing.close();
} // Initializes File

bool if_num_file_exist(int number_of_letters) {
	ifstream file("words\\" + to_string(number_of_letters) + "_letters.txt");
	return static_cast<bool>(file); // a stream converts to true only if the open succeeded
}

char find_most_occuring(string file_name, unsigned int &occ, string exten = ".txt", string direc = "") {
	if (direc.empty())
		file_name += exten;
	else
		file_name = direc + '\\' + file_name + exten;

	long long int char_occurance[256] = { 0 };
	unsigned char highest = 'a';

	ifstream file(file_name);
	string line;

	while (getline(file, line)) {
		bool in_word[256] = { false };
		for (unsigned char i : line) { // unsigned: a plain char may be negative and index out of bounds
			if (!in_word[i]) {
				char_occurance[i]++;
				if (char_occurance[i] > char_occurance[highest])
					highest = i;
			}
			in_word[i] = true;
		}
	}
	file.close();
	occ = char_occurance[highest];
	return highest;
}

void file_to_file_write(string file_from, string file_to, string from_exten = ".txt",
	                    string to_exten = ".txt", string direc_from = "", string direc_to = "") {

	if (direc_from.empty())
		file_from += from_exten;
	else
		file_from = direc_from + '\\' + file_from + from_exten;

	if (direc_to.empty())
		file_to += to_exten;
	else
		file_to = direc_to + "\\" + file_to + to_exten;

	ifstream input_file(file_from);
	ofstream output_file(file_to);

	string line;
	while (getline(input_file, line)) {
		output_file << line << '\n';
	}

	input_file.close();
	output_file.close();
}

void update_possibilities(char search_element, bool has_or_not, string file_name,
	string exten = ".txt", string direc = "") {
	if (direc.empty())
		file_name += exten;
	else
		file_name = direc + '\\' + file_name + exten;

	ifstream input_file(file_name);
	ofstream temporary_file("temporary_19245.txt");
	string line;


	while (getline(input_file, line)) {
		bool has = false;

		for (char i : line)
			if (i == search_element)
				has = true;

		if (has_or_not) {
			if (has)
				temporary_file << line << '\n';
		}
		else
			if (!has)
				temporary_file << line << '\n';
	}
	input_file.close();
	temporary_file.close();

	remove(file_name.c_str());
	rename("temporary_19245.txt", file_name.c_str());
}

bool is_all_anagram() {
	return true; // TODO: stub for now; flowing off the end of a non-void function is undefined behavior
}


void retry_message(void) {
	system("cls");
	cout << "Hmm I don't know of any word with that "
		<< "many letters, try again.\n\n";
} 

int main() {
	int num_letters;
	
	while ((num_letters = take_num_letters())) { // extra parens: assignment as condition is intentional
		if (if_num_file_exist(num_letters)) {
			initialize_file("possibilities");
			file_to_file_write("words\\" + to_string(num_letters), "possibilities");
			break;
			// write from file to possibilities
		}
		else {
			retry_message();
		}
	}
	
	unsigned int num_occ{ 0 };
	char most_occuring_char =  find_most_occuring("possibilities", num_occ);
	char asked_chars[20];

	system("pause");
	return 0;
}
You can sort the strings and remove the duplicate letters (easy once it's sorted) to do the anagram comparisons. Give modern hardware its due... without a lot of effort at speeding it up, it would still do well over 100k words/sec.
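That sort-then-dedupe trick can be sketched as a small helper (the name `letter_signature` is mine): two words use the same letter set exactly when their sorted, deduplicated forms match.

```cpp
#include <algorithm>
#include <string>

// Canonical letter-set form of a word: sort the characters, then erase
// adjacent duplicates (std::unique only works on sorted input).
std::string letter_signature(std::string word) {
    std::sort(word.begin(), word.end());
    word.erase(std::unique(word.begin(), word.end()), word.end());
    return word;
}
```

Comparing signatures answers "same letters?" directly, e.g. for the lap/pal ordering problem mentioned earlier.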


How would one implement dhayden's approach?
It's just a change in how you pick the letter you want to guess.
abode, alift, acorn, prone
If you pick e instead of a, then no matter whether it's there or not, you get rid of half the possible words:
abode, prone remain if e is in the word;
alift, acorn remain if it isn't.

To find that, you still have the same frequency table. A quick pass at it by hand gives:
a3
o3
e2
n2
r2
b1
c1
d1
i1
l1
p1
t1

So what is closest to 1/2 the list? You have e, n, r. Of those you can pick randomly (remember when I said picking randomly prevents gaming the system or making it predictable and boring?) or you can guess at being efficient (pick e because it's the most common letter, maybe?).
Okay here's the anagram checking function
// note: needs #include <algorithm> for std::sort
bool is_all_anagram(string file_name, string exten = ".txt", string direc = "") {
	if (direc.empty())
		file_name += exten;
	else
		file_name = direc + '\\' + file_name + exten;

	ifstream file(file_name);
	string line, first_word;
	getline(file, first_word);
	sort(first_word.begin(), first_word.end());

	while (getline(file, line)) {
		sort(line.begin(), line.end());
		if (line != first_word) {
			file.close();
			return false;
		}
	}
	file.close();
	return true;
}



So in dhayden's case, I need to iterate over the list and remember the character whose frequency is closest to half the size of the list, which I can do by minimizing |list_size/2 - frequency of that letter|.

Doing something similar to your approach: if I were to do what I said,
i.e. eliminate via the highest-frequency letter, and when the highest frequency is 1 check for anagrams and, if so, ask for placement, would that be efficient?

But how would I introduce randomness?

Also, should I use the address-of operator everywhere I can to prevent copying of data? I don't see a lot of people doing it, so is it a convention thing (maybe it's more confusing to readers than helpful to the program's efficiency)?

Where's a nice place to read about C++ conventions, if you happen to know of any? Appreciate it.
You should pass by reference (it's a reference, not address-of) for anything larger than an integer/pointer. We may not always do that in quick short examples here; I know I forget it when typing short examples. The difference for a function that is called a LOT of times is huge; the difference for a function called very infrequently is less. If you call a function in a for loop and copy a big class every iteration, you can see it when you profile the code.

Randomness... whenever you have a choice might be enough. As in my example, there are 3 choices for the next letter that cut your word list in half either way. Randomly pick one there. If there is only one, pick that one; if there is more than one, pick any of them. /shrug

C++ conventions vary wildly, and very little is common across different businesses or even projects. There are many styles out there that you can read (just web search it), but they will disagree on many details. Use of references is not a convention or style, though; it's more efficient to use them when passing large data, period, so that is 'right', not 'style'. Technically they should be const references if you do not change the value inside, and if you do change it, remember that it changes the original!
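A small illustration of that advice (the function and its name are mine, purely for demonstration): a const reference parameter avoids copying the whole word list on every call, while still guaranteeing the caller's data is not modified.

```cpp
#include <string>
#include <vector>

// Const reference: no copy of the vector is made, and the compiler
// rejects any attempt to modify `words` inside the function.
std::size_t total_letters(const std::vector<std::string>& words) {
    std::size_t n = 0;
    for (const std::string& w : words)  // reference here too: no per-word copy
        n += w.size();
    return n;
}
```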

For randomness, how about shuffling the words? Then if more than one candidate has the highest frequency, either can be chosen. Maybe shuffle the words once, after getting the number of letters?

Or is there a better way?
By the way, this is pretty different from a normal hangman game. In the normal game, you only get penalized for a wrong answer (which makes my approach inappropriate). After some fixed number of wrong answers you lose. Also you learn the position of correct letters.

So are you trying to recreate the normal game, or do you really mean to create something new?
I guess in this case it's the computer who's guessing and you who's penalizing? :P

By the way, for shuffling the words in a file, taking all the words into memory would be a bad idea, right? There would be at least 50,000 words in that file, and if each word is 10 bytes (a 10-letter word) then that would require 500,000 bytes of RAM.

How should I do this? Should I use rand() to split the words into, say, 5 different files and then take the words back from those 5 files into the original file in random order? That seems like it would be expensive too.

And if I did have to take them into RAM, how would I shuffle them there?
I guess in this case it's the computer who's guessing and you who's penalizing? :P

Does that mean this should implement the standard game where the computer guesses the word? That means you have to tell it the positions of correct letters. It also means there's no penalty to guessing a correct letter. These things greatly affect the design. For example, the whole anagram issue disappears.

If this is an assignment, can you post the text of the assignment? Often little things in the requirements make a difference.

that would require 500 000 bytes of RAM
500,000 bytes of RAM isn't that much. It's 0.05% of 1GB. I'd read the words into a vector and shuffle with std::shuffle(). The whole process will be about 10 lines of code.
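The read-into-a-vector-and-shuffle idea really is about that short. A hedged sketch (the function name is mine, and it takes a generic input stream rather than opening a file, so it works with any word source):

```cpp
#include <algorithm>
#include <istream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Read every whitespace-separated word from the stream, then shuffle
// with std::shuffle (rand()-based tricks are not needed in modern C++).
std::vector<std::string> read_and_shuffle(std::istream& in) {
    std::vector<std::string> words;
    std::string w;
    while (in >> w)
        words.push_back(w);
    std::mt19937 rng{std::random_device{}()};
    std::shuffle(words.begin(), words.end(), rng);
    return words;
}
```

With a real word file you would pass it an `ifstream` opened on the dictionary.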
dhayden, I really appreciate that you're going to this extent to make sure that, if this were an assignment, I'm doing it right. ^_^

It's not an assignment, and I guess I should have explained in better detail (about how the program is supposed to work) at the beginning.

I'm trying variations. Asking for placements (before we are stuck with anagrams) can be another variation for another time, I'm just playing around pretty much.

In this variation the user thinks of a word and the computer asks if the word contains certain letters. Yes, originally, the user would have to give placement, that's how this type of hangman should work, but I want to avoid that to the extent possible, of course it's inevitable when left with anagrams.


Yes, I think using vectors would be a good idea. I know I'm over-complicating a lot, but I already know how to use vectors, so I won't learn anything from using them; I want to learn to think of different possible approaches.

Processing vs RAM: which is more efficient? (Files vs vectors, basically.) At least for this situation, which would be considered more efficient if we were just shuffling?

Vectors are easier, but which is more efficient?

Thanks for the explanation.

Processing vs RAM, what's more efficient? (Files vs Vectors basically)

Accessing data on a mechanical hard drive is a million times slower than RAM (milliseconds vs. nanoseconds). I think this is a case where a file-based solution is more code and slower - the worst of both worlds.
I frequently read 1-4 GB files at work directly into memory to process. I don't shuffle them, but I do some pretty brute-force touch-every-letter stuff, and it takes all of 30 seconds for the worst of my programs. You are using a tiny fraction of that; it won't take much time at all.