Trying to get at a string index from a specific string in a string array

Sounds crazy but I'm not sure how else to explain it.

I am working on a parsing program that will change the orthography (writing system) of a source to one that I use. I am doing this to help with translations of older historical materials. Sometimes when I need to swap out a character for another, there is the potential that I want two possible outcomes. So I double the number of strings in my wstring array and cycle through accordingly to give half one character and the other half another character. I got that part working.

These multiple outputs help me visualize possible translations with terms I am stuck on.

Now that the small explanation is out of the way, sometimes I need to make the PREVIOUS character a "double output" which will leave it as is along with a new character. I do this when I find an "n" in the string. If I find an "n" and if the previous vowel was either an a, i, or u, I want to double the output and crank out those regular vowels along with their ogonek variants (they'll have a little "tail").

Any help would be greatly appreciated :).

Here is my function:

vector<wstring> Parser(wchar_t* pInput, int iStringLength)

Here is the code I am having trouble with:

else if (pInput[i] == 'n')
		{

			// This character may output two possible outcomes.
			// Do a, i, or u (with ogonek) and the other regular.

			// Check if there is a previous character.  Because if there isn't
			// any, then do nothing.
			if (i >= 1)
			{
				// Now check to see if the previous characters are candidates
				// for nasalization.
				if ((pInput[i - 1] == 'a') || (pInput[i - 1] == 'i') || (pInput[i - 1] == 'u'))
				{
					// Now we can get into the double output
					// Multiply the count by 2
					g_iCount *= 2;

					// Resize the vector array
					g_Results.resize(g_iCount);

					// Copy the contents to both halves of the newly doubled array
					int cnt = 0;
					for (int j = ((g_iCount / 2)); j < (g_iCount); j++)
					{
						g_Results[j] = g_Results[cnt];
						cnt++;
					}

					// Now check which vowel was before the "n" and nasalize accordingly
					if (pInput[i - 1] == 'a')
					{
						// Store the index position that needs to be swapped out
						int tempIndex = i - 1;

						// Now cycle through all of the strings
						for (int k = 0; k < (g_iCount / 2); k++)
						{
							
							// NEED TO GET AT THE INDEXES IN THE STRINGS IN g_Results HERE!!!

						}
					}

				}
			}


			// Now keep the "n" where it needs to go.
			for (int j = 0; j < g_iCount; j++)
			{
				g_Results[j] += L"n";
			}

		}
If I understand you correctly, you want versions of the strings with all possible permutations of the double-char / char-with-ogonek. Even on a word basis (i.e. if a word has two nasal vowels, then there will be four permutations due to it alone.)

If so, you just need to loop through the strings and append the duplicated char to the first half of the array, and switch the last char to the ogonek equivalent in the other half.

Where these are assumed to be globals

size_t          g_iCount = 1;
vector<wstring> g_Results(g_iCount);


and

const wchar_t a_with_ogonek = L'\x0105';

A stripped down but complete version of your function is

vector<wstring> Parser(const wchar_t* pInput, int iStringLength) // EDIT added const
{
	// EDIT
	for(int i = 0; i < (iStringLength); i++)
	{
		// else what?
		if (pInput[i] == L'n') // EDIT made wide
		{
			// This character may output two possible outcomes.
			// Do a, i, or u (with ogonek) and the other regular.

			// Check if there is a previous character.  Because if there isn't
			// any, then do nothing.
			if (i >= 1)
			{
				// Now check to see if the previous characters are candidates
				// for nasalization.
				if ((pInput[i - 1] == L'a') || (pInput[i - 1] == L'i') || (pInput[i - 1] == L'u')) // EDIT made wide
				{
					// Now we can get into the double output
					// Multiply the count by 2
					g_iCount *= 2;

					// Resize the vector array
					g_Results.resize(g_iCount);

					// Copy the contents to both halves of the newly doubled array
					int cnt = 0;
					for (int j = ((g_iCount / 2)); j < (g_iCount); j++)
					{
						g_Results[j] = g_Results[cnt];
						cnt++;
					}

					// Now check which vowel was before the "n" and nasalize accordingly
					if (pInput[i - 1] == L'a')
					{
						// Store the index position that needs to be swapped out
						
						// EDIT not needed
						//int tempIndex = i - 1;

						// Now cycle through all of the strings
						
						//EDIT
						//for (int k = 0; k < (g_iCount / 2); k++)
						for (int k = 0; k < g_iCount; k++)
						{
							// NEED TO GET AT THE INDEXES IN THE STRINGS IN g_Results HERE!!!

							if (k < (g_iCount / 2))
							{
								// replace with a with ogonek
								size_t pos_last = g_Results[k].length() - 1;
								g_Results[k][pos_last] = a_with_ogonek;
							}
							else
							{
								// double
								g_Results[k] += L'a';
							}
						}
					}
				}
			}

			// Now keep the "n" where it needs to go.
			for (int j = 0; j < g_iCount; j++)
			{
				g_Results[j] += L"n";
			}
		}
		// EDIT
		else
		{
			for (int k = 0; k < g_iCount; k++)
			{
				g_Results[k] += pInput[i];
			}
		}
	}
	// EDIT

	// EDIT
	return g_Results; // or whatever?
}


Andy

PS While your function returns vector<wstring>, the variable you're using has a g_ prefix, suggesting it's a global. Does it need to be? Also, g_iCount is probably uncalled for (not sure, as I've not seen the rest of your code, but...) as g_Results.size() will return the current size.
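As a rough sketch of that suggestion (the names append_to_all and copy_through are made up for illustration, they are not from the original program), the results can live in a local vector that is returned by value, with size() standing in for the separate counter:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original code): appends ch to every
// string. The count comes from results.size(), so no separate counter
// has to be kept in step with the vector.
void append_to_all(std::vector<std::wstring>& results, wchar_t ch)
{
    for (std::size_t k = 0; k < results.size(); ++k)
    {
        results[k] += ch;
    }
}

// Hypothetical stand-in for Parser(): the results vector is a local that
// is returned by value, so the caller owns it and can ask it for its size.
std::vector<std::wstring> copy_through(const wchar_t* pInput, int iStringLength)
{
    std::vector<std::wstring> results(1); // start with one empty string
    for (int i = 0; i < iStringLength; ++i)
    {
        append_to_all(results, pInput[i]);
    }
    return results;
}
```

The caller (the dialog code, say) can then keep the returned vector and use its size() wherever the count is needed.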
Thanks for the reply!

I'll take a look at your code and work on seeing what you are doing with it. In the meantime I just wanted to say thanks :).

Also, you are most likely right about the global g_iCount variable. It might not need to be global and might not even be necessary. I can program just enough to be dangerous and am still getting a feel for the "best practices" that are out there :).
I got it to work! Still getting my head wrapped around HOW it is working though. I had to make a couple changes to your code which I'll show below.

Also, I'm going to look into your mention of the global vector<wstring>. I need the size of it in my dialog box code to be able to populate the listbox the appropriate number of times. I wonder if I could get that number some other way. Right now I am using the g_Results.size() for that and I can see where you are coming from about using that in the place of my g_iCount variable elsewhere.

I've also avoided creating a listbox class for this little program but I am wondering if it would be a good idea to do so. Not just because it may be a good programming practice but I need to learn at some point!

else if (pInput[i] == L'n')
		{

			// This character may output two possible outcomes.
			// Do a, i, or u (with ogonek) and the other regular.

			// Check if there is a previous character.  Because if there isn't
			// any, then do nothing.
			if (i >= 1)
			{
				// Now check to see if the previous characters are candidates
				// for nasalization.
				if ((pInput[i - 1] == L'a') || (pInput[i - 1] == L'i') || (pInput[i - 1] == L'u'))
				{
					// Now we can get into the double output
					// Multiply the count by 2
					g_iCount *= 2;

					// Resize the vector array
					g_Results.resize(g_iCount);

					// Copy the contents to both halves of the newly doubled array
					int cnt = 0;
					for (int j = ((g_iCount / 2)); j < (g_iCount); j++)
					{
						g_Results[j] = g_Results[cnt];
						cnt++;
					}

					// Now check which vowel was before the "n" and nasalize accordingly
					if (pInput[i - 1] == L'a')
					{

						// Now cycle through all of the strings
						for (int k = 0; k < g_iCount; k++)
						{
							size_t pos_last = g_Results[k].length() - 1; // EDIT put this here

							if (k < (g_iCount / 2))
							{
								// Replace with a with ogonek
								g_Results[k][pos_last] = 0x0105; // So I could use it here
							}
							else
							{
								// Double
								g_Results[k][pos_last] = L'a'; // And here (had to change this line)
							}
						}
					}
				}
			}


			// Now keep the "n" where it needs to go.
			for (int j = 0; j < g_iCount; j++)
			{
				g_Results[j] += L"n";
			}

		}
I'm unclear why you changed the line in the else branch (the one you commented "had to change this line"). Didn't you want to double the character if it was not replaced by an a with ogonek? As it now stands, the code appears to be replacing an 'a' with another 'a', which is pointless (you might as well do nothing.)

I am assuming here that the chars are copied to the results strings one after another, so when you're processing the 'n', the 'a' has already been added to all the result strings.

Andy
No, I wanted to double the number of strings and output one with a regular "a" and the other with "ą". So if I put "an" in my textbox, I get this in my listbox:

an
ąn

And if I put "anan" in my textbox, I get:

anan
anąn
ąnan
ąnąn

Which is exactly what I want. If the program finds an "n" character, it will check the preceding character for either an a, i, or u. If one of those is there, it will double the number of strings and give me both the regular and ogonek versions of those vowels.

I apologize if I was unclear. I was wanting to double the number of strings, not the character.
I was wanting to double the number of strings, not the character.

Oops...

In that case, I need to undo the miscorrection I made:

						// Now cycle through all of the strings
						for (int k = 0; k < (g_iCount / 2); k++) // EDIT back to what you had originally
						{
							size_t pos_last = g_Results[k].length() - 1;

							// Replace with a with ogonek
							g_Results[k][pos_last] = 0x0105;
						}


Also, while I'm here, why does Parser() take a (const) wchar_t* rather than a (const) wstring& ?? Is this due to the way you read your file/data?

Andy

PS Note that it is seen as better practice to declare a const with a suitable, self-explanatory name, than to use a "magic number" like 0x0105. Which is why I declared a_with_ogonek.

What is a magic number, and why is it bad?
http://stackoverflow.com/questions/47882/what-is-a-magic-number-and-why-is-it-bad
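A minimal sketch of the named-constant idea. The a_with_ogonek name comes from the post above; i_with_ogonek, u_with_ogonek and the replace_last helper are made up here for illustration:

```cpp
#include <cassert>
#include <string>

// Named constants instead of bare numbers scattered through the code.
const wchar_t a_with_ogonek = L'\x0105';
const wchar_t i_with_ogonek = L'\x012F';
const wchar_t u_with_ogonek = L'\x0173';

// Hypothetical helper: swaps the last character of a string for ch.
// An empty string is left alone, since it has no last character.
void replace_last(std::wstring& s, wchar_t ch)
{
    if (!s.empty())
    {
        s[s.length() - 1] = ch;
    }
}
```

A call like replace_last(result, a_with_ogonek) then says what it does without the reader having to look up 0x0105.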
It takes in a wchar_t* because that is what I (perhaps somewhat ignorantly) decided on. Would a wstring& work better? Like I said, I know just enough programming to be dangerous and am still learning the "best practices" that are out there :). If it would be better to adjust it, I'll go that route.

As far as the hex values, I found out that I could just copy/paste a character like ų or ð right into my code (IE blahblah += L"ð";). That gets rid of the "magic number" aspect. Is that route not ideal as well? Would a const variable (are you meaning something like "const wstring aWithOgonek = 0x0105;"?) work better?

As of right now my program doesn't read from a file. It is just a little dialog box that has an edit box where a user types in a word to be parsed. Then the user selects the source of the term via a combo box (which reminds me, I need to figure out something about that...I'll start another post) so the program will know which "orthography parser" needs to be used. Then it cranks out the possibilities and lists them in a listbox.

Your mention of reading from a file has given me an idea. Do you think it is feasible to read from an entire file and for each word generate something like a combo box on a form that will hold each possibility that the parser cranks out? Then the user can select the best option from each combo box which will reflect in a static label above those boxes (for easier reading). That could turn into a LOT of combo boxes (say 5 combo boxes per line) so there would need to be a way to scroll down on the form if it outputs that many. What do you think? It would be a great help with translations. The whole idea for my little project was to come up with a way to look at the different possibilities for an older orthography because sometimes you don't always make a connection in your mind. Having a program crank out the possibilities in our current orthography can help us figure out what terms are being used. This isn't intended to be a translation program (perhaps later) but rather just changing the writing system.

Again, thanks for your help and input. And questioning why I am doing something a certain way. I am an aspiring programmer and need all the help I can get :).
wstring&

wstring& would be safer. (Are you using a wstring to store your text, or a C-style wchar_t array?)
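To illustrate why the reference version tends to be safer, here is a small sketch. count_n is a made-up example function, not part of the parser; the point is only that a wstring carries its own length, while the pointer version relies on the caller passing a matching length:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical example function: counts the 'n' characters in the input.
// Taking const std::wstring& means the length travels with the text.
std::size_t count_n(const std::wstring& input)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < input.length(); ++i)
    {
        if (input[i] == L'n')
        {
            ++count;
        }
    }
    return count;
}

// Thin overload for callers that still hold a raw buffer and a length.
// It builds a wstring from the first iStringLength characters.
std::size_t count_n(const wchar_t* pInput, int iStringLength)
{
    return count_n(std::wstring(pInput, static_cast<std::size_t>(iStringLength)));
}
```

Both a wide string literal and an existing wstring can be passed to the first overload, so existing call sites usually need little change.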

"ð"

On reflection, I think my tendency to use the hex approach is very probably a throwback to the olden days when neither the Visual Studio IDE editor nor the Microsoft compiler could handle Unicode text, so any character outside the standard set had to be encoded like L'\x0105' (note that 0x0105 is an int literal, whereas L'\x0105' is a wchar_t literal.) In the case you're dealing with, actual literals like "ð" and "ą" would be fine as they are what they are! (With the right text encoding, of course.)
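The difference between the literal kinds can be checked at compile time. This sketch assumes a C++11 compiler (for static_assert and decltype) and a platform where wide characters hold Unicode code points, which is the case for the mainstream Windows and Linux compilers; is_ogonek_vowel is a made-up helper. A \u universal character name is a third spelling that does not depend on how the source file is saved:

```cpp
#include <cassert>
#include <type_traits>

// 0x0105 is an int literal; L'\x0105' is a wchar_t literal.
static_assert(std::is_same<decltype(0x0105), int>::value,
              "a bare hex number is an int");
static_assert(std::is_same<decltype(L'\x0105'), wchar_t>::value,
              "an L-prefixed character literal is a wchar_t");

// The hex escape and the universal character name give the same value
// (assumption: the wide execution character set is Unicode based).
static_assert(L'\x0105' == L'\u0105', "same character either way");

// Hypothetical helper: true if ch is one of the three ogonek vowels.
bool is_ogonek_vowel(wchar_t ch)
{
    return ch == L'\u0105' || ch == L'\u012F' || ch == L'\u0173';
}
```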

[A] LOT of combo boxes

Rather than create loads of combo boxes, what you can do is keep track of where the words with ogoneks are and then create/display a temporary combo box (or context menu?) when needed. This approach is used by spell checking edit controls, such as this one:

A WTL Hunspell-checked Edit Control
http://www.codeproject.com/Articles/37517/A-WTL-Hunspell-checked-Edit-Control

Spell Checking Edit Control (Using HunSpell)
http://www.codeproject.com/Articles/21381/Spell-Checking-Edit-Control-Using-HunSpell
(this one uses MFC)

These apps use customized edit controls which keep track of where the misspelt words are displayed. The controls use this information to mark the bad words and, when the user right clicks, to display a context menu offering corrections. You should be able to use the same approach to swap the alternative spellings of your words.

(While they're using WTL and MFC, the same sort of approach could be used with raw Win32.)

Saving the choices?

A knock-on effect of this would be that you'd ideally need a way to save the choices you've made for later use.

Andy
My take on the Parser() function. It now works with wstring and handles ą, į, ų (now actual literals in the code.)

I have also replaced the code that copied the array with the standard copy algorithm.
http://www.cplusplus.com/reference/algorithm/copy/
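In isolation, the doubling step with std::copy might look like the sketch below. double_results is a made-up name for illustration. The one thing to watch is that the iterators are taken after resize(), because resize() may reallocate the vector and invalidate older iterators:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: doubles the vector so that the second half is a
// copy of the first half.
void double_results(std::vector<std::wstring>& results)
{
    const std::size_t old_count = results.size();
    results.resize(2 * old_count);

    // Iterators are obtained only after resize(), which may reallocate.
    std::vector<std::wstring>::iterator halfway = results.begin() + old_count;

    // Source is [begin, halfway), destination starts at halfway. The
    // destination lies outside the source range, as std::copy requires.
    std::copy(results.begin(), halfway, halfway);
}
```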

And keep track of the last char rather than getting it back out of the input string later, etc.

The app includes a little test, which writes to the console (but I can see no difference between the a and ą there) and to a Unicode text file (where you can see the difference.)

See following posts for:
- supporting headers : utils.h and tee.h
- some of the results

(Note that the code has been de-tabified so it displays better here; so if you want to try it out, and you prefer tabs, you'll just have to reverse the process.)

Andy

#define _SCL_SECURE_NO_WARNINGS
#include <iostream>
#include <iomanip>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <cstdio>  // for _fileno
#include <io.h>    // for _setmode
#include <fcntl.h> // for _O_U16TEXT
#include "utils.h" // for miscellaneous utility routines
#include "tee.h"   // for basic ostream tee
using namespace std;

vector<wstring> Parser(const wstring& Input);

void Test_Parser();

int main()
{
    // required to get VC++ runtime to display Unicode on
    // console (though chars don't look any different to me...)
    // The chars that turn up in the output file appear ok, though.
    _setmode(_fileno(stdout), _O_U16TEXT);

    Test_Parser();

    return 0;
}

vector<wstring> Parser(const wstring& Input)
{
    vector<wstring> Results(1); // assume one string to start with

    wchar_t ch_last = L'\0';

    const size_t Input_length = Input.length();
    for(size_t i = 0; i < (Input_length); ++i)
    {
        wchar_t ch_this = Input[i];

        if (ch_this == L'n')
        {
            // This character may output two possible outcomes.
            // Do a, i, or u with ogonek (ą, į, ų) and the other regular.

            // Now check to see if the previous characters are candidates
            // for nasalization. If there hasn't been a last char yet, then
            // ch_last will still be null (L'\0')
            if (utils::is_nasalizable(ch_last))
            {
                // Now we can get into the double output
                const size_t iCount_old = Results.size();
                // Resize the vector array
                Results.resize(2 * iCount_old);

                // Copy the contents to both halves of the newly doubled array
                // (using std::copy from <algorithm> header)
                vector<wstring>::iterator iterHalfway = Results.begin() + iCount_old;
                copy(Results.begin(), iterHalfway, iterHalfway);

                // Now get nasalized form of last character
                wchar_t ch_nasal = utils::nasalize(ch_last);

                // Now cycle through first half of the strings and nasalize last char
                for (size_t k = 0; k < (iCount_old); ++k)
                {
                    // replace last char with its nasalized (ogonek) form
                    wstring& result = Results[k];
                    size_t pos_last = result.length() - 1;
                    result[pos_last] = ch_nasal;
                }
            }

            // no need to handle 'n' here -- that's done like any other chars
        }
        else
        {
        }

        // Add latest (this) char to all strings
        {
            const size_t iCount = Results.size();
            for (size_t k = 0; k < iCount; ++k)
            {
                Results[k] += ch_this;
            }
        }

        // remember last char
        ch_last = ch_this;
    }

    return Results;
}

struct TestCase
{
    const wchar_t* input;
};

const TestCase testCases[] = {
    {L"n"     },
    {L"na"    },
    {L"an"    },
    {L"sultan"},
    {L"banana"},
    {L"animal"},
    {L"Zoltan"},
    {L"banana sultan"},
    {L"animal banana"},
    {L"Zoltan the animal banana sultan"},
    {L"      "},
    {L""      }
};

const size_t testCaseCount = sizeof(testCases) / sizeof(testCases[0]);

void Test_Parser()
{
    const wchar_t filePath[] = L"parser_test_results_msvc.txt";
    const wchar_t UTF16BOM   = L'\xFEFF'; // UTF-16 Byte Order Mark (BOM)

    // This approach appears to be needed to open file you can write
    // unusual unicode chars to. There might be a better way to do this,
    // but I have not yet managed to track it down. :-(
    FILE* fp = _wfopen(filePath , L"w");
    _setmode(_fileno (fp), _O_U16TEXT);
    wofstream ofs(fp);
    ofs << UTF16BOM;

    // Tee, so see o/p in console at the same time as writing to file.
    wteestream os(std::wcout, ofs);

    os << L"Test_Parser begin" << endl;
    os << endl;

    for(size_t index = 0; testCaseCount > index; ++index)
    {
        const TestCase& thisTestCase = testCases[index];
        const wstring   input        = thisTestCase.input;

        const size_t an_count = utils::count_substr(input, L"an");

        os << L"input : \"" << input << "\"" << endl;
        os << L"  " << an_count << L" \"an\"(s)" << endl;
        os << L"  " << utils::raise_to(2, an_count) << L" permutation(s) expected" << endl;
        os << endl;

        vector<wstring> Results = Parser(input);
        os << L"results : " << Results.size() << L" permutation(s) returned" << endl;
        for(size_t index = 0; Results.size() > index; ++index)
        {
            const wstring& result = Results[index];
            size_t a_with_ogonek_count = count(result.begin(), result.end(), L'ą');
            os << L"  Results[" << setw(2) << index << L"] = \"" << result << L"\""
               << L" [a with ogonek count : " << a_with_ogonek_count << L"]" << endl;
        }
        os << endl;
    }

    os << L"Test_Parser end" << endl;
    os << endl;

    fclose (fp);
}
Supporting headers:

utils.h

#ifndef Included_Utils_H
#define Included_Utils_H

#include <cstddef> // for size_t
#include <string>  // for std::wstring

namespace utils {

inline bool is_nasalizable(wchar_t ch)
{
    return ((ch == L'a') || (ch == L'i') || (ch == L'u'));
}

inline wchar_t nasalize(wchar_t ch)
{
    switch(ch)
    {
        case L'a' : return L'ą';
        case L'i' : return L'į';
        case L'u' : return L'ų';
        // etc
        default: { /* could assert here? */ }
    }
    return ch;
}

inline size_t count_substr(const std::wstring& str, const std::wstring& substr)
{
    const size_t substr_len = substr.length();
    size_t count = 0;
    size_t pos = 0;
    for( ; ; )
    {
        pos = str.find(substr, pos);
        if(pos == str.npos)
            return count;
        ++count;
        pos += substr_len;
    }
}

// for testing (to calculate permutations)
inline int raise_to(int m, int n)
{
    if(0 > n)
        return -1;
    if(0 == m)
        return (0 == n) ? -1 : 0;
    int value = 1;
    while(0 < n)
    {
        value *= m;
        --n;
    }
    return value;
}

} // end namespace utils

#endif // Included_Utils_H 


tee.h

#ifndef Included_Tee_H
#define Included_Tee_H

// Slightly modified version of header by vijayan. The two tweaks are:
// - the header guards
// - the wchar_t typedefs
//
// http://www.daniweb.com/software-development/cpp/threads/326447/tee-command
// Tee command? ("2 Years Ago", from June 2013)
// vijayan121 "Posting Virtuoso"
//
// http://forums.devx.com/showthread.php?175218-streambuf
// streambuf (11-25-2010)
// vijayan

#include <iostream>

template < typename CHAR_TYPE,
           typename TRAITS_TYPE = std::char_traits<CHAR_TYPE> >
struct basic_teebuf : public std::basic_streambuf< CHAR_TYPE, TRAITS_TYPE >
{
    typedef std::basic_streambuf< CHAR_TYPE, TRAITS_TYPE > streambuf_type ;
    typedef typename TRAITS_TYPE::int_type int_type ;

    basic_teebuf( streambuf_type* buff_a, streambuf_type* buff_b )
            : first(buff_a), second(buff_b) {}

    protected:
        virtual int_type overflow( int_type c )
        {
            const int_type eof = TRAITS_TYPE::eof() ;
            if( TRAITS_TYPE::eq_int_type( c, eof ) )
                return TRAITS_TYPE::not_eof(c) ;
            else
            {
                const CHAR_TYPE ch = TRAITS_TYPE::to_char_type(c) ;
                if( TRAITS_TYPE::eq_int_type( first->sputc(ch), eof ) ||
                    TRAITS_TYPE::eq_int_type( second->sputc(ch), eof ) )
                        return eof ;
                else return c ;
            }
        }

        virtual int sync()
        { return !first->pubsync() && !second->pubsync() ? 0 : -1 ; }

    private:
        streambuf_type* first ;
        streambuf_type* second ;
};

template < typename CHAR_TYPE,
           typename TRAITS_TYPE = std::char_traits<CHAR_TYPE> >
struct basic_teestream : public std::basic_ostream< CHAR_TYPE, TRAITS_TYPE >
{
    typedef std::basic_ostream< CHAR_TYPE, TRAITS_TYPE > stream_type ;
    typedef basic_teebuf< CHAR_TYPE, TRAITS_TYPE > streambuff_type ;

    basic_teestream( stream_type& first, stream_type& second )
         : stream_type( &stmbuf ), stmbuf( first.rdbuf(), second.rdbuf() ) {}

    ~basic_teestream() { stmbuf.pubsync() ; }

    private: streambuff_type stmbuf ;
};

typedef basic_teebuf<char> teebuf ;
typedef basic_teestream<char> teestream ;

typedef basic_teebuf<wchar_t> wteebuf ;
typedef basic_teestream<wchar_t> wteestream ;

#endif // Included_Tee_H


A few of the results (from o/p file parser_test_results_msvc.txt)

As I don't know any of the languages which use ogoneks, I improvised.

input : "banana"
  2 "an"(s)
  4 permutation(s) expected

results : 4 permutation(s) returned
  Results[ 0] = "bąnąna" [a with ogonek count : 2]
  Results[ 1] = "banąna" [a with ogonek count : 1]
  Results[ 2] = "bąnana" [a with ogonek count : 1]
  Results[ 3] = "banana" [a with ogonek count : 0]

input : "Zoltan"
  1 "an"(s)
  2 permutation(s) expected

results : 2 permutation(s) returned
  Results[ 0] = "Zoltąn" [a with ogonek count : 1]
  Results[ 1] = "Zoltan" [a with ogonek count : 0]

input : "animal banana"
  3 "an"(s)
  8 permutation(s) expected

results : 8 permutation(s) returned
  Results[ 0] = "ąnimal bąnąna" [a with ogonek count : 3]
  Results[ 1] = "animal bąnąna" [a with ogonek count : 2]
  Results[ 2] = "ąnimal banąna" [a with ogonek count : 2]
  Results[ 3] = "animal banąna" [a with ogonek count : 1]
  Results[ 4] = "ąnimal bąnana" [a with ogonek count : 2]
  Results[ 5] = "animal bąnana" [a with ogonek count : 1]
  Results[ 6] = "ąnimal banana" [a with ogonek count : 1]
  Results[ 7] = "animal banana" [a with ogonek count : 0]
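The order of these results follows a bit pattern: in Results[k], the j-th "an" (counting from the left, starting at 0) is nasalized exactly when bit j of k is clear. The sketch below reproduces that ordering directly. It is not the Parser() from the post, just a made-up alternative (an_variants) that handles only "an" for brevity and assumes there are few enough sites that 2^sites fits in a size_t:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alternative to Parser(), handling only "an" for brevity.
std::vector<std::wstring> an_variants(const std::wstring& input)
{
    // Positions of every 'a' that is directly followed by 'n'.
    std::vector<std::size_t> sites;
    for (std::size_t i = 0; i + 1 < input.length(); ++i)
    {
        if (input[i] == L'a' && input[i + 1] == L'n')
        {
            sites.push_back(i);
        }
    }

    // 2^sites variants, each starting out as a copy of the input.
    const std::size_t total = std::size_t(1) << sites.size();
    std::vector<std::wstring> results(total, input);

    for (std::size_t k = 0; k < total; ++k)
    {
        for (std::size_t j = 0; j < sites.size(); ++j)
        {
            // A clear bit j in k means site j gets the ogonek.
            if (((k >> j) & 1) == 0)
            {
                results[k][sites[j]] = L'\u0105'; // a with ogonek
            }
        }
    }
    return results;
}
```

This also makes the expected count easy to see: each "an" contributes one bit, so n sites give 2^n strings.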
Wow!!! It'll take me a bit to go through your code to see how it works. I like how you have an output file that shows the results, permutations, etc. I'll poke through it and see what I can learn :).

As far as the idea with the combo boxes, the reason I suggest it is because every single word will need to be parsed. Some orthographies use the "d" and "l" (lowercase L) characters to represent the "a" as in "father". Others use the "f" to represent the "ng" sound as in "sing". Others use the "g" for the "th" or "eth" sound. Or maybe they use an "e" where we use an "i" for the "ee" sound as in "peek". So you can see, it isn't only the ogoneks that I have to deal with but easily half of the characters in each word can have different or multiple outputs. For example, the "g" I mentioned above cranks out either a "th" or "eth". I also have the "h" character giving outputs with an "h" as well as an "x" (which represents a sound like the "ch" in German). There is a lot of character swapping going on and double outputs where the sound could go either way (the a or a with ogonek being a good example).
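Swaps like these, where one character has a fixed list of possible outputs, could be driven by a table rather than by one if/else branch per character. The sketch below is only an illustration under that assumption: RuleTable and expand are made-up names, and context-dependent rules (such as the vowel before "n") are not covered by it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical rule table: source character -> list of possible outputs.
typedef std::map<wchar_t, std::vector<std::wstring> > RuleTable;

// Hypothetical table-driven expansion: each input character is replaced by
// each of its listed alternatives; unlisted characters pass through as-is.
std::vector<std::wstring> expand(const std::wstring& input, const RuleTable& rules)
{
    std::vector<std::wstring> results(1); // start with one empty string
    for (std::size_t i = 0; i < input.length(); ++i)
    {
        // Default: the character maps to itself only.
        std::vector<std::wstring> alternatives(1, std::wstring(1, input[i]));
        RuleTable::const_iterator it = rules.find(input[i]);
        if (it != rules.end() && !it->second.empty())
        {
            alternatives = it->second;
        }

        // Every existing result is extended by every alternative.
        std::vector<std::wstring> next;
        for (std::size_t r = 0; r < results.size(); ++r)
        {
            for (std::size_t a = 0; a < alternatives.size(); ++a)
            {
                next.push_back(results[r] + alternatives[a]);
            }
        }
        results.swap(next);
    }
    return results;
}
```

With 'g' mapped to "th" and "eth", and 'h' mapped to "h" and "x", the input "gh" yields four strings, which shows how quickly the number of outputs grows.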

That is why the idea of combo boxes sounds appealing. If I could input the text from these documents and have each word represented by a combo box that is populated by each possibility, I can then select the best one for each word and make it somewhat readable for us. Does that make sense? Plus with combo boxes, I can always select a different permutation if I need to as I work with the text.

And this reminds me again...I still need to make that post about combo boxes!
That is why the idea of combo boxes sounds appealing.

I wasn't saying you shouldn't use combo boxes.

But it's more efficient if you just use one and move it to where it's required, displaying the rest of the string using an Edit control (though RichEdit might appeal, as it supports colour.) That is, the combo box moves to the word you click on (you handle WM_LBUTTONDOWN and do a hit test to see if the word has alternative spellings; if so, you relocate the combo box there and populate it with the details of the selected word.) This will be more efficient if you're talking about loads and loads of words with alternative spellings. See the spellchecker apps for how this kind of approach hangs together.
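The text side of such a hit test can be sketched without any Win32 code. Assuming the click has already been turned into a character index (with an Edit control, the EM_CHARFROMPOS message is one way to get it), the made-up helper word_at below finds the word around that index. It treats only the space character as a separator, which is a simplification:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: given a character index into text, returns the
// [begin, end) span of the word containing it, or an empty span
// (begin == end) if the index is on a space or out of range.
std::pair<std::size_t, std::size_t> word_at(const std::wstring& text, std::size_t index)
{
    if (index >= text.length() || text[index] == L' ')
    {
        return std::make_pair(index, index);
    }

    // Walk left to the start of the word.
    std::size_t begin = index;
    while (begin > 0 && text[begin - 1] != L' ')
    {
        --begin;
    }

    // Walk right to one past the end of the word.
    std::size_t end = index;
    while (end < text.length() && text[end] != L' ')
    {
        ++end;
    }

    return std::make_pair(begin, end);
}
```

The span can then be used both to look up the word's alternative spellings and to work out where to place the combo box.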

It sounds a bit like the way Google translate allows you to select from possible translations?

Andy

PS I replaced the implementation of count_substr() (in utils.h above) about 5 mins ago -- 22:25 UTC, 29 June 2013 -- as it had a bug in it: it was assuming the substring was 2 chars long...
I like the sound of that more than row after row of clunky combo boxes. I think I see where you are going with that and have a rough idea in my head on how it could be done. But just to be clear, an edit control can be made into a larger control so it isn't just one line but a large portion of the window, right? Like some sort of word processor? Is there already a function in place to overlay a combo box over a string or would I have to attempt it from the ground-up?

I'll look into the spellchecker stuff to see how it works. I haven't messed around with Google translate (ironic, huh? LOL) but I think I see what you're saying. But speaking of translate, I do have some thoughts on maybe tying this into some sort of language database to check for possible matches once the text has been parsed. But I'll get the parsing working first :).

I haven't heard of Richedit but your mention of color definitely caught my attention!

I also have an idea to work with phonetic attempts at spelling historic words and names. For example, something like "wah-doe-nar-bee" (that's just a random jumble but you get the idea) could be parsed. What I'd like to do is see if I can figure out how to do this dynamically from within the program where rather than hardcoding matches, the program could be told by the user what syllables, strings, etc. could become and then output accordingly. I might do the same thing with characters. That way this program could be flexible for pretty much anyone who wants to use it. I've been thinking about this for a week now and I think I'll start trying to hammer the code out. It may be a little ambitious for me but you learn best when you get pushed just outside your comfort zone :). I'm learning that simple programs from simple ideas can balloon up very quickly!!
But just to be clear, an edit control can be made into a larger control so it isn't just one line but a large portion of the window, right?

That's exactly what Notepad.exe's window is: a big Edit box.

(If you use Spy++.exe to look at the Notepad's window structure, you will see it's a frame window with two child windows: an Edit control and a statusbar.)

Is there already a function in place to overlay a combo box over a string or would I have to attempt it from the ground-up?

There's no standard Win32 call. But you could always borrow from an open source project (license permitting.)

I haven't heard of Richedit but your mention of color definitely caught my attention!

See below for a couple of refs.

(And the standard Windows app WordPad uses a RichEdit control in the same kind of way as Notepad uses an Edit control. You can use the Format/Font menu to set font, text size, text style, and color.)

I also have an idea to work with phonetic attempts at spelling historic words and names.

I'm sort of involved with an open source project (almost, because it's quiescent at the moment...) that has been looking at the use of phonetic-based approaches to spellchecking. The app looks for alternative spellings which give the same sound by (a) chopping each word up into all possible substrings, (b) matching these substrings against a list of possible phonemes, (c) constructing a set of possible phonetic spellings, and finally (d) using the phonetic spellings to find possible words in a dictionary (by phonetic spelling.)
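Steps (a) and (b) can be sketched as a small recursive search. This is not the project's actual code, which is not shown in this thread; segment and the units set are made up for illustration, and steps (c) and (d) are left out:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::wstring> Segmentation;

// Hypothetical helper: collects every way of splitting word, from pos to
// the end, into pieces that all appear in units (a made-up list of
// spelling units standing in for the phoneme list).
void segment(const std::wstring& word, std::size_t pos,
             const std::set<std::wstring>& units,
             Segmentation& current, std::vector<Segmentation>& out)
{
    if (pos == word.length())
    {
        out.push_back(current); // reached the end: one complete split
        return;
    }
    for (std::size_t len = 1; pos + len <= word.length(); ++len)
    {
        const std::wstring piece = word.substr(pos, len);
        if (units.count(piece) != 0)
        {
            current.push_back(piece);
            segment(word, pos + len, units, current, out);
            current.pop_back(); // backtrack and try a longer piece
        }
    }
}
```

With the units "t", "h", "th" and "e", the word "the" splits two ways: t + h + e and th + e.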

The dictionary, including phonetic data, is stored in a custom data file. But I have wondered about moving it to use SQLite instead.

I'm learning that simple programs from simple ideas can balloon up very quickly!!

Feature creep!?!

Feature Creep
http://search.dilbert.com/comic/Feature%20Creep

You need to fix a set of features for your first release and then stick to them. Add all new ideas to the "to do in the future" list. When your first release goes golden, then you can prioritise the list to decide on what feature to add to version 2 (or 1.1?)

Andy

About Rich Edit Controls
http://msdn.microsoft.com/en-us/library/windows/desktop/bb787873%28v=vs.85%29.aspx

e.g.

Fast HTML syntax highlighting with the Rich Edit control
http://www.codeproject.com/Articles/13581/Fast-HTML-syntax-highlighting-with-the-Rich-Edit-c

SQLite
http://www.sqlite.org/

Looks like I am "feature creeping" my own project LOL! This started out as a simple parser and that program works just fine (thanks to the help I've received on the forums here). I think I do need to lay out some features and stick with them as you suggest. Right now my brain is awash with ideas and possibilities. This happens every time I get a "surge" in my programming ability. New possibilities open up to me and each has their own host of issues as I try to figure them out! I think I'll take a step back and do some designing/planning.

I've started tinkering with a full-blown windows application project in Visual Studio last night (my other program that you have been helping me with was just a small dialog box). I've found a RichTextBox in the toolbox. Is that the same as the Rich Edit you are talking about? I am thinking so but your link is different than the one I was looking at about the RichTextBox so I'm not 100% sure.

I've also been thinking more about the possibility of having the user set up what output goes with specific characters and how that should be stored. You mentioned using a custom data file. My project doesn't sound as extensive as yours so would something like a data file work for me? As of now, my only experience with writing to a file was a tutorial a long time ago that had me output "I am writing to a text file" to a file and be able to retrieve it. This will definitely be a learning experience for me :). I've also been recently introduced to databases and no doubt will want to look into the SQLite you mentioned.

On the subject of having the user set up matches, I have an idea on how that could be done. I'll post my code for it in a bit once I hammer it out.
Right now my brain is awash with ideas and possibilities.

Be sure to record them all, while the muse grabs you!!

Is that the same as the Rich Edit you are talking about?

Nope. The RichTextBox is the managed counterpart to the Rich Edit control. (Actually, there are two of them: the Windows Forms version, and the Windows Presentation Foundation one.)

From your earlier question about ComboBox, etc, I took it that you were using the old-school Win32 API -- things like CreateDialog, PostMessage, etc. The Rich Edit control fits in here.

If you want to use the RichTextBox then you have to work with C++/CLI, a managed language based on C++, which has its own way of doing things.

This tutorial article shows RichTextBox being used from C++/CLI

Windows Controls: The Rich Text
http://www.functionx.com/vccli/controls/rtb.htm

would something like a data file work for me?

Well, it really depends on your data. But the answer is very probably yes, with the right file formatting.

In the end you can switch to an alternative storage mechanism in the future if it doesn't live up to expectations, with the help of a tool to convert the old data to the new format.

So, what sort of information do you need to store??

"I am writing to a text file" / SQLite

SQLite might be best left until later. You should be able to live with a data file for your first version.

I take it you've already seen the tutorial on this site?

Input/Output with files
http://www.cplusplus.com/doc/tutorial/files/

Andy

C++/CLI
http://en.wikipedia.org/wiki/C%2B%2B/CLI
Right now all I am thinking that I need to store is the created objects that would store a user's character matches (for example, they would be able to create a system of character matches). For example, let's say I am trying to make this program flexible so other users can create their own swapping systems. They can create their own system, give it a name, and match characters that are inputted with characters that they want to have outputted. So basically just a name and an array of objects that store two strings. I'm thinking a data file will work fine for now.

I think I have taken care of the global variable issue you pointed out a few posts back. I created a Parser class whereas before it was all done in my main.cpp file.

I also tried messing with a full-blown windows application (when I created the project) and that got me even more confused as to how everything was working so I think I'll stick with my little dialog box for now :).
Right now here is the idea that I have for storing user's character matches. I'm thinking that I can create a class of character matches that will perhaps hold the name of the match (I would think that would be optional) and two wstrings. The first one would be what to look for and the other would be what the output would be. Then I am thinking that another class called Orthography would be a collection of those character match objects along with a name for that collection. As each orthography (just calling it that for the lack of a better term) gets created by the user, it is added to a dropdown menu to be selected when the user wants to parse some text.
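Roughly, the two classes might look like the sketch below. Everything here is hypothetical (the member names, and especially the apply function, which is just one simple way the matches could be used), so treat it as a starting point rather than finished code.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One match: what to look for, what to output, and an optional name.
class CharacterMatch {
public:
    CharacterMatch(std::wstring findText, std::wstring replaceText,
                   std::wstring name = L"")
        : m_find(std::move(findText)),
          m_replace(std::move(replaceText)),
          m_name(std::move(name)) {}

    const std::wstring& find() const { return m_find; }
    const std::wstring& replacement() const { return m_replace; }
    const std::wstring& name() const { return m_name; }

private:
    std::wstring m_find;
    std::wstring m_replace;
    std::wstring m_name;
};

// A named collection of matches.
class Orthography {
public:
    explicit Orthography(std::wstring name) : m_name(std::move(name)) {}

    void add(const CharacterMatch& match) { m_matches.push_back(match); }
    const std::wstring& name() const { return m_name; }

    // Scan left to right. At each position use the first match whose
    // "find" text occurs there; otherwise copy the character unchanged.
    std::wstring apply(const std::wstring& input) const
    {
        std::wstring output;
        std::size_t i = 0;
        while (i < input.size()) {
            bool matched = false;
            for (const auto& m : m_matches) {
                const std::wstring& f = m.find();
                if (!f.empty() && input.compare(i, f.size(), f) == 0) {
                    output += m.replacement();
                    i += f.size();
                    matched = true;
                    break;
                }
            }
            if (!matched)
                output += input[i++];
        }
        return output;
    }

private:
    std::wstring m_name;
    std::vector<CharacterMatch> m_matches;
};
```

Because the first match in the list wins, longer patterns such as "sh" would have to be added before shorter ones such as "s". The name of each Orthography is what would go into the dropdown menu.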

Does that sound feasible? Or would that be going about it the hard way?