Parsing a string into tokens

I need some way of parsing a string into tokens. I'm reading the example from a file.

EX:
BEGIN
a := 1 + (a) - 10;
END

where a, :, =, +, -, (, ), and ; are all tokens.
Tokenizing in C is quite simple:

1. Create a copy of the string (because it needs to be writeable memory).
2. Use strtok_s() or wcstok_s() (the latter if you are using wide characters, and you should be) to tokenize.
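
The two steps above can be sketched in C++ roughly as follows. This is only an illustration: it uses std::strtok from <cstring> instead of strtok_s(), because strtok_s() is not available with the same signature on every toolchain, and the function name split_with_strtok is made up for this example.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any character listed in `delimiters`.
// std::strtok overwrites delimiters with '\0', so it works on a private copy.
std::vector<std::string> split_with_strtok( const std::string& text,
                                            const char* delimiters )
{
    // Step 1: writable, null-terminated copy of the input.
    std::vector<char> buffer( text.begin(), text.end() ) ;
    buffer.push_back( '\0' ) ;

    // Step 2: the first call gets the buffer, later calls get nullptr.
    std::vector<std::string> tokens ;
    for( char* tok = std::strtok( buffer.data(), delimiters ) ;
         tok != nullptr ;
         tok = std::strtok( nullptr, delimiters ) )
    {
        tokens.push_back( tok ) ;
    }
    return tokens ;
}
```

Calling split_with_strtok( "a := 1 + (a) - 10;", " " ) yields seven pieces, and "(a)" and "10;" stay glued together. Splitting on delimiters alone is therefore not enough when tokens are not separated by whitespace.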

I haven't tokenized in C++, but I think I once saw someone using the extraction operator on a string stream. The extraction operator itself splits on whitespace; if you want a different delimiter, std::getline() accepts one as its third argument. Then you simply extract strings, which will be the tokens. Look it up.
If you know that cin already tokenizes on whitespace, what would you change to make it tokenize on any character?
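
A minimal sketch of the string stream approach, assuming the input is already in a std::string. The function names split_on_whitespace and split_on_char are invented for this example. The first uses operator>>, which skips whitespace; the second uses std::getline with a single delimiter character.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Whitespace-separated extraction with operator>>.
std::vector<std::string> split_on_whitespace( const std::string& text )
{
    std::istringstream stm( text ) ;
    std::vector<std::string> tokens ;
    std::string tok ;
    while( stm >> tok ) tokens.push_back( tok ) ;
    return tokens ;
}

// Splitting on one delimiter character with std::getline.
// Empty fields (two delimiters in a row) are dropped in this sketch.
std::vector<std::string> split_on_char( const std::string& text, char delimiter )
{
    std::istringstream stm( text ) ;
    std::vector<std::string> tokens ;
    std::string field ;
    while( std::getline( stm, field, delimiter ) )
        if( !field.empty() ) tokens.push_back( field ) ;
    return tokens ;
}
```

Neither function separates "(abc)" into "(", "abc" and ")", so this only helps when every token is surrounded by the delimiter.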
In
BEGIN
abc := 9999 + (abc) - 1234 ;
END


Would abc be one token or three separate tokens a, b, c?

Would 9999 and 1234 be one token each or four tokens each?

And is := one token, or two tokens, : and =?

In short, would this be fine?
BEGIN
ab:c = 99(99 + abc - )12;34 
END
abc is one token, but : and = are separate tokens; 9999 is also one token.
So you need to be able to recognize the tokens in the string first before you can start thinking about how to split the string into tokens.

Start with something like:

enum token_type { IDENTIFIER /* eg. 'abcd' */, CONSTANT /* eg. '12345' */, 
                  OPERATOR /* eg '+' */, TERMINATOR /* eg. ';' */ 
                  /* ... etc ... */, INVALID = -1 }; 

// what is the type of the token at the start of string str,
// and how long (how many characters) is the token?
// invariant: str does not contain leading white spaces
// note: the match is a 'greedy' match - match as many characters as possible
// eg: str contains 'abc12:= 78 ;'
// result: token_type is IDENTIFIER, token_length is 5
// 
// eg: str contains '486)+'
// result: token_type is CONSTANT, token_length is 3
token_type recognize( const std::string& str, /* out */ std::size_t& token_length ) ;


Make sure it is working correctly, and then we can move on to extracting the token.

Hint: #include <regex>
http://en.cppreference.com/w/cpp/regex
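
One possible way to write recognize() with <regex> is sketched below. The enum is repeated so the snippet compiles on its own. The patterns are assumptions, not part of the original suggestion: identifiers follow the usual letter/underscore/digit rule, and : = + - ( ) are each classified as a one-character OPERATOR. The flag std::regex_constants::match_continuous makes regex_search accept a match only at the very start of the string.

```cpp
#include <cstddef>
#include <regex>
#include <string>

enum token_type { IDENTIFIER, CONSTANT, OPERATOR, TERMINATOR, INVALID = -1 } ;

// Sketch only: classify the token at the start of str and report its length.
// Precondition (as in the declaration above): no leading white space.
token_type recognize( const std::string& str, std::size_t& token_length )
{
    struct rule { token_type type ; std::regex pattern ; } ;
    static const rule rules[] = {
        { IDENTIFIER, std::regex( "[A-Za-z_][A-Za-z_0-9]*" ) },
        { CONSTANT,   std::regex( "[0-9]+" ) },
        { OPERATOR,   std::regex( ":|=|\\+|-|\\(|\\)" ) }, // assumed operator set
        { TERMINATOR, std::regex( ";" ) }
    } ;

    for( const rule& r : rules )
    {
        std::smatch match ;
        // match_continuous: the match must begin at the first character
        if( std::regex_search( str, match, r.pattern,
                               std::regex_constants::match_continuous ) )
        {
            token_length = static_cast<std::size_t>( match.length(0) ) ;
            return r.type ;
        }
    }

    token_length = 0 ;
    return INVALID ;
}
```

The quantifiers * and + are greedy, so 'abc12:= 78 ;' gives IDENTIFIER with length 5 and '486)+' gives CONSTANT with length 3, as in the comments on the declaration.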
Is there a way to scan over an enum type to see if it matches a string?
> Is there a way to scan over an enum type to see if it matches a string?

A lookup table could be used:

enum colour_t { BLACK, RED, GREEN, BLUE } ;

const std::string colour_names[] =  { "BLACK", "RED", "GREEN", "BLUE" } ;
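
As a rough sketch of how such a table might be scanned, the snippet below repeats the enum and the table and adds two helpers. The names num_colours, colour_from_string and colour_name are invented for this example, and the code relies on the enumerators having the default values 0, 1, 2, 3 in the same order as the table.

```cpp
#include <cstddef>
#include <string>

enum colour_t { BLACK, RED, GREEN, BLUE } ;

const std::string colour_names[] = { "BLACK", "RED", "GREEN", "BLUE" } ;
const std::size_t num_colours = sizeof(colour_names) / sizeof(colour_names[0]) ;

// String -> enum: scan the table; on success store the enumerator in result.
bool colour_from_string( const std::string& name, colour_t& result )
{
    for( std::size_t i = 0 ; i < num_colours ; ++i )
    {
        if( colour_names[i] == name )
        {
            result = static_cast<colour_t>(i) ; // index equals enumerator value
            return true ;
        }
    }
    return false ;
}

// Enum -> string: the enumerator value is the index into the table.
const std::string& colour_name( colour_t c ) { return colour_names[c] ; }
```

The same pattern would work for a table of keywords or token type names, as long as the table and the enum are kept in the same order.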



If the question meant: is there a way to see if a string is an identifier?

Something like this would check if a string is a valid C++ identifier:
#include <string>
#include <cctype>

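// note: the <cctype> functions require a value representable as unsigned char,
// so convert before the call to stay well-defined for negative char values
inline bool is_valid_first_char( char c )
{ return std::isalpha( static_cast<unsigned char>(c) ) || ( c == '_' ) ; }

inline bool is_valid_char( char c )
{ return is_valid_first_char(c) || std::isdigit( static_cast<unsigned char>(c) ) ; }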

bool is_valid_identifier( const std::string& str )
{
    if( str.empty() || !is_valid_first_char( str[0] ) ) return false ;
    for( std::size_t i = 1 ; i<str.size() ; ++i )
        if( !is_valid_char( str[i] ) ) return false ;
    return true ;
}

Here's what I've got so far. It splits on every character as a token...

#include <iostream>
#include <string>
#include <fstream>
#include <vector>

using namespace std;
string token_type[] ={"IDENTIFIER","CONSTANT","OPERATOR","KEYWORD","TERMINATOR"};
string special[] = {"(", ")", ":=", ";", ","};
vector <string> mystring;

string myoperator[] = { "+", "-"};
string mykeyword[] = {"BEGIN", "END", "READ", "WRITE"};
bool compare(string);
ifstream indata;
ofstream outdata;

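int main()
{
    string str, line;
    cout << "Enter name of file: ";
    cin >> str; // Enter the file name
    string temp, str1;
    indata.open( str.data() ); // Open file
    if (!indata)
    {
        cerr << "Could not open " << str << endl;
        return 1;
    }
    cout << endl;
    // Read whitespace-separated words. Testing the extraction itself
    // (instead of eof()) avoids processing the last word twice.
    while (indata >> line)
    {
        if (!compare(line))
        {
            // Not a known token as a whole: look at it one character at a time.
            for (size_t r = 0; r < line.size(); r++)
            {
                temp = line.substr(r, 1);

                if (!compare(temp))
                {
                    str1 += temp; // keep collecting characters
                }
                else
                {
                    // A known one-character token ends the collected text.
                    if (!str1.empty()) mystring.push_back(str1);
                    str1 = "";
                    mystring.push_back(temp);
                }
            }
            if (!str1.empty()) mystring.push_back(str1);
            str1 = "";
        }
        else
            mystring.push_back(line);
    }
    for (size_t s = 0; s < mystring.size(); s++)
    {
        cout << mystring[s] << endl;
    }
    return 0;
}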

// Returns true if line is exactly one of the known operators,
// keywords or special symbols.
bool compare(string line)
{
    for (int i = 0; i < 2; i++)
    {
        if (line == myoperator[i])
        {
            return true;
        }
    }
    for (int j = 0; j < 4; j++)
    {
        if (line == mykeyword[j])
        {
            return true;
        }
    }
    for (int k = 0; k < 5; k++)
    {
        if (line == special[k])
        {
            return true;
        }
    }
    return false;
}
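
One way to avoid the per-character splitting might be to decide what to do from the class of the current character and then consume the whole run, as in the sketch below. This is not a drop-in replacement for the program above: tokenize is a made-up name, it treats := as a single token (as the special[] array does), and every other character that is not a letter, digit, underscore or white space becomes a one-character token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: split one line into tokens by character class.
std::vector<std::string> tokenize( const std::string& line )
{
    std::vector<std::string> tokens ;
    std::size_t i = 0 ;
    while( i < line.size() )
    {
        const unsigned char c = static_cast<unsigned char>( line[i] ) ;
        if( std::isspace(c) ) { ++i ; continue ; } // white space separates tokens

        const std::size_t start = i ;
        if( std::isalpha(c) || c == '_' )
        {
            // identifier or keyword: letters, digits, underscores
            while( i < line.size() &&
                   ( std::isalnum( static_cast<unsigned char>( line[i] ) ) ||
                     line[i] == '_' ) ) ++i ;
        }
        else if( std::isdigit(c) )
        {
            // constant: a run of digits
            while( i < line.size() &&
                   std::isdigit( static_cast<unsigned char>( line[i] ) ) ) ++i ;
        }
        else if( line.compare( i, 2, ":=" ) == 0 )
        {
            i += 2 ; // two-character symbol
        }
        else
        {
            ++i ; // any other single character, eg. '+', '(', ';'
        }
        tokens.push_back( line.substr( start, i - start ) ) ;
    }
    return tokens ;
}
```

With this, "abc := 9999 + (abc) - 1234 ;" and the same line written without any spaces both produce the same ten tokens. Keywords such as BEGIN come out as ordinary words and can be checked against mykeyword[] afterwards.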