Lexical analyzer c++

calling signature:
Token getNextToken(istream *in, int *linenumber);

Is anybody familiar with lexical analyzers? Can anybody explain the problem?

Thanks

Assignment:
Note that any error detected by the lexical analyzer should result in the ERR token, with the
lexeme value equal to the string recognized when the error was detected.
Note also that both ERR and DONE are unrecoverable. Once the getNextToken function returns
a Token for either of these token types, you shouldn’t call getNextToken again.


The assignment is to write the lexical analyzer function and some test code around it.
It is a good idea to implement the lexical analyzer in one source file, and the main test program
in another source file.
The test code is a main() program that takes several command line arguments:
-v (optional) if present, every token is printed when it is seen
-sum (optional) if present, summary information is printed
-allids (optional) if present, a list of the lexemes for all identifiers should be printed in
alphabetical order
filename (optional) if present, read from the filename; otherwise read from standard in
Note that no other flags (arguments that begin with a dash) are permitted. If an unrecognized
flag is present, the program should print “INVALID FLAG {arg}”, where {arg} is whatever flag
was given, and it should stop running.
At most one filename can be provided, and it must be the last command line argument. If more
than one filename is provided, the program should print “TOO MANY FILE NAMES” and it
should stop running.
If the program cannot open a filename that is given, the program should print “UNABLE TO
OPEN {arg}”, where {arg} is the filename given, and it should stop running.
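
For reference, here is a minimal sketch of the argument handling described above. The structure and variable names are my own guesses; the spec does not say whether the messages go to cout or cerr, so cout is assumed.

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char *argv[])
{
    bool vFlag = false, sumFlag = false, allidsFlag = false;
    string filename;

    for (int i = 1; i < argc; i++) {
        string arg = argv[i];
        if (arg == "-v")           vFlag = true;
        else if (arg == "-sum")    sumFlag = true;
        else if (arg == "-allids") allidsFlag = true;
        else if (arg[0] == '-') {                 // any other flag is invalid
            cout << "INVALID FLAG " << arg << endl;
            return 1;
        }
        else if (!filename.empty()) {             // a second filename was given
            cout << "TOO MANY FILE NAMES" << endl;
            return 1;
        }
        else
            filename = arg;
    }

    ifstream file;
    istream *in = &cin;                           // default: read standard input
    if (!filename.empty()) {
        file.open(filename);
        if (!file.is_open()) {
            cout << "UNABLE TO OPEN " << filename << endl;
            return 1;
        }
        in = &file;
    }

    // ... lexing loop goes here (see the next sketch) ...
    return 0;
}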
The program should repeatedly call the lexical analyzer function until it returns DONE or ERR. If
it returns DONE, the program proceeds to handling the -allids and -sum options, if any, and then
exits. If it returns ERR, the program should print “Error on line N ({lexeme})”, where N is the line
number in the token and lexeme is the lexeme from the token, and it should stop running.
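
Continuing the sketch above, the main loop could look roughly like this. It assumes the in, vFlag, sumFlag, and allidsFlag variables from the previous sketch, plus #include <vector> and "tokens.h"; whether line counting starts at 0 or 1 is up to your lexer.

    int linenum = 1;                      // or 0, depending on your lexer
    int tokenCount = 0, identCount = 0, stringCount = 0;
    vector<Token> tokens;                 // saved for -allids

    Token tok;
    while ((tok = getNextToken(in, &linenum)) != DONE && tok != ERR) {
        tokens.push_back(tok);
        tokenCount++;
        if (tok == IDENT)  identCount++;
        if (tok == SCONST) stringCount++;
        if (vFlag)
            cout << tok << endl;          // operator<< from tokens.h does the formatting
    }

    if (tok == ERR) {
        cout << "Error on line " << tok.GetLinenum()
             << " (" << tok.GetLexeme() << ")" << endl;
        return 1;
    }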
If the -v option is present, the program should print each token as it is read and recognized, one
token per line. The output format for the token is the token name in all capital letters (for
example, the token LPAREN should be printed out as the string LPAREN). In the case of the tokens
IDENT, ICONST, and SCONST, the token name should be followed by a space and the lexeme
in parens. For example, if the identifier “hello” is recognized, the -v output for it would be
IDENT (hello)

If the -sum option is present the program should, after seeing the DONE token and processing
the -allids option, print out the following report:
Total lines: L
Total tokens: N
Total identifiers: I
Total strings: X
Where L is the number of input lines, N is the number of tokens (not counting DONE), I is the
number of IDENT tokens, and X is a count of the number of SCONST tokens.
If N is zero, no further lines are printed.
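
Using the counters from the loop sketch above, the -sum report might be produced like this; I am reading "no further lines" to mean that the identifier and string lines are skipped when N is zero.

    if (sumFlag) {
        cout << "Total lines: " << linenum << endl;      // assuming the lexer leaves linenum at the line count
        cout << "Total tokens: " << tokenCount << endl;
        if (tokenCount != 0) {
            cout << "Total identifiers: " << identCount << endl;
            cout << "Total strings: " << stringCount << endl;
        }
    }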
If the -allids option is present, the program should, after seeing the DONE token, print out the
string IDENTIFIERS: followed by lexemes for all of the identifiers, in alphabetical order,
separated by commas. If there are no identifiers, then nothing is printed.
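
One way to produce that list, again assuming the tokens vector from the loop sketch (needs #include <set>). A std::set keeps the lexemes unique and already in alphabetical order, which is one reading of the spec; if duplicates should be repeated, a sorted vector would be used instead.

    if (allidsFlag) {
        set<string> ids;
        for (const Token& t : tokens)
            if (t == IDENT)
                ids.insert(t.GetLexeme());

        if (!ids.empty()) {
            cout << "IDENTIFIERS: ";
            bool first = true;
            for (const string& id : ids) {
                if (!first) cout << ", ";
                cout << id;
                first = false;
            }
            cout << endl;
        }
    }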
PART 1:
● Compiles
● Argument error cases
● Files that cannot be opened
● Too many filenames
● Zero length file

============================
tokens.h
============================


#ifndef TOKENS_H_
#define TOKENS_H_

#include <string>
#include <iostream>
using std::string;
using std::istream;
using std::ostream;

enum TokenType {
    // keywords
    PRINT,
    IF,
    THEN,
    TRUE,
    FALSE,

    // an identifier
    IDENT,

    // an integer and string constant
    ICONST,
    SCONST,

    // the operators, parens and semicolon
    PLUS,
    MINUS,
    STAR,
    SLASH,
    ASSIGN,
    EQ,
    NEQ,
    LT,
    LEQ,
    GT,
    GEQ,
    LOGICAND,
    LOGICOR,
    LPAREN,
    RPAREN,
    SC,

    // any error returns this token
    ERR,

    // when completed (EOF), return this token
    DONE
};

class Token {
    TokenType tt;
    string lexeme;
    int lnum;

public:
    Token() {
        tt = ERR;
        lnum = -1;
    }
    Token(TokenType tt, string lexeme, int line) {
        this->tt = tt;
        this->lexeme = lexeme;
        this->lnum = line;
    }

    bool operator==(const TokenType tt) const { return this->tt == tt; }
    bool operator!=(const TokenType tt) const { return this->tt != tt; }

    TokenType GetTokenType() const { return tt; }
    string GetLexeme() const { return lexeme; }
    int GetLinenum() const { return lnum; }
};

extern ostream& operator<<(ostream& out, const Token& tok);

extern Token getNextToken(istream *in, int *linenum);


#endif /* TOKENS_H_ */
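
The operator<< declared in this header is a natural place to produce the -v output described above. A possible implementation follows; the name table is my own and must stay in the same order as the enum.

#include "tokens.h"

static const char *tokenNames[] = {
    "PRINT", "IF", "THEN", "TRUE", "FALSE",
    "IDENT", "ICONST", "SCONST",
    "PLUS", "MINUS", "STAR", "SLASH", "ASSIGN",
    "EQ", "NEQ", "LT", "LEQ", "GT", "GEQ",
    "LOGICAND", "LOGICOR", "LPAREN", "RPAREN", "SC",
    "ERR", "DONE"
};

ostream& operator<<(ostream& out, const Token& tok)
{
    TokenType tt = tok.GetTokenType();
    out << tokenNames[tt];
    // IDENT, ICONST, and SCONST also show their lexeme in parens
    if (tt == IDENT || tt == ICONST || tt == SCONST)
        out << " (" << tok.GetLexeme() << ")";
    return out;
}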
A lexer splits its input into tokens, each with an associated lexeme. For example, the sentence "Hello, World!", when passed to an imaginary lexer, might be converted into the sequence of four lexemes "Hello", ",", "World", "!".

You're supposed to take some input from an arbitrary std::istream (file or stdin) and lex (convert) it into tokens.
Remember that a TOKEN and a LEXEME are two different things

 • A token is (usually) an enum or other constant value.
   It is used to classify the kind of data in the lexeme string

 • A lexeme is a string containing the specific text for a class of token.

As an example, let's lex (tokenize) the following example program:

x = 12
y = 3
if x < y
  then print "Never!"
  else print "The Universe is True"

The entire token,lexeme stream for that should be (adding blank lines for readability):

token enum	lexeme string         
IDENT		x
EQ		=
ICONST		12

IDENT		y
EQ		=
ICONST		3

IF		if
IDENT		x
LT		<
IDENT		y

THEN		then
PRINT		print
SCONST		"Never!"

ELSE		else
PRINT		print
SCONST		"The Universe is True"

DONE	

There are two things to notice:
 • The DONE token has an empty string as its associated lexeme.
 • The SCONST token’s lexeme string appears exactly as it did in the example program source code,
    including the double quotes. You don’t have to do that; you could convert the lexeme into the
    final string if you wish (replacing things like \" and \n with their actual character values),
    as sketched just after this list.
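
If you do decide to store the processed string, a tiny helper like this would do it. This is hypothetical; it only knows about backslash escapes and assumes the raw lexeme still has its surrounding quotes.

std::string unescape( const std::string& raw )
{
    std::string out;
    // walk between the surrounding double quotes
    for (size_t i = 1; i + 1 < raw.size(); ++i) {
        if (raw[i] == '\\' && i + 2 < raw.size()) {
            ++i;                                      // skip the backslash
            out += (raw[i] == 'n') ? '\n' : raw[i];   // \n -> newline, \" -> ", \\ -> backslash
        }
        else
            out += raw[i];
    }
    return out;
}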

The purpose of your assignment is to take some input text (like the example program above) and use getNextToken() to extract a token/lexeme pair one at a time, until you get the DONE token (or an ERR token).
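
To show the general shape of such a function, here is a stripped-down sketch of a possible getNextToken. It only knows about whitespace, line counting, identifiers, integers, and end of input; everything else comes back as ERR. Keywords, operators, strings, and comments would follow the same pattern. This is not your assignment's required behavior, just an illustration built on the tokens.h posted above.

#include "tokens.h"
#include <cctype>

Token getNextToken(istream *in, int *linenum)
{
    char ch;

    // skip whitespace, counting newlines so the line number stays right
    while (in->get(ch)) {
        if (ch == '\n')
            (*linenum)++;
        else if (!isspace((unsigned char)ch))
            break;
    }

    if (!in->good())                           // nothing left: end of input
        return Token(DONE, "", *linenum);

    string lexeme(1, ch);

    if (isalpha((unsigned char)ch)) {          // identifier (check for keywords here too)
        while (in->get(ch) && isalnum((unsigned char)ch))
            lexeme += ch;
        if (in->good()) in->putback(ch);       // give back the lookahead character
        return Token(IDENT, lexeme, *linenum);
    }

    if (isdigit((unsigned char)ch)) {          // integer constant
        while (in->get(ch) && isdigit((unsigned char)ch))
            lexeme += ch;
        if (in->good()) in->putback(ch);
        return Token(ICONST, lexeme, *linenum);
    }

    return Token(ERR, lexeme, *linenum);       // anything unrecognized is an error
}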

Your assignment also requires that you save this information in some way (the -allids option needs every identifier). I would use a std::vector to keep a list of token/lexeme pairs.

 
  #include <vector>   // token_list needs this

  typedef std::vector<Token> token_list;


Then if you are asked (via command-line argument) to produce a list of all identifiers, you can do that with a simple filter (extract IDENT items) and sort (by lexeme).

// needs: #include <algorithm>, <iostream>, <iterator>, <vector>
void print_identifiers( const token_list& tokens )
{
  token_list identifiers;

  // keep only the IDENT tokens
  std::copy_if(
    tokens.begin(), tokens.end(),
    std::back_inserter( identifiers ),
    []( const Token& token ) { return token.GetTokenType() == IDENT; }
  );

  // sort them alphabetically by lexeme
  std::sort(
    identifiers.begin(), identifiers.end(),
    []( const Token& a, const Token& b ) { return a.GetLexeme() < b.GetLexeme(); }
  );

  std::cout << "All identifiers:\n";
  for (const auto& p : identifiers)
    std::cout << "  " << p.GetLexeme() << "\n";
}


Hope this helps.

[edit] *Disclaimer: I do not know exactly what your language is supposed to look like. For example, I don't know if you are supposed to have parentheses around the expression following an IF token. I also don't know if double-quotes are correct to delineate a string.

Do note, however, that you should not need to determine the validity of the program at this time, only the validity of the input stream. There is a difference. But again, I do not know the full details of your assignment.