Writing a LALR parser

Pages: 12
I don't think I made my point clear:

I was suggesting you keep your code exactly as it is, with only a miniature change: when you do the reductions, instead of carrying out *the actual* operations implied by the reductions, you can "virtually carry out the operations" by making a tree. It will still be O(n) algorithm! easy money!
Nope. I completely understood what you said.
Ok. Has this approach come to your mind? Create an error token:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
enum TokenType{
	LPARENError=129,
	null=0,
	END=1,
	INTEGER='0',
	PLUS='+',
	MINUS='-',
	MUL='*',
	DIV='/',
	POW='^',
	LPAREN='(',
	RPAREN=')',
	expr=128
} type;


And then you add a rule:
this->rules.push_back(Rule(Token::LPARENError, Token::null,3,Token::LPAREN,Token::LPAREN, Token::RPAREN));

[Edit]: I am not sure you can cover with finitely many such rules all possible mistakes; I thought of creating an infinite series of error examples for which no finite error rule set will work but I couldn't for the last 20 min. It might be possible to figure a finite set of "error catching" rules...

[Edit2]: In the value of the ErrorToken you can of course keep in some form the location where the error occured.
Last edited on
It hadn't occurred to me, I must admit, but a full error checking for a reasonably complete expression parser would create too many rules.

I stick to my opinion that my approach trades simplicity for accurate error checking in relation to a state machine. If you want accurate error checking, it's much faster to implement the parser as a state machine, which will do all the error checking as it executes the input.

EDIT: The location of the error is easily determined because reduce should stop reduce when it find that the stack has been reduced to an error (which will happen as soon as the guilty token comes), and then run_state() can check whether the top of the stack is an error.
Last edited on
Sorry about the double post, but I thought this warranted it.

A good source on these and other types of parsers is Compilers - Principles, Techniques, and Tools Second Edition (aka the purple dragon book), published by Addison Wesley.

Thank God for the Internet, because that book is impossible to get where I live. Computer Science is not a very popular career around these shores. Even if it was obtainable, it would be a translation to Spanish. If there's something I hate, it's translated jargon. Ugh. We have at least three different words for the "linked" in "linked list", and none of them are related to links in a chain.
Last edited on
This a bit off topic but I cant help it...

Compilers - Principles, Techniques, and Tools Second Edition, formula (11.1)
"The iterations of a d-deep-loop can be represented mathematically as
{i \in N^d | B i+b>= 0 }"

the number of the elements of this set is directly related to the program I that made me study C++ (I am pretty sure it is in fact in one-to-one correspondence).

btw. the book has it wrong the book it says Z in place of what I wrote as N. (Z denotes integers, N denotes non-negative integers). It is obviously wrong since if j is a solution to B j>= 0 and i is any element for which B i+b>= 0, then i+ (j times any positive integer) is obviously a solution, so that set would be infinite if it were Z instead of N... [Edit] unless there are no such j...

[Edit:] Whoops... the book is right! The above observation of mine simply means there is no such j for which B j>= 0 ...
Last edited on
helios: I finally got some time to really look at your program. Very niiiice! I have a couple of comments.

* You almost always use "this->" to refer to member variables. I notice that sometimes you use it for disambiguation, but mostly it's just an extra 6 characters (IMO), although it does highlight that they are member variables. Still, I'd rather not see this-> all over the place. There are other ways to highlight and disambiguate member variables, such as ending them with an underscore.

* You're missing va_end in Rule's ctor. Actually, you could forgo the variable parameter list and get proper type-checking by using defaults. (I've changed null to EMPTY so that it's in uppercase like the rest.)

1
2
3
4
5
6
7
    Rule( const Token& to, const Token& la,
          TokenType a, TokenType b = EMPTY, TokenType c = EMPTY )
      : reduces_to( to ), lookahead( la ){
        constraints.push_back( a );
        if( b != EMPTY ) constraints.push_back( b );
        if( c != EMPTY ) constraints.push_back( c );
    }

* I believe read() can be simplified (I've also changed it to handle doubles, which you may or may not want):

1
2
3
4
5
6
7
8
9
Token Parser::read(){
    char ch;
    if( !(stream >> ch) ) return END;
    if( ch != '.' && !isdigit( ch )) return ch;
    stream.putback( ch );
    double d;
    stream >> d;
    return d; // Requires a Token ctor that takes a double
}

* In reduce(), if the "impossible" cases are actually impossible then you should get rid of them and add a default case with an error message.

* Parser has detailed knowledge of Rule, Token and TokenType. They probably all belong inside Parser. In particular, if TokenType was outside of Token, it would save a lot typing (and eyestrain). Changing this and getting rid of the variable argument list really improves the look of initializeRules:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void Parser::initializeRules(){
    rules.clear();
    rules.push_back( Rule( EXPR,  EMPTY, VALUE                 )); // 0
    rules.push_back( Rule( EXPR,  EMPTY, LPAREN, EXPR,  RPAREN )); // 1
    rules.push_back( Rule( EXPR,  EMPTY, PLUS,   EXPR          )); // 2
    rules.push_back( Rule( EXPR,  EMPTY, MINUS,  EXPR          )); // 3
    rules.push_back( Rule( EMPTY, POW,   EXPR,   POW,   EXPR   )); // 4
    rules.push_back( Rule( EXPR,  EMPTY, EXPR,   POW,   EXPR   )); // 5
    rules.push_back( Rule( EXPR,  EMPTY, EXPR,   MUL,   EXPR   )); // 6
    rules.push_back( Rule( EXPR,  EMPTY, EXPR,   DIV,   EXPR   )); // 7
    rules.push_back( Rule( EMPTY, MUL,   EXPR,   PLUS,  EXPR   )); // 8
    rules.push_back( Rule( EMPTY, DIV,   EXPR,   PLUS,  EXPR   )); // 9
    rules.push_back( Rule( EXPR,  EMPTY, EXPR,   PLUS,  EXPR   )); // 10
    rules.push_back( Rule( EMPTY, MUL,   EXPR,   MINUS, EXPR   )); // 11
    rules.push_back( Rule( EMPTY, DIV,   EXPR,   MINUS, EXPR   )); // 12
    rules.push_back( Rule( EXPR,  EMPTY, EXPR,   MINUS, EXPR   )); // 13
}

* The precedence of POW seems to be a little off. Try this, which should give 2 but instead gives 4 (although I may have screwed something up in my current version):

 
    Parser evaluator( "1 + 1 ^ 2" );


* It would be nice if you didn't use tabs (the devil's spacing). And I'd like to see more horizontal spaces as I find it somewhat difficult to read. But then again, I probably use too much space! And the forum should set it's tab size to 4 !!!

Anyway, thanks for posting this. I really enjoyed reading it.
* You almost always use "this->" to refer to member variables. I notice that sometimes you use it for disambiguation, but mostly it's just an extra 6 characters (IMO), although it does highlight that they are member variables. Still, I'd rather not see this-> all over the place. There are other ways to highlight and disambiguate member variables, such as ending them with an underscore.
I have gone through the experience of looking at a call and not knowing whether the function was local or global, nor where its parameters were declared. I always use the unambiguous this->.

(Changing variadic parameters to default parameters.)
Yeah, I hadn't thought of that. *Eyeroll*
Now tell me what you'd do if you had to parse the trinary operator. What about a nonary operator?

* In reduce(), if the "impossible" cases are actually impossible then you should get rid of them and add a default case with an error message.
Inline documentation.

* The precedence of POW seems to be a little off. Try this, which should give 2 but instead gives 4 (although I may have screwed something up in my current version):
As is stated earlier, I forgot to add more rules for MUL and DIV, and accidentally gave POW the lowest precedence.


I was hoping I'd have something interesting to read, not this.
I mean, seriously. Criticizing the style? Criticizing the indentation character?

I'll give you some advice which will probably help you next time you're reading someone else's code: programmers are not stupid (they just happen to be holding the idiot ball, sometimes), if they use some style, it's probably for a reason. When you tell one "why do you do this instead of that, which is what I do?", you're doing something not too different from calling them stupid. I know. I've done it to some of my friends with that purpose in mind.
If you have nothing to say about the code at a high level, then don't say anything But whatever you do, don't start complaining about the coding style and definitely don't complain about the indentation style if you don't want to start a holy war. Yours is also an unreadable mess, from their perspective.
Last edited on
Unbelievable! I was the only person to take the time to go over your code in detail and you don't even thank me. Instead it's just "Waaaahh!! You criticized my spacing!!!!" You will never learn anything with that attitude.
want to start a holy war
about the coding style


The only valid reason for a holy war that I have ever heard :))
Well, how am I supposed to react? I asked for someone (in particular, someone who's had some experience with parsers) to take a look at the design in my first piece of code. Whether it would make it harder to add more operands and that sort of thing.
Things such as what character is used to indent, where the classes are defined, or whether this-> is used, are completely irrelevant to the subject of writing a LALR parser.

When talking about design on a high level, the rule of thumb is: if you can't make the same comment if the code was written in a completely different language, then it's irrelevant.

You will never learn anything with that attitude.
Well, I've come this far while staying the same, so I think it's safe to say I can keep going.
closed account (S6k9GNh0)
I know I probably don't belong to this topic anymore, but I noticed helios's this-> frenzy as well. Just wanted to say that the most common way to fix this is use a prefix of m such as mMyInt to note that it's a member of a class. and then use the g prefix to note that it's global. __ and capital is used for constants. And a couple others I can't currently remember.
Last edited on
closed account (z05DSL3A)
As helios said, discussion on style is not what was asked for.
Why are people complaining about the style? I mean...seriously...it's not like he has code like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Token Parser::read()
{
char ch
;
    if( !(stream >> ch) ) return 
END
;
    if( ch != '.' && !isdigit( ch )) return 
ch
;
    stream.
putback( ch )
;
    double 

d
;
                         stream 
>> d
;
    return 

d
; // Requires a Token ctor that takes a double
}


And for the record, I use this-> on all my member variables. Mainly for auto completion, but also because I like not be ambiguous.
the weird thing is, firedraco's code actually ain't that unreadable...
You have [all] fallen to the discussion of the semantics of the word 'unreadable', and by doing so you [all] strayed from the topic of parsers.
Topic archived. No new replies allowed.
Pages: 12