Today's Simple Interpreter thread inspired me to write one.
Since this is the first time I do something like this (well, actually, I had written an expression parser with lookahead before, but it was rather a mess, so I ended up scrapping it), I would like some feedback on how I'm doing. Particularly whether my design will make things more difficult later on when I start adding more operators.
I intend to only keep one Token at a time (not counting the stack), in the future, so don't bother pointing out the std::deque.
void NonTerminal::reduce_expr()
{ if (!this->branches.size())
return;
if (this->branches.size()<3)
this->leaf=this->branches[0].leaf;//<-I don't get this one...
//is it possible that you are forbidding expressions such as "-1"?
//if you mean here that if you have less than two leaves, then you
//must have one leaf, then I would suggest putting one reassuring
//assert(this->branches.size()==1);
else
{ this->leaf.type=Token::INTEGER;
switch (this->branches[1].leaf.type)
{ case Token::PLUS://my main comment is here. Why not recursion?
//suggestion follows
if (!this->branches[0].reduced)
this->branches[0].reduce_expr();
if (!this->branches[2].reduced)
this->branches[2].reduce_expr();
//end of suggestion
this->leaf.intValue=this->branches[0].leaf.intValue+this->branches[2].leaf.intValue;
break;
case Token::MINUS://you can use recursion here too
this->leaf.intValue=this->branches[0].leaf.intValue-this->branches[2].leaf.intValue;
break;
default:;
}
}
this->reduced=1;
this->branches.clear();
}
Also, why do you use int for reduced instead of bool? Do you plan on assigning more than two values to reduced?
I'm interested in the state machine aspect. Also, my last parser was recursive, so I thought I'd try something different, this time. Plus, a recursive parser is limited by the size of the stack, while an iterative isn't.
Line 5: Sign inversion is a completely different operand from subtraction. Aside from the obvious, sign inversion is right-associative, while subtraction is left-associative. So in short, yes, I am forbidding it.
Lines 13-19: reduce_expr() assumes that the branches in the current node are already reduced.
And I don't know where you're looking, but NonTerminal::reduced is a bool.
After my previous post, I read a bit of Bison's manual and realized I need either the stack or the tree. Since the stack is simpler to implement, that's what I'll use. Well, actually, I'll use a vector as a stack, since I'll have to take a look at elements other than the top one.
I can implement both precedence and associativity. Although I don't know how lack of associativity works.
It doesn't run on a state machine. Instead of adding more states, I add more rules. For example, an exponentiation can only be reduced if the next token is not ^.
It's still technically a LALR, though, since is looks ahead and runs for left to right (or at least I think it is).
EDIT: By the way, initializeRules() looks good with a 4 columns tab.
Let us exchange ideas :) I will be a bit delayed with replies though cause Real Life Work is calling me :((((
[Edit: Fixed mistakes. This is 3rd version already]
I haven't really tested most of it (except for the expression i put in there), but unless I messed it up it should support brackets, +,-,/,*. It chops the expression non-recursively, but computes recursively cause I was too lazy to introduce a total order to the generated tree :(
You supposed wrong, I know release disables assert. That is why I left functionality out for real error handling: bool ExpressionLeaf::SplitInTwo(int operationIndex) for the time being returns only true. Since it is the memory allocation unit it is where errors should be raised, by returning false. However, I did not program anything to handle errors yet, so it better be left with assert and used with Debug compiling.
0. I chose the generate-tree approach. However, I store the ExpressionLeaf* of my tree in a vector, so it is a "hybrid" approach.
1. I keep the original expression's string in memory. All other references to it are made by providing starting index (int leftIndex) and ending index (int rightIndex).
2. I realize a simple routine which computes for a given open bracket its counterpart closing bracket int Expression::FindIndexClosingBracket(int IndexOpeningBracket)
3. I generate the tree by setting simple rules for splitting an expression.
***Parsing***
3.0 Start.
3.1 If an expression starts with an open bracket whose closing bracket is the expression's end, I "remove" the brackets(shift leftIndex and rightIndex) and go back one step; else I proceed.
3.2 If an expression doesn't fall in the category described in 3.1, it is obvious that either 1) it is an atomic expression (can't do anything with it - say, a constant) or 2) there must be an operation token some place that is not enclosed by brackets. The cycle after //label: find operation not enclosed by brackets finds that operation token if it exists.
3.2.1 Important note. When finding intermediate operation tokens, one must be careful for the order of precedence of operators. For example, in a*b+c, a valid split of the expression is
add(mult(a,b),c), which means that the in step 3.2 we are allowed to pick only the '+' token. That is what all theint Tokens::AssociativityWeakness(char x, bool isUnary) jazz is all about.
3.3 a) If an intermediate operation token is found, we split the expression with bool ExpressionLeaf::SplitInTwo(int operationIndex). The newly created expressions (ExpressionLeaf) are recorded in our global Expression.
3.3 b) If no intermediate operation token is found ("atomic expression case") we "mark" the expression as reduced by returning with a true.
4. We execute step 3 to all non-atomic expressions. Note that the function return values are set so that one doesn't actually have to keep a bool isReduced member of ExpressionLeaf.
***
So that was the parser. Once you have the tree structure, evaluating it recursively is a piece of cake. (int ExpressionLeaf::ComputeRecursively())
Well, it's very simple, really.
The main components are the rule list and the reduce() function.
Each rule in the rule list specifies what will the top of the stack be reduced to if it matches a list of constraints and the lookahead token matches a type. The rule may also specify that the lookahead token can be of any type and that the stack should not be reduced if it matches that rule.
For example, one of the rules (rule 10) says that if the top of the stack contains an expression, a +, and another expression and the lookahead token is a *, then the stack should not be reduced. On the other hand, another rule (rule 6) says that a stack containing expr '*' expr should be reduced to an expr regardless of what the lookahead token is.
The reduce function finds which rule to use to reduce the stack by looking for [the biggest rule that matches the top of the stack and the lookahead token]. If there are two rules (i.e. both rules have constraints of the same size) that meet this condition, it will choose the first one it finds, so the order of the rules is crucial.
If it doesn't find a match or the match specifies that the stack should not be reduced, reduce quits. Otherwise, the rule is executed, the stack is popped and then pushed back with the new non-terminal.
run_state() (actually, I should rename the function) is pretty self-explanatory.
Example:
2^2^2+1+2*3
INTEGER POW INTEGER POW INTEGER PLUS INTEGER PLUS INTEGER MUL INTEGER
Stack: <empty>
Can't reduce further
Shift INTEGER
Stack: INTEGER
Reduce with (expr -> INTEGER): {expr|INTEGER}
Can't reduce further
Shift POW
Stack: expr POW
Can't reduce further
Shift INTEGER
Stack: expr POW INTEGER
Reduce with (expr -> INTEGER): expr POW {expr|INTEGER}
Can't reduce with (expr -> expr POW expr) because lookahead is POW
Can't reduce further
shift POW
Stack: expr POW expr POW
Can't reduce further
Shift INTEGER
Stack: expr POW expr POW INTEGER
Reduce with (expr -> INTEGER): expr POW expr POW {expr|INTEGER}
Can reduce with (expr -> expr POW expr) because lookahead is not POW: expr POW {expr|expr POW expr}
Can reduce with (expr -> expr POW expr) because lookahead is not POW: {expr|expr POW expr}
Can't reduce further
shift PLUS
Stack: expr PLUS
Can't reduce further
Shift INTEGER
Stack: expr PLUS INTEGER
Reduce with (expr -> INTEGER): expr PLUS {expr|INTEGER}
Can't reduce with (expr -> expr PLUS expr) because lookahead is MUL
Can't reduce further
shift MUL
Stack: expr PLUS expr MUL
Can't reduce further
shift INTEGER
Stack: expr PLUS expr MUL INTEGER
Reduce with (expr -> INTEGER): expr PLUS expr MUL {expr|INTEGER}
Reduce with (expr -> expr MUL expr): expr PLUS {expr|expr MUL expr}
Can reduce with (expr -> expr PLUS expr) because lookahead is not MUL: {expr|expr PLUS expr}
Can't reduce further
Look ahead is END
The stack length is exactly 1, so were no errors.
My approach is simplistic, which is a plus, but unlike a state machine, it can only detect that some error has occurred, not where it occurred, because the error detection is a single at the end of execution.
However, it's good enough to generate at least a simple parser from Yacc-esque rules. Right now I'm working on how to do that. One of the problems I need to solve is "how do I know (expr '+' expr) and (expr '*' expr) are in conflict when there is no extra precedence information?"
Wow. I tried learning the concept of LALR parsers and my brain hurts now. I'll save this for another day lol. It seems to be one of those things that take a bit to digest.
The first parameter to Rule::Rule() is what the rule reduces to. If Token::null is passed, the parser will not try to reduce. This is used to enforce precedence and associativity.
The second parameter is the constraint on the lookahead token. If Token::null is passed, there's no constraint.
The third parameter is the number of variadic parameters to follow.
From then on are the constraints that will be applied to the stack.
And yeah, you're right. I forgot to add some more rules to PLUS, MINUS, DIV, and MUL. a<anything other than ^>b^c==(a*b)^c, because I accidentally gave ^ the highest precedence. Luckily, fixing this is just a matter of copy-pasting and slightly editing a few lines.
aha... I think I finally got it: /*rule 1*/Rule(Token::expr, Token::null,3,Token::LPAREN, Token::expr, Token::RPAREN )
means:
if the last three tokens are [LPAREN, expr, RPAREN] (in this order) and if the lookahead is null = arbitrary or there is no lookahead token, then substitute [LPAREN, expr, RPAREN] with expr.
It's not the code itself that troubles me. It's simply the concept of the LR or LALR parser. Wikipedia gives a mediocre example on how it works. Here's a decent tutorial that puts it in much better terms:
It sounds more complicated than it actually is. A good introduction to the subject is generating a parser with Yacc or Bison (that's how I learnt it no more than a month ago).
Bison's manual also helps a great deal to understand how the parser works.
A suggestion to helios. It would be nice to be able to suggest to the user possible mistakes. So, it completely makes sense to build the tree structure underlying the parsed expression. (I have no clue how you would give error suggestions otherwise).
That will be very easy to implement on top of your code: in the reduce function, besides evaluating the expressions, you can also build the underlying expression tree structure (from the bottom up).
1 2 3 4 5 6 7 8 9
switch (rule_index){
/*....*/
case 6: //expr <- expr '*' expr
new_token.intValue=redex[0].intValue*redex[2].intValue;
//here add code to make a new node of a tree with left successor
//the left expression, right successor the right expression, and
//store operation token * in the new node.
/*etc.*/
}
This way, you will build the expression tree structure in a much nicer fashion than I do. This is in fact the main difference between my slow approach and yours - I build the tree "from the top down", and you build it "from the leaves up". Of course both approaches are correct, but mine is O(n^2) and yours is O(n) (where n is the size of the expression to be parsed), which will be quite a difference if n is 1000 :).
It will be nice for me to try to merge a tree structure in your code. I was thinking first of applying your approach to my code but I think you actually did the tougher part, so it would be quicker to just paste stuff to your code. If I find the time to do so I will post the result here (I will note the code I took from you, but will probably rename it to my tastes :).
What I like with your parser is that you actually have no trouble parsing expressions such as "-1" (which is an "unary" operation (i.e. takes only one argument)) - you just add some extra rules (rules 2 and 3 in your code). In the same way you will have no trouble parsing functions with more than two arguments.
Cheers!
P.S.
but your main is of type void, shouldn't it be int?
I don't know... *scratches head* Umm, why should it be int?
While I agree that it's not hard to replace the stack with a tree (I originally did that in the opposite direction for the sake of simplicity and efficiency), I still very much doubt you'll manage to get a more advance error checking into my design without throwing away some of the generality, which is the point of using rules. The problem with reduce() is that it's unaware of what the rules do. It's either able to reduce, or unable to reduce, and neither case necessarily mean there's been an error.
I'm currently trying to figure out how to generate a syntax table from a set of reduction rules (e.g. any of the rules that don't reduce to null in my code).
The Wikipedia article on LR parsers has helped somewhat. I would prefer if they used full names for terminals and non-terminals in compiler theory, rather than just single letters, though. It'd make things easier to follow.