helios (10126)
Today's Simple Interpreter thread inspired me to write one. Since this is the first time I've done something like this (well, actually, I had written an expression parser with lookahead before, but it was rather a mess, so I ended up scrapping it), I would like some feedback on how I'm doing, particularly whether my design will make things more difficult later on when I start adding more operators. I intend to keep only one Token at a time (not counting the stack) in the future, so don't bother pointing out the std::deque.

tition (712)
What was your motivation for choosing this type of evaluation instead of the obvious recursion? What I mean is, why didn't you do something along the lines of a plain recursive evaluator?
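Something like this, just as a sketch (the function names are made up for illustration, not taken from your code): one function per precedence level, each calling the next level down and recursing back to the top for parentheses.

#include <cctype>
#include <string>

int parse_expr(const std::string &s, size_t &pos);      // '+' and '-'

int parse_factor(const std::string &s, size_t &pos)     // number or ( expr )
{
    if (pos < s.size() && s[pos] == '(') {
        ++pos;                                           // consume '('
        int value = parse_expr(s, pos);
        ++pos;                                           // consume ')'
        return value;
    }
    int value = 0;
    while (pos < s.size() && std::isdigit((unsigned char)s[pos]))
        value = value * 10 + (s[pos++] - '0');
    return value;
}

int parse_term(const std::string &s, size_t &pos)        // '*' and '/'
{
    int value = parse_factor(s, pos);
    while (pos < s.size() && (s[pos] == '*' || s[pos] == '/')) {
        char op = s[pos++];
        int rhs = parse_factor(s, pos);
        value = (op == '*') ? value * rhs : value / rhs;
    }
    return value;
}

int parse_expr(const std::string &s, size_t &pos)
{
    int value = parse_term(s, pos);
    while (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
        char op = s[pos++];
        int rhs = parse_term(s, pos);
        value = (op == '+') ? value + rhs : value - rhs;
    }
    return value;
}

// size_t pos = 0; parse_expr("1+2*(3+4)", pos) evaluates to 15.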
Also, why do you use int for reduced instead of bool? Do you plan on assigning more than two values to reduced?

helios (10126)
I'm interested in the state machine aspect. Also, my last parser was recursive, so I thought I'd try something different this time. Plus, a recursive parser is limited by the size of the stack, while an iterative one isn't.

Line 5 (of my code): Sign inversion is a completely different operator from subtraction. Aside from the obvious, sign inversion is right-associative, while subtraction is left-associative. So in short, yes, I am forbidding it.

Lines 13-19: reduce_expr() assumes that the branches in the current node are already reduced. And I don't know where you're looking, but NonTerminal::reduced is a bool.

After my previous post, I read a bit of Bison's manual and realized I need either the stack or the tree. Since the stack is simpler to implement, that's what I'll use. Well, actually, I'll use a vector as a stack, since I'll have to take a look at elements other than the top one.
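Roughly what I mean about using the vector as a stack, as a sketch (the Symbol type and the hard-coded rule are placeholders for illustration, not my actual code):

#include <vector>

enum SymbolType { INTEGER, EXPR, PLUS };            // placeholder token/non-terminal types
struct Symbol { SymbolType type; long value; };     // placeholder, not the real class

// Reduce "expr PLUS expr" at the top of the stack into a single expr.
// Hard-coded here purely for illustration; the real version is rule-driven.
bool try_reduce_plus(std::vector<Symbol> &stack)
{
    if (stack.size() < 3)
        return false;
    const Symbol &lhs = stack[stack.size() - 3];    // peeking below the top
    const Symbol &op  = stack[stack.size() - 2];
    const Symbol &rhs = stack[stack.size() - 1];
    if (lhs.type != EXPR || op.type != PLUS || rhs.type != EXPR)
        return false;
    Symbol reduced = { EXPR, lhs.value + rhs.value };
    stack.resize(stack.size() - 3);                 // pop the three matched symbols
    stack.push_back(reduced);                       // push the new non-terminal
    return true;
}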

tition (712)
Hehe, you inspired me to try writing my own parser as well. I might need it one day too! Will post it as soon as I am done (it is now 60% ready). Cheers!

helios (10126)
There we go.
I can implement both precedence and associativity, although I don't know how lack of associativity works. It doesn't run on a state machine; instead of adding more states, I add more rules. For example, an exponentiation can only be reduced if the next token is not ^. It's still technically an LALR parser, though, since it looks ahead and runs left to right (or at least I think it is). EDIT: By the way, initializeRules() looks good with a 4-column tab.
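As a sketch of how the ^ rule works through the lookahead (placeholder names, not my actual rule table):

#include <vector>

enum TokenType { INTEGER, EXPR, PLUS, MUL, POW, END };
struct Symbol { TokenType type; double value; };

// "expr POW expr" may only be reduced when the lookahead is NOT another POW,
// so in 2^3^2 the rightmost ^ is reduced first, giving 2^(3^2).
bool can_reduce_pow(const std::vector<Symbol> &stack, TokenType lookahead)
{
    return stack.size() >= 3
        && stack[stack.size() - 3].type == EXPR
        && stack[stack.size() - 2].type == POW
        && stack[stack.size() - 1].type == EXPR
        && lookahead != POW;
}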

tition (712)
Here is my version :) Let us exchange ideas :) I will be a bit delayed with replies, though, because Real Life Work is calling me :(((( [Edit: Fixed mistakes. This is the 3rd version already.] I haven't really tested most of it (except for the expression I put in there), but unless I messed it up it should support brackets, +, -, /, *. It chops the expression non-recursively, but computes recursively because I was too lazy to introduce a total order on the generated tree :(

helios (10126)
Wait, you actually use assert() for error handling? I suppose you don't know that compiling for release disables all assert()s. EDIT: Oh, by the way: the containers in the standard library are not designed to be used as base classes.
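Just to be clear about what I mean, a minimal example (not from your code); build it with NDEBUG defined, as release configurations typically do, and the check simply vanishes:

#include <cassert>
#include <cstdio>

int divide(int a, int b)
{
    assert(b != 0 && "division by zero");   // compiled out when NDEBUG is defined
    return a / b;
}

int main()
{
    std::printf("%d\n", divide(10, 2));
    std::printf("%d\n", divide(10, 0));     // aborts in a debug build, undefined behaviour in release
    return 0;
}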

tition (712)
Fixed the mistakes I know about. You supposed wrong; I know release disables assert. That is why I left the functionality for real error handling out: bool ExpressionLeaf::SplitInTwo(int operationIndex) for the time being only returns true. Since it is the memory allocation unit, it is where errors should be raised, by returning false. However, I did not program anything to handle errors yet, so it is better left with assert and used with a Debug build. I didn't know that about the standard library containers...

tition (712)
So can you explain more about your concept? As far as mine goes, it is the following:

0. I chose the generate-a-tree approach. However, I store the ExpressionLeaf* of my tree in a vector, so it is a "hybrid" approach.

1. I keep the original expression's string in memory. All other references to it are made by providing a starting index (int leftIndex) and an ending index (int rightIndex).

2. I have a simple routine which computes, for a given open bracket, its counterpart closing bracket: int Expression::FindIndexClosingBracket(int IndexOpeningBracket).

3. I generate the tree by setting simple rules for splitting an expression.

***Parsing***
3.0 Start.
3.1 If an expression starts with an open bracket whose closing bracket is the expression's end, I "remove" the brackets (shift leftIndex and rightIndex) and go back one step; else I proceed.
3.2 If an expression doesn't fall into the category described in 3.1, it is obvious that either 1) it is an atomic expression (can't do anything with it - say, a constant) or 2) there must be an operation token some place that is not enclosed by brackets. The loop after //label: find operation not enclosed by brackets finds that operation token if it exists.
3.2.1 Important note. When finding intermediate operation tokens, one must be careful about the order of precedence of the operators. For example, in a*b+c, a valid split of the expression is add(mult(a,b),c), which means that in step 3.2 we are allowed to pick only the '+' token. That is what all the int Tokens::AssociativityWeakness(char x, bool isUnary) jazz is about.
3.3 a) If an intermediate operation token is found, we split the expression with bool ExpressionLeaf::SplitInTwo(int operationIndex). The newly created expressions (ExpressionLeaf) are recorded in our global Expression.
3.3 b) If no intermediate operation token is found (the "atomic expression" case), we "mark" the expression as reduced by returning true.

4. We execute step 3 on all non-atomic expressions. Note that the function return values are set so that one doesn't actually have to keep a bool isReduced member of ExpressionLeaf.

So that was the parser. Once you have the tree structure, evaluating it recursively is a piece of cake (int ExpressionLeaf::ComputeRecursively()).
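A rough sketch of steps 2 and 3.2, boiled down to plain string indices instead of my actual classes (the helper names and the hard-coded precedence ranks below are made up for illustration):

#include <string>

// Step 2: given the index of a '(', return the index of its matching ')'.
int find_closing_bracket(const std::string &expr, int open)
{
    int depth = 0;
    for (int i = open; i < (int)expr.size(); ++i) {
        if (expr[i] == '(') ++depth;
        else if (expr[i] == ')' && --depth == 0) return i;
    }
    return -1;                     // unbalanced expression
}

// Step 3.2: find the weakest-binding operator at bracket depth 0 in [left, right],
// so that a*b+c is split at '+', not at '*'. Returns -1 for an atomic expression.
int find_split_operator(const std::string &expr, int left, int right)
{
    int depth = 0, best = -1, best_rank = 1000;
    for (int i = left; i <= right; ++i) {
        char c = expr[i];
        if (c == '(') { ++depth; continue; }
        if (c == ')') { --depth; continue; }
        if (depth != 0) continue;
        int rank = (c == '+' || c == '-') ? 1
                 : (c == '*' || c == '/') ? 2
                 : 1000;           // not an operator
        if (rank != 1000 && rank <= best_rank) {   // prefer the rightmost weakest binder
            best = i;
            best_rank = rank;
        }
    }
    return best;
}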

helios (10126)
Well, it's very simple, really. The main components are the rule list and the reduce() function. Each rule in the rule list specifies what the top of the stack will be reduced to if it matches a list of constraints and the lookahead token matches a type. A rule may also specify that the lookahead token can be of any type, and that the stack should not be reduced if it matches that rule. For example, one of the rules (rule 10) says that if the top of the stack contains an expression, a +, and another expression, and the lookahead token is a *, then the stack should not be reduced. On the other hand, another rule (rule 6) says that a stack containing expr '*' expr should be reduced to an expr regardless of what the lookahead token is.

The reduce function finds which rule to use to reduce the stack by looking for the biggest rule that matches the top of the stack and the lookahead token. If there are two rules (i.e. both rules have constraints of the same size) that meet this condition, it will choose the first one it finds, so the order of the rules is crucial. If it doesn't find a match, or the match specifies that the stack should not be reduced, reduce() quits. Otherwise, the rule is executed: the stack is popped and then pushed back with the new non-terminal. run_state() (actually, I should rename the function) is pretty self-explanatory.

Example: 2^2^2+1+2*3
INTEGER POW INTEGER POW INTEGER PLUS INTEGER PLUS INTEGER MUL INTEGER

Stack: <empty>
Can't reduce further
Shift INTEGER
Stack: INTEGER
Reduce with (expr -> INTEGER): {expr|INTEGER}
Can't reduce further
Shift POW
Stack: expr POW
Can't reduce further
Shift INTEGER
Stack: expr POW INTEGER
Reduce with (expr -> INTEGER): expr POW {expr|INTEGER}
Can't reduce with (expr -> expr POW expr) because lookahead is POW
Can't reduce further
Shift POW
Stack: expr POW expr POW
Can't reduce further
Shift INTEGER
Stack: expr POW expr POW INTEGER
Reduce with (expr -> INTEGER): expr POW expr POW {expr|INTEGER}
Can reduce with (expr -> expr POW expr) because lookahead is not POW: expr POW {expr|expr POW expr}
Can reduce with (expr -> expr POW expr) because lookahead is not POW: {expr|expr POW expr}
Can't reduce further
Shift PLUS
Stack: expr PLUS
Can't reduce further
Shift INTEGER
Stack: expr PLUS INTEGER
Reduce with (expr -> INTEGER): expr PLUS {expr|INTEGER}
Can't reduce with (expr -> expr PLUS expr) because lookahead is MUL
Can't reduce further
Shift MUL
Stack: expr PLUS expr MUL
Can't reduce further
Shift INTEGER
Stack: expr PLUS expr MUL INTEGER
Reduce with (expr -> INTEGER): expr PLUS expr MUL {expr|INTEGER}
Reduce with (expr -> expr MUL expr): expr PLUS {expr|expr MUL expr}
Can reduce with (expr -> expr PLUS expr) because lookahead is not MUL: {expr|expr PLUS expr}
Can't reduce further
Lookahead is END
The stack length is exactly 1, so there were no errors.

My approach is simplistic, which is a plus, but unlike a state machine, it can only detect that some error has occurred, not where it occurred, because the error detection is a single check at the end of execution. However, it's good enough to generate at least a simple parser from Yacc-esque rules. Right now I'm working on how to do that. One of the problems I need to solve is: how do I know (expr '+' expr) and (expr '*' expr) are in conflict when there is no extra precedence information?
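If it helps, here is the whole idea boiled down to a self-contained toy that handles only + and *, with the two rules hard-coded instead of read from the rule list (all names made up for illustration):

#include <cstdio>
#include <vector>

enum Type { INTEGER, PLUS, MUL, EXPR, END };
struct Symbol { Type type; long value; };

// Try to reduce the top of the stack once; return false if nothing was reduced.
static bool reduce_once(std::vector<Symbol> &s, Type lookahead)
{
    size_t n = s.size();
    if (n >= 1 && s[n - 1].type == INTEGER) {             // expr -> INTEGER
        s[n - 1].type = EXPR;
        return true;
    }
    if (n >= 3 && s[n - 3].type == EXPR && s[n - 1].type == EXPR) {
        Type op = s[n - 2].type;
        if (op == MUL) {                                   // expr -> expr MUL expr, any lookahead
            Symbol r = { EXPR, s[n - 3].value * s[n - 1].value };
            s.resize(n - 3); s.push_back(r);
            return true;
        }
        if (op == PLUS && lookahead != MUL) {              // don't reduce past a stronger operator
            Symbol r = { EXPR, s[n - 3].value + s[n - 1].value };
            s.resize(n - 3); s.push_back(r);
            return true;
        }
    }
    return false;
}

int main()
{
    // 1+2*3 as a token stream
    Symbol input[] = { {INTEGER,1}, {PLUS,0}, {INTEGER,2}, {MUL,0}, {INTEGER,3}, {END,0} };
    std::vector<Symbol> stack;
    for (Symbol tok : input) {
        while (reduce_once(stack, tok.type)) {}            // reduce as far as possible
        if (tok.type != END) stack.push_back(tok);         // then shift
    }
    if (stack.size() == 1 && stack[0].type == EXPR)        // single end-of-input check
        std::printf("%ld\n", stack[0].value);              // prints 7
    else
        std::printf("syntax error\n");
    return 0;
}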

tition (712)
Can you explain the format of your Rules (with words if possible, because I got really lost in the syntax :( )? There is a bug with the POW token: 1+2^2 = 9.

computerquip (1892)
Wow. I tried learning the concept of LALR parsers and my brain hurts now. I'll save this for another day lol. It seems to be one of those things that take a bit to digest.

helios (10126)
The first parameter to Rule::Rule() is what the rule reduces to. If Token::null is passed, the parser will not try to reduce; this is used to enforce precedence and associativity. The second parameter is the constraint on the lookahead token. If Token::null is passed, there's no constraint. The third parameter is the number of variadic parameters to follow. From then on come the constraints that will be applied to the stack. And yeah, you're right. I forgot to add some more rules for PLUS, MINUS, DIV, and MUL: a OP b^c currently evaluates as (a OP b)^c for any OP other than ^, because ^ accidentally ended up with the lowest effective precedence. Luckily, fixing this is just a matter of copy-pasting and slightly editing a few lines.
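Schematically, it looks like this (the Token and Rule declarations below are just stand-ins so the example is self-contained; the rules are paraphrased from memory, so the actual rule list may differ):

// Stand-in declarations for illustration only; the real Token and Rule classes
// live in the parser code and may differ in detail.
struct Token { enum Type { null, expr, INTEGER, PLUS, MUL, POW, LPAREN, RPAREN }; };
struct Rule  { Rule(Token::Type reducesTo, Token::Type lookahead, int count, ...) {} };

// Rule 6: expr '*' expr reduces to expr regardless of the lookahead.
Rule r6  (Token::expr, Token::null, 3, Token::expr,   Token::MUL,  Token::expr);

// Rule 10: expr '+' expr must NOT be reduced when the lookahead is '*'
// (Token::null as the first parameter means "don't reduce").
Rule r10 (Token::null, Token::MUL,  3, Token::expr,   Token::PLUS, Token::expr);

// Right-associativity of '^': expr '^' expr must NOT be reduced when the
// lookahead is another '^'.
Rule rp  (Token::null, Token::POW,  3, Token::expr,   Token::POW,  Token::expr);

// Parenthesized expression: LPAREN expr RPAREN reduces to expr.
Rule r1  (Token::expr, Token::null, 3, Token::LPAREN, Token::expr, Token::RPAREN);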

tition (712)
aha... I think I finally got it:

/*rule 1*/ Rule(Token::expr, Token::null, 3, Token::LPAREN, Token::expr, Token::RPAREN)

means: if the last three tokens are [LPAREN, expr, RPAREN] (in this order), and if the lookahead is null = arbitrary, or there is no lookahead token, then substitute [LPAREN, expr, RPAREN] with expr. Schematically:

LPAREN (4th param), expr (5th param), RPAREN (6th param) --> expr (1st param)

The 2nd param specifies what the lookahead token must be. The 3rd param is a technicality. Great idea! And it is very fast too! Cheers!

mcleano (922)
tition, both your codes are way over my head! But your main is of type void; shouldn't it be int?

jbrooksuk (30)
Wow, that's a fair bit of code there. It's a bit complex for my liking, but I have a lot to learn from it.

computerquip (1892)
It's not the code itself that troubles me. It's simply the concept of the LR or LALR parser. Wikipedia gives a mediocre example of how it works. Here's a decent tutorial that puts it in much better terms: http://www.devincook.com/goldparser/doc/about/lalr.htm That tutorial doesn't explain non-terminal symbols, so here: http://en.wikipedia.org/wiki/Terminal_symbol

helios (10126)
It sounds more complicated than it actually is. A good introduction to the subject is generating a parser with Yacc or Bison (that's how I learnt it, no more than a month ago). Bison's manual also helps a great deal in understanding how the parser works.

tition (712)
A suggestion to helios. It would be nice to be able to suggest possible mistakes to the user, and it completely makes sense to build the tree structure underlying the parsed expression for that (I have no clue how you would give error suggestions otherwise). That will be very easy to implement on top of your code: in the reduce function, besides evaluating the expressions, you can also build the underlying expression tree structure (from the bottom up).
This way, you will build the expression tree structure in a much nicer fashion than I do. This is in fact the main difference between my slow approach and yours - I build the tree "from the top down", and you build it "from the leaves up". Of course both approaches are correct, but mine is O(n^2) and yours is O(n) (where n is the size of the expression to be parsed), which will be quite a difference if n is 1000 :). It will be nice for me to try to merge a tree structure into your code. I was thinking first of applying your approach to my code, but I think you actually did the tougher part, so it would be quicker to just paste stuff into your code. If I find the time to do so I will post the result here (I will note the code I took from you, but will probably rename it to my tastes :). What I like about your parser is that you actually have no trouble parsing expressions such as "-1" (which is a "unary" operation, i.e. it takes only one argument) - you just add some extra rules (rules 2 and 3 in your code). In the same way you will have no trouble parsing functions with more than two arguments. Cheers!
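Something along these lines, as a very rough sketch of the bottom-up tree building inside reduce() (the Node type and the helper below are invented for illustration, not taken from helios's code):

// Hypothetical AST node built bottom-up during reduction.
struct Node {
    char  op;          // '+', '*', ... or 0 for a plain integer
    long  value;       // the evaluated value reduce() already computes
    Node *left, *right;
};

struct StackSymbol { int type; Node *node; };

// When a rule like (expr -> expr '+' expr) fires, instead of only folding the
// values you also link the two operands under a new node:
Node *make_binary_node(char op, Node *lhs, Node *rhs)
{
    Node *n  = new Node;
    n->op    = op;
    n->left  = lhs;
    n->right = rhs;
    n->value = (op == '+') ? lhs->value + rhs->value
                           : lhs->value * rhs->value;   // etc. for the other operators
    return n;
}

// The non-terminal pushed back onto the stack then carries the subtree, so when
// something goes wrong you can point at the offending subexpression:
// StackSymbol reduced = { EXPR, make_binary_node('+', lhs.node, rhs.node) };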
P.S. I don't know... *scratches head* Umm, why should it be int?

helios (10126)
While I agree that it's not hard to replace the stack with a tree (I originally did that in the opposite direction for the sake of simplicity and efficiency), I still very much doubt you'll manage to get more advanced error checking into my design without throwing away some of the generality, which is the point of using rules. The problem with reduce() is that it's unaware of what the rules do. It's either able to reduce or unable to reduce, and neither case necessarily means there's been an error. I'm currently trying to figure out how to generate a syntax table from a set of reduction rules (e.g. any of the rules that don't reduce to null in my code). The Wikipedia article on LR parsers has helped somewhat. I would prefer it if compiler theory used full names for terminals and non-terminals, rather than just single letters, though; it'd make things easier to follow. EDIT: void main() is non-standard.