General parser - Error handling

Hello everybody,

I'm writing my own general parser. I know many will ask me why? The reason is simply that I like to code my own libraries, and that I learn a lot by doing that.

That being said, my parser works. I tried it with several non trivial grammars, and it works like a charm. Good! My problem is when there is an error in the input source file: The parser warns the user about it, but it is unable to locate the line and column number of the offending instruction or character. I just don't know how to handle errors, because it is not as trivial as it might sound.

The difficulty is that this is a general parser where it is the user who defines the production rules. So the grammar is "not known by my library" at the time I am coding it. It is only when the user includes the library in his code that he sends the production rules to the parser. So I must find a way to handle errors that is totally independent of the grammar.

For example, suppose I had to write a specific parser for the simple grammar

E -> E+T | E-T | T
T -> T*F | T/F | F
F -> N | (E)
N: real number

and suppose the production rule E+T is tried and a plus sign is found. In this case we know that the E+T rule must necessarily be true for the input source to be correct. So, if trying E returns true but T returns false, then we know that there is a mistake in the input and the parser can locate the error as being between the tokens at the beginning and the end of the T that was tried.

But I cannot make it a rule for error handling, because if the user decided to improve his grammar and include the unary operator + and -, like

E -> E+T | E-T | T
T -> T*F | T/F | F
F -> N | +E | -E |(E)
N: real number

then finding a plus sign but failing to evaluate E+T is no longer necessarily an error, because the production rule +E could succeed.

So, without knowing the grammar at coding time, how can the parser determine which part of the source is likely to contain the error?

Val
Last edited on
No idea? :(
there is probably a more efficient way but if you know an error hit between some starting point and 'here' then you can crawl back over that block looking for the error with some kind of logic based off what type of error it was.
there is probably a more efficient way but if you know an error hit between some starting point and 'here' then you can crawl back over that block looking for the error with some kind of logic based off what type of error it was.


That's indeed the beginning of an idea. I'll think about it. Thank you!
To start, you need to store the line and position number with the current token. If it fails to parse then you can complain and print the offending line with the token marked. This assumes that the you only need to look ahead by 1 token to parse the language (LL1? it's been a long time...)
To start, you need to store the line and position number with the current token. If it fails to parse then you can complain and print the offending line with the token marked. This assumes that the you only need to look ahead by 1 token to parse the language (LL1? it's been a long time...)


Yes, I'm already storing the line and column numbers of each token. It is a quite general parser, so it accepts the following types of rules ('Ter' stands for terminal elements and 'Non' for non-terminal elements):
1
2
3
4
5
6
7
8
Non --> Ter
Non --> Non               
Non --> Ter Ter
Non --> Ter Non
Non --> Non Ter
Non --> Non Non
Non --> Ter Non Ter
Non --> Non Ter Non

The difficulty is that a failing rule does not imply a mistake. So I cannot just stop as soon as a rule fails and report an error. For example, consider the following grammar:
1
2
3
4
Program     --> Instruction Program | Instruction
Instruction --> Expression ;
Expression  --> Expression + Expression | Number
Number: real number

With this grammar, the parser recognizes all programs that consist of expressions separated by semicolons, where an expression is either a real number or a sum of real number. Suppose that I give it the following input that contains an error on the fourth line:
1
2
3
4
2+4;
4+7;
8;
7-;

We could naively claim that for the program to be correct, all instructions must be correct. Thus, as soon as an 'Instruction' rule fails, I can report the error. But it is not as simple, because in order to start with the first rule 'Program', the parser has first find an 'Instruction', and to do this, it has to to try the 'Instruction' rule from the very first token while increasing the current token until it finds a match (that is, the current token falls on the first semicolon). This means that the "Instruction" rule will fail several times, even though there is no mistake on the first instruction. So I cannot report the error at that time.

Of course, if the grammar was known at coding time, I could solve the problem by searching first for a semicolon, and then applying the "Instruction" rule.
Last edited on
Topic archived. No new replies allowed.