Bison Tracking

aka. How to create a parser in C++ without using Flex or Bison PART 1.


Revision: 2nd

Why the Fortran would anyone want to do this? (Pardon my language). It's because C++ parsers that you make yourself:

-Produce more human friendly code
For starters, the files that Flex and Bison output can be very hard to understand, even if they do have some decent comments. This can make them hard to integrate into other programs. However, since you wrote most of this and hopefully read this article, you have an idea of what's going on.

They also have a tendency of being large.

-Can have more tokens of lookahead when generating an LALR parser
Bison supports only one token of lookahead when not generating a GLR parser. That means that if you have a grammar like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
%{
	#include <stdio.h>
%}

%token AND PLOW HORSE MULE OX CART

%%
PHRASE : work_animal AND PLOW
	|	cart_animal AND CART
	;
 
cart_animal : HORSE
	|	MULE
	;
 
work_animal : HORSE	
	|	OX	
	;
%%


...Bison will not be able to create a parser out of it.

girlfriend:~ albatross$ bison /Users/albatross/Desktop/impossible.y
/Users/albatross/Desktop/impossible.y: conflicts: 1 reduce/reduce
/Users/albatross/Desktop/impossible.y:16.15-19: warning: rule never reduced because of conflicts: work_animal: HORSE
girlfriend:~ albatross$


If you eliminate all instances of AND, Bison will happily chew it up with no problems. That's one token of lookahead. Sometimes, you will need more than that.

-Are far more customizable
Let your imagination run wild on this one.


This article will teach you how to create a simple arithmetic parser, which you may then use as a framework for other parsers. So sit back, get a cup of something to drink and keep you awake through my inconsequential ramblings of redundancy, and enjoy those ramblings... OR ELSE!

Prerequisites:
Must be VERY familiar with the C++ standard library (especially deques and strings), and understand C++/ENGLISH (in other words, be a semi-experienced C++ programmer).

Oh, and you must know what Flex and Bison are, what they generate, and how to use them.

WANRING: "Evil" features of C++ may be used in this article. If you do not feel comfortable with using said techniques, you may now kiss the bride and then forever hold your peace. (out)

-General idea

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
 using namespace std;

//Given this struct:
struct uraldamage
{
   deque<float> deque1;
   deque<int> deque2;
   deque<bool> isoper;
   int charcount;
   int prioritycount;
   int maxpriority;
};

//Your program should be structured like this:
int preprocess(string & stringtheory);
int lexer(string & kitten, uraldamage & groindamage);
int parser(uraldamage & donotusethisasavariablename);
int main()
{
   cout<<"Gimme an arithmetic problem!\n";
   uraldamage data;
   string storeage;
   getline(cin, storeage, '=');
   preprocess(storeage);
   lexer(storeage, data);
   parser(data);
   cout << data.deque1.front();
   return 1337;
}


Reading from files will be discussed in the next part.

NOTE: We're using deques instead of maps for a reason. Can you guess what it is?

-Your Preprocessor
This is just a simple lexer and occasionally parser that will make your code a bit easier to understand for your main lexer. It's always a good idea to have one; it makes your code more readable.

In this case, we want to get rid of all the white spaces and commas; our grammar doesn't need 'em. Just use cctype's iswhitespace(a_char); along with a counter a_char==','; to determine which characters to erase(counter,1). Return zero.

-Your Lexer

A parser without a lexer is like a banana split with a pickle instead of a banana (eww).

Here, we are converting our string into tokens: little bits of data that have a value associated with them that our parser can manage better.

Note: when I say read into a deque, I mean use push_back().
G#: when I say delete from a deque, I mean use erase().
First counter/integer refers to groindamage.charcount.
Second counter/integer refers to groindamage.prioritycount.
Third counter/integer refers to groindamage.maxpriority.

First, check if the first character in the string is a minus sign. If it is, then add a zero just before it, then read 0 and the ASCII value of '-' into deque1, 4 and 4 into deque2, and false and true into isoper. Then delete the minus sign in the string, and continue. You will also need to ensure that the next token gets a value of 3 instead of 0 (see later).

If you didn't find the minus, skip that bit.

Use find_first_of("+-*/%^()") to determine the location of the first operator. Store it in the first integer. Then, use atof() to read everything from the first character to but excluding that first operator into deque1. Read 0 into deque2, and false into isoper.

Then read the next character's ASCII value into deque 1, 0 into deque2, and true into isoper.

Then erase everything in your string up to and including your operator. Repeat until your string is empty, then return zero.

-Your Parser

Now for the fun bits. Anyone who has done any arithmetic will know that multiplication holds priority over addition and subtraction. They will also know that raising something to a power holds priority over multiplication, and that statements in parenthesis are always evaluated first. Those of us who are also looking carefully will notice that we also need to ensure that multiplication by minus one must occur before any of these operations, and that in our current system it will not happen first but rather LAST. Now we know why we had deque2: to determine the order of operations.

You will need to run over the contents of deque1, and clear out any parenthesis. Read token by token, setting the value of the corresponding element of deque2 to the value of prioritycounter + its original value (which should have been intialized to zero... is it too late to tell you that?).

If at any point, prioritycounter > maxpriority, then maxpriority = prioritycounter.

If you see an addition or subtraction sign, then set the corresponding deque2 value to (original value + 0 + prioritycounter).

If you see a multiplication or division or modulo sign, then set the corresponding deque2 value to (original value + 1 + prioritycounter).

If you see a power sign, then set the corresponding deque2 value to (original value + 2 + prioritycounter).

If you see an opening parenthesis, increment prioritycounter by three, and delete the token.

If you see a closing parenthesis, decrement prioritycounter by three, and delete the token.

Once that's done, the parser is going to run through the program (maxpriority + 1) times, starting from the tokens that have the highest deque2 value. It will evaluate an expression in deque1 that has values in isoper of false true false, and evaluate it to get a float back out. It will then eliminate those three elements using erase, and write in their place to deque1 the resulting float, in deque 2 the original priority minus one, and to isoper false.

When the end of the deque is reached, scan and evaluate the next lower priority level.

When you get to priority 0, then you can just proceed linearly, evaluating three tokens, pop_front()ing them, and push_fronting the resulting value and restarting the operation. When your deques all reduce to a size of one, then return zero.

-Afterward

The first time you compile this, you might get loads of fatal errors. Take your time and work through them; this was no easy task. The good news is that the worst is over now that you created your first parser: You have a framework for creating new parsers from that.

If this article is popular enough, I will create a second article on parsing a more human language, and while I'm at it solve the tokens of lookahead problem.

This article is subject to change without notice.

-Albatross

Last edited on
I have to disagree:
It's because C++ parsers that you make yourself:
-Are more human friendly
The part of a bison file which is not in C ( or C++ ) is a common notation of the grammar, which is more readable than a parser code and you'll have to write something like that anyway to know what you are doing
For starters, the files that Flex and Bison output are very hard to read
Not really, plus you don't have to read them since you shouldn't modify them
and use a little bit more than the C or C++ standard libraries.
This is false, they use standard C library, other headers only if specified via some macro definitions
-Are far more customizable
Write grammars that rely on whitespaces, context, some other files… you're pretty much free to do what you want in terms of grammar and lexical structures.
You can do these things using flex and bison too
Edited for correctness. Thank you for your input.

Though I'm pretty sure Bison requires linkage with a separate library... or am I wrong? I know that Flex requires none...

And where is that common notation in the file?

-Albatross

EDIT: 500 posts and counting.
Last edited on
No bison requires only few components of the standard C libraries
The notation is the Backus–Naur form
What sort of parser is this? LALR? Some other type of LR? LL? Recursive descent? This is important information.

This thread may be relevant to your interests: http://www.cplusplus.com/forum/lounge/11845/
Last edited on
Thank you for asking.

The parser appears to act as a bottom-up parser with zero tokens of lookahead that constructs a rightmost derivation, however I'm not 100% sure if I can classify it as LR(0).

-Albatross
Oh, and I just realized. The problem with that grammar is not that it requires 2 lookahead tokens. It's that HORSE can be possibly reduced to two nonterminals. Bison doesn't reject grammars that require a LALR(k>1) parser to be unambiguous. I'm not even sure if it gives a warning.

It's possible to parse a language that need more than one LA token using Bison. You just have to tweak the grammar to use the parser's stack. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
%right ITOA ATOI
%left '+'
%left STRING_CONCATENATION
%left '(' ')'
%%
string:
	STRING |
	ITOA expr |
	'(' string ')' |
	string '+' string %prec STRING_CONCATENATION ;

expr:
	INTEGER |
	expr '+' expr |
	ATOI string ;
%%
If STRING_CONCATENATION and '+' had the same precedence, this grammar would be ambiguous in the case atoi "1" + 2. The parser would reject the input for trying to add a string and an integer.
With the added precedence, the parser will shift the tokens and wait a bit to see what comes next in the input. Depending on what comes, it will reduce using the appropriate rule.
Edited for correctness.

-Albatross
Topic archived. No new replies allowed.