Making a programming language

Pages: 1234
I have, but only when using badly written inline assembly...
Could anyone give me an example of string parsing in C++? I want to grasp this subject as soon as circumstances permit.
http://www.cplusplus.com/forum/lounge/11845/
The complete parser is here: http://www.cplusplus.com/forum/lounge/11845/#msg56331
The explanation of the parsing mechanism is here: http://www.cplusplus.com/forum/lounge/11845/#msg56388
Note that the parser has a semantical bug that causes ^ to have the lowest precedence.
I kind of get how this parser is supposed to work.. you have tokens and rules, a statement is only valid if it "equals" a rule. This rule is, finally, subdivided into tokens. Is that right? Also, taking this concept into account, would it be possible to recursively define a rule as a token?
No, not quite.

There are two kinds of objects: terminals (also known as tokens), and non-terminals. A terminal is the smallest lexical unit: an identifier, a literal constant, a plus sign, etc. A non-terminal is a complex object made up of one or more objects. Examples of non-terminals are expressions, statements, code blocks, complete programs, etc. Parsing a stream of tokens involves, explicitly or implicitly, building a tree (an abstract syntax tree) whose branches are non-terminals and whose leaves are terminals. For example, the input "1+2*3" can be translated to the following AST:
expr(expr(INTEGER),PLUS,expr(expr(INTEGER),MUL,expr(INTEGER)))
The process of building a non-terminal from one or objects is called reducing. This is where the rules come in. A rule tells the parser exactly what sequence of objects can be reduced into what non-terminal, and under what conditions. For example, the sequence expr PLUS expr shouldn't be reduced to a single expr if the following token is MUL (because then 1+2*3 == (1+2)*3).
My friend and me have gotten into some more detail and we're making some progress, finally. :P I'm going to try and write a more simpler LR parser, first.

So.. reducing is basically converting more terminals into non-terminals, based on the defined rules?

EDIT:
You also said that a literal constant is one of the "smallest lexical units", but isn't a normal integer capable of being built out of more than one number (12 consists out of 2 numbers, for example)?
Last edited on
So.. reducing is basically converting more terminals into non-terminals, based on the defined rules?
Not necessarily terminals. You can reduce a string of non-terminals into a single non-terminal.

isn't a normal integer capable of being built out of more than one number (12 consists out of 2 numbers, for example)?
Don't confuse terms.
Number: a value that represents a distance from a point to the point of origin, generally on the Real line.
Numeral: a symbol used to represent a number. E.g. the character '5', or a 5 V current flowing through a conductor.
Digit: in a positional system, a position on [the string of numerals that represents a number]. E.g. in the decimal string "1024", '4' isn't a digit, '4' is the value digit 0 holds.

A token is a unit of meaning. The characters '1' and '2' have no meaning in themselves, but when they appear adjacent to each other they create the string "12", which can have meaning. It can represent how many eggs are in a fridge, or a distance in meters. A simple character can't do any of these without first being assigned a meaning when it appear by itself. E.g. in C '/' means "division" when it appears by itself, but it losses meaning when there's a '*' immediately after.
It's possible to write a parser that treats numerals as tokens and then strings them together by reducing expressions, but in practice this is never done because it's much more efficient to let the lexer do it. And besides, treating literals as non-terminals is conceptually obtuse.
helios wrote:
... because it's much more efficient to let the lexer do it.
What's a lexer? (Sorry for not bothering you all the time. :P)
Man I'm amaze how much knowledge you guys have. I never knew a compiler theory exist untill helios mentioned it.

This has been interesting for me. Can anyone tell me how and where to learn more about parsing and bytecode? Is a .swf file (flash) considered a bytecode?

Open course ware would be nice :)
The Dragon Book is considered the book on compiler theory; and yes, it is.
An off-topic @ chrisname:

and given that the vast majority of people on the Internet are American


...ahem...

Given that a false proposition is true, I can prove anything to you.

http://en.wikipedia.org/wiki/File:DSL_countries.png
http://www.internetworldstats.com/top20.htm
Ok, the vast majority of people on the Internet (who aren't Chinese) are American. I thought I saw somewhere that Americans made up the majority of Internet users; perhaps it was just English-speaking Internet users and I didn't notice.
I'm going to try to get my hands on the Dragon Book in a minute. Helios, I re-checked your parser code, and it's becoming more clear what it does by the second, just a few things:
You define this as your constructor for the Rule-struct:
1
2
3
4
5
6
7
8
9
Rule(const Token &to,const Token &la,unsigned constraints,...){
	this->reduces_to=to;
	this->lookahead=la;
	va_list list;
	va_start(list,constraints);
	this->constraints.reserve(constraints);
	for (unsigned a=0;a<constraints;a++)
		this->constraints.push_back(va_arg(list,Token::TokenType));
}

I understand that you reduce more tokens to "types" (Token::INT or Token::NULL), to be able to handle them as these respective types if necessary. Now, what does the lookahead token do, what are constraints and what do the triple periods indicate?

Thanks in advance.

EDIT:
Figured that the constraints indicate the amount of arguments after that point, and that the lookahead indicates the precedence the rule has. .. Am I right on this? (Still want to know how to use the triple-periods properly, though)
Last edited on
the lookahead indicates the precedence the rule has
No. The precedence is derived from the order in which the rules are "declared". The lookahead is used to define associativity.

The ellipsis is for variadic arguments.
What if my operators (defined in the enum types) would have to span more than one character? (like ==, >=, <=, not, or and and)
Last edited on
That's handled by the lexer. You could define your operator as is equal to without changing a single line in the parsing algorithm (other than the lexer, obviously).

PS: Next time you need to say something else, either make a new post or delete your previous one. I only saw your edit by coincidence.
I just decided to pop down to this thread to wish Kyon good luck with his language. I'd also like to one-up the Dragon Book, as well as most of what helios said in his posts.

-Albatross
So the only thing the parser really does is reduce terminals to terminals/non-terminals, and throw an error now and then?
What is different between this->lookahead=la; and lookahead = la;?

PS: Will keep that in mind for in the future. ;)

EDIT: Albatross, thanks mate. ;)
Last edited on
The parser also takes appropriate action depending on a pattern of terminals.

EDIT:
http://www.codersource.net/c/c-tutorials/c-tutorial-this-pointer.aspx

-Albatross
Last edited on
Pages: 1234