Are lexers generally ugly?

I'm trying to write a lexer for a language that's a bit like GLSL (I had an idea to write a window system that is controlled entirely by things similar to shader programs which are compiled in the client and then sent as bytecodes to the server to tell it how to render things; I also had an idea to create a windowing system for text displays, so I decided to combine them [I'm crazy]). I've never written a lexer before (not a complex one, anyway), so I don't know how it's going to turn out or what it should look like. Currently, it's basically a for loop and a giant switch, with some other switch statements and for loops embedded in it (it's ugly, but it'll be refactored into lots of functions to make it less ugly once I get to some kind of milestone).

It scans the string character by character; when it sees an interesting character it can either append it to an array or append everything between that character and an end character to the array (for strings and the like). Then there's another function which will read the token stream and convert it into bytecode, while checking that the token stream actually makes sense. I'll do that by having a function that takes two tokens as parameters and decides whether or not they should go next to each other (so f(INT, INT) returns false, while f(INT, VARIABLE_NAME) returns true [no, I won't really call the function 'f']).
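A minimal sketch of what that adjacency check might look like (the token names here are hypothetical, just for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical token kinds, for illustration only. */
typedef enum { TOK_TYPE_INT, TOK_VARIABLE_NAME, TOK_OPERATOR_BI } tok_kind;

/* Decide whether token `second` may legally follow token `first`. */
static bool may_follow(tok_kind first, tok_kind second)
{
    switch (first) {
    case TOK_TYPE_INT:
        return second == TOK_VARIABLE_NAME;   /* "int x" yes, "int int" no */
    case TOK_VARIABLE_NAME:
        return second == TOK_OPERATOR_BI;     /* "x +" yes, "x y" no */
    default:
        return true;
    }
}

int main(void)
{
    printf("%d %d\n", may_follow(TOK_TYPE_INT, TOK_VARIABLE_NAME),
                      may_follow(TOK_TYPE_INT, TOK_TYPE_INT));    /* 1 0 */
    return 0;
}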

I've looked at some very simple Flex-generated lexers and they were pretty ugly - it was about 2,000 LOC just for something that printed "start" or "stop" depending on which you typed.

I was conscious while writing this that it's probably quite jumbled, because I kind of wrote things as I thought of them, and also because of sleep deprivation.
I'll admit, I think it's quite hard to make a pretty lexer. By their nature, they're rather icky because they often have to handle a lot of different character combinations that aren't easy to bunch up into a few distinct categories (and being able to bunch them up is what makes such code a LOT prettier). It's especially true with Flex, which doesn't know what you'll be doing with those tokens later.

Anyways. That's my opinion on the matter. :)

-Albatross
That's good. I'll just write it, get it somewhat working, and then try to split it into chunks.
You can start your design with a DFA and code each state as a function.
Since functions can hold local state, you can merge similar nodes or subgraphs into a single function.

Translating a lexer from a loop and switch into a functional automaton can be hard work if the lexer isn't trivial.
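As a minimal sketch of the idea (everything here is made up for illustration): one function per DFA state, with each state's self-loop merged into a while loop inside its function.

#include <ctype.h>
#include <stdio.h>

/* One function per DFA state; each consumes input and returns the new cursor.
   The self-loop of a state ("still inside a number") becomes a while loop. */
static const char *state_number(const char *s)
{
    const char *begin = s;
    while (isdigit((unsigned char)*s))
        s++;
    printf("NUMBER: %.*s\n", (int)(s - begin), begin);
    return s;
}

static const char *state_identifier(const char *s)
{
    const char *begin = s;
    while (isalnum((unsigned char)*s) || *s == '_')
        s++;
    printf("IDENTIFIER: %.*s\n", (int)(s - begin), begin);
    return s;
}

/* The start state skips whitespace and dispatches to the other states. */
static void lex(const char *s)
{
    for (;;) {
        while (isspace((unsigned char)*s))
            s++;
        if (*s == '\0')
            return;
        if (isdigit((unsigned char)*s))
            s = state_number(s);
        else
            s = state_identifier(s);
    }
}

int main(void)
{
    lex("int foo 42");
    return 0;
}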

Then there's another function which will read the token stream and convert it into bytecode, while checking that the token stream actually makes sense. I'll do that by having a function that takes two tokens as parameters and decides whether or not they should go next to each other
You can do all this in a single pass with a parser.
Or in two passes with a parser that generates a tree structure and a generator that translates that tree.
You can write a simple predictive parser in a similar way that you'd write the lexer.
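For example, a minimal predictive (recursive-descent) sketch over a canned token stream (every name here is hypothetical):

#include <stdio.h>

/* Hypothetical token kinds; a real grammar would have many more. */
typedef enum { TOK_NUMBER, TOK_PLUS, TOK_END } tok_kind;

/* A canned token stream standing in for the lexer's output. */
static const tok_kind stream[] = { TOK_NUMBER, TOK_PLUS, TOK_NUMBER, TOK_END };
static int pos;
static tok_kind cur;

static void advance(void) { cur = stream[pos++]; }

static void expect(tok_kind k)
{
    if (cur != k) {
        printf("syntax error at token %d\n", pos);
        return;
    }
    advance();
}

/* The grammar rule expr := NUMBER { '+' NUMBER } becomes one function,
   just as each lexer state was one function. */
static void parse_expr(void)
{
    expect(TOK_NUMBER);
    while (cur == TOK_PLUS) {
        advance();
        expect(TOK_NUMBER);
    }
}

int main(void)
{
    advance();              /* load the first lookahead token */
    parse_expr();
    if (cur == TOK_END)
        printf("parsed OK\n");
    return 0;
}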
My opinion is: first create a short-code version, e.g. calling functions with 4-5 characters (the first two being the ID, or use the DLL name & function name), followed by some parameter IDs, stored in something like:
char ParametersBuffer[256][64];
And maybe a parameter-type enum like
enum ParamType { PT_INT = 0, PT_LONG, PT_CHAR, PT_CHARPTR, PT_VOIDPTR };
and act in the desired way, or something like that.
Then create a long-code version, maybe looking like C as well, that your program will "convert" into the short-code version. I used to do this in my game-scripting engine experiments, and it worked pretty well; it's just a bit too complex to explain in depth on the fly.
@Bazzy,
The idea of using two passes was to have the lexer simply convert source code into tokens, without having to understand what the tokens mean. Then the compiler proper simply converts the tokens into bytecode, but at the same time it checks that the tokens make sense next to each other. Alternatively, in the lexer, I could have a variable that stores the tokens that are allowed to follow the current token. I think my current model is a little easier, though.

It shouldn't be too difficult to split it up into multiple functions; all I was going to do was have each case in the switch call a different function (so that I don't, at first glance, have loops nested in switches nested in a switch nested in a loop, which is what I have currently).
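For reference, the "allowed followers" alternative could be little more than a table of bitmasks, one per token kind; a minimal sketch (the token names are hypothetical):

#include <stdio.h>

enum { TOK_TYPE_INT, TOK_VARIABLE_NAME, TOK_OPERATOR_BI, TOK_COUNT };

/* For each token kind, a bitmask of the kinds allowed to follow it. */
static const unsigned allowed_next[TOK_COUNT] = {
    [TOK_TYPE_INT]      = 1u << TOK_VARIABLE_NAME,
    [TOK_VARIABLE_NAME] = 1u << TOK_OPERATOR_BI,
    [TOK_OPERATOR_BI]   = (1u << TOK_TYPE_INT) | (1u << TOK_VARIABLE_NAME),
};

static int token_may_follow(int first, int second)
{
    return (allowed_next[first] >> second) & 1u;
}

int main(void)
{
    /* "int x" is allowed, "int int" is not. */
    printf("%d %d\n", token_may_follow(TOK_TYPE_INT, TOK_VARIABLE_NAME),
                      token_may_follow(TOK_TYPE_INT, TOK_TYPE_INT));
    return 0;
}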
[edit]
But your idea of writing it as a sort of state machine is interesting. I might switch to that model. I don't know what you'd call what I have at the moment. It just scans the string for tokens (or the start of a token) and then figures out where the token ends, and tacks the whole thing onto the end of a buffer, along with an integer that says what the token is. It's just an array of these:
/** ttydel token types. */
typedef enum ttydel_token_type_ {
	/** Character constant. */
	TTYDEL_TOKEN_CONST_CHAR,
	/** Floating point constant. */
	TTYDEL_TOKEN_CONST_FLOAT,
	/** Integer constant. */
	TTYDEL_TOKEN_CONST_INT,
	/** String constant. */
	TTYDEL_TOKEN_CONST_STRING,
	/** Function definition. */
	TTYDEL_TOKEN_DEFN_FUNC,
	/** Type declaration. */
	TTYDEL_TOKEN_DECL_TYPE,
	/** Variable declaration. */
	TTYDEL_TOKEN_DECL_VAR,
	/** Binary operator. */
	TTYDEL_TOKEN_OPERATOR_BI,
	/** Unary operator. */
	TTYDEL_TOKEN_OPERATOR_UN,
} ttydel_token_type;

/** A ttydel token. */
typedef struct ttydel_token_ {
	/** The token type. */
	ttydel_token_type	type;
	/** The token as a string. */
	char*			value;
} ttydel_token;

I have some more token types to add (function calls, for example), but that's all I could think of at the time.
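For completeness, a minimal sketch of the growing array itself (everything except the two ttydel types above is hypothetical):

#include <stdlib.h>
#include <string.h>

/* A growable array of ttydel_token. */
typedef struct {
    ttydel_token *data;
    size_t        count;
    size_t        capacity;
} ttydel_token_buffer;

/* Copy the characters in [begin, end) and append them as a new token.
   Returns 0 on success, -1 on allocation failure. */
static int token_buffer_push(ttydel_token_buffer *buf, ttydel_token_type type,
                             const char *begin, const char *end)
{
    size_t len = (size_t)(end - begin);
    char *value;

    if (buf->count == buf->capacity) {
        size_t cap = buf->capacity ? buf->capacity * 2 : 16;
        ttydel_token *p = realloc(buf->data, cap * sizeof *p);
        if (p == NULL)
            return -1;
        buf->data = p;
        buf->capacity = cap;
    }
    value = malloc(len + 1);
    if (value == NULL)
        return -1;
    memcpy(value, begin, len);
    value[len] = '\0';
    buf->data[buf->count].type = type;
    buf->data[buf->count].value = value;
    buf->count++;
    return 0;
}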

@EssGeEich,
For the functions, I want to allow function overloading so I'm going to take the name of a function and add its return type and the types of its parameters together (name mangling) like this:
int foo(int bar) => foo_int_int
int foo(float bar) => foo_int_float

That should make it possible for functions to differ not only in their parameters, but also in their return types. Then, as long as every call to every function is mangled in the same way, there don't have to be any special cases. All the lexer needs to know is which version of the function is being called, which it can figure out quite easily by looking at the symbols immediately before and after the function call, building a mangled function name, and then looking in the symbol table for a match.

The symbol table will be a list of global symbol names and their types generated by the lexer (every function then has its own local symbol table) and completed by the compiler, which will go through the symbol table(s), find the index into the bytecode buffer, and write it down. Then the whole thing just gets sent as a single buffer to the server, which simply runs the program and uses it to figure out what to write and where. It sounds insecure, but the purpose of the bytecode programs is just to generate data to be drawn to the screen; there shouldn't be any way to exploit it (that's not to say no-one will find one, but if they do, it'll be something really obvious like a buffer overrun; and that's assuming anyone else ever looks at the source code or uses the program in any way).
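A rough sketch of that mangling scheme (everything here is hypothetical, type names arrive as strings, and snprintf truncation is only loosely handled):

#include <stdio.h>

/* Build "name_returntype_param1_..." into `out`. */
static int mangle(char *out, size_t outsz, const char *name, const char *ret,
                  const char **params, int nparams)
{
    int n = snprintf(out, outsz, "%s_%s", name, ret);
    for (int i = 0; i < nparams && n > 0 && (size_t)n < outsz; i++)
        n += snprintf(out + n, outsz - n, "_%s", params[i]);
    return n;
}

int main(void)
{
    char buf[128];
    const char *params[] = { "int" };
    mangle(buf, sizeof buf, "foo", "int", params, 1);
    printf("%s\n", buf);    /* prints foo_int_int */
    return 0;
}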
Well, a little suggestion about function names:
What about:
int foo(int bar) -> (ParametersCount)(TTYDEL_TOKEN_CONST_INT)(TTYDEL_TOKEN_CONST_INT)foo
So instead of using 3+1+3+1+3+null terminator (12 bytes) it will use 3+1+1+1+null terminator (7 bytes)

Also, this allows function names to contain underscores (which the underscore-separated mangling wouldn't).
So, in that example, it'd be "133foo\0"? I could do it like that. It doesn't really matter what the functions are called, because nothing ever has to parse them; the only requirement is that symbol names are distinct from non-symbols and each overload has a different mangled name.

I think I will do it your way, although the number of parameters doesn't need to be encoded: the only time it would matter is when overloads take the same types and differ only in how many parameters there are, and even that is taken care of by just encoding the type of each parameter.

In other words, the way I'll do it is like this:
int foo(int bar) -> foo33 /* foo int int */
int foo(int bar1, int bar2) -> foo333 /* foo int int int */

Even though it doesn't matter in the bytecode, I think it's better for function names not to start with a number.
All the lexer needs to know is which version of the function is being called, which it can figure out quite easily by looking at the symbols immediately before and after the function call, building a mangled function name, and then looking in the symbol table for a match.
You're giving the lexer too much responsibility. The only output from a lexer should be "IDENTIFIER LPAREN NUMBER COMMA IDENTIFIER RPAREN" with possibly extra data attached to each token (e.g. an IDENTIFIER token should know the name of the identifier). Figuring out what the tokens mean when put together should be up to the parser.
Wow, welcome back (again).

You're right, I suppose name mangling doesn't really belong in the lexer. The only other things it does are pretty much what you said - an enum value saying what the token is, and then the value of the token itself (for constants and identifiers).

I'll move that to the subroutine that converts tokens into bytecode (parser/compiler/translator, whatever you want to call it).
I've looked at some very simple Flex-generated lexers and they were pretty ugly

If you're using C++, take a look at boost.spirit then: http://www.boost.org/doc/libs/release/libs/spirit/doc/html/index.html
Even though it doesn't matter in the bytecode, I think it's better for function names not to start with a number.

Chrisname:
Take care: what if the user creates a function whose name ends with a number? Then you can't tell whether that digit is part of the name or one of the parameter codes.
So, in that example, it'd be "133foo\0"?

Not really, it'd be "\1\3\3foo\0" (note that I'm not using atoi/itoa to turn the values into text, but writing the raw byte codes). But then you'd be limited to 255 types; maybe using two chars for each parameter would be enough.
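A minimal sketch of writing those raw bytes (all names are hypothetical, type IDs are assumed to start at 1 so they never collide with the terminating '\0', and `out` must be large enough for the count, the types, the name, and the terminator):

#include <stdio.h>

static size_t encode_name(char *out, unsigned char nparams, unsigned char ret,
                          const unsigned char *params, const char *name)
{
    size_t n = 0;
    out[n++] = (char)nparams;      /* one raw byte: limited to 255 parameters */
    out[n++] = (char)ret;          /* one raw byte: limited to 255 types      */
    for (unsigned char i = 0; i < nparams; i++)
        out[n++] = (char)params[i];
    while (*name != '\0')
        out[n++] = *name++;
    out[n] = '\0';
    return n;                      /* length, excluding the terminator */
}

int main(void)
{
    char buf[64];
    unsigned char params[] = { 3 };                 /* 3 = hypothetical "int" ID */
    size_t n = encode_name(buf, 1, 3, params, "foo");
    printf("%zu bytes + terminator = 7 total\n", n); /* n is 6 */
    return 0;
}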
I started out with enums just like you (following an example from helios, btw). However,
I play a lot with token types/language specs*. Also, I don't need fast parsing. That is why I stopped using enum tokens: adding a new token means
1) you have to add a token in your .h file,
2) you have to bind the token to the corresponding characters/word(s) on the input side,
3) you have to bind the token to the corresponding characters/word(s) on the output side (you don't want error messages like "token 1 must not follow tokens 5,6,3").

2) and 3) can already be merged with enums (though that kinda kills the purpose of enums), but you still have to make at least two .cpp edits to add a new token (I had to do 4+). Also, it is easier to introduce bugs with enums, and the result is harder to read.

Instead, I started using a hashed std::string-to-integer map. Now, adding a token is just
Commands.AddToken("+", "Token description", "Example");
When I reference tokens I simply use their human-readable names or, if I do need the actual token index, I do something like
Tokens.GetIndex("+").

Advantages of non-enum approach: much faster to edit, read/understand, maintain and experiment with.
Disadvantages: will be *considerably* slower on a large file.

*For example, at the moment I am adding the tensor operation, in addition to +, -, *, /, ^, ==, >, <, etc. [Edit:] I checked; at the moment I have 47 different syntactical tokens, so there are quite a few more.
@EssGeEich,
I had a realisation last night that TTYDEL_TOKEN_CONST_INT is not a type; it's for constants embedded in source code. For example, in C++, 500 is an integer constant. That was my intention all along; for some reason, I didn't notice the mix-up until then.

@tition,
It needs to be fast enough to compile several short programs all in one go, and then send the bytecodes to the server. A single program will probably never be large (I can't see them ever being more than a few hundred lines, if even that) but there could be a lot of them (potentially hundreds). They can be compiled and then the bytecode saved so that they don't have to be compiled again, but I think it's safer to assume that they will be compiled in one go at runtime.

Also, I forgot to mention that I'm using C, not C++.
@chrisname: It takes nothing to add TTYDEL_VAR_INT and such things to your enums.
@chrisname
Using C complicates things (though I've only tried C a little; wouldn't null-terminated strings make things easier to mess up?).

At any rate, whether it is C or C++, my advice reads: try to make binding a token to a string of characters one line of code only, even if that means dynamic binding. The flexibility benefits are big. Also, try making each of your grammar rules a single line of code, for the same reason.

Adding a new operator/keyword/syntax element should optimally* be N+1 lines of C/C++ code, where N is the number of grammar rules your operator participates in.

[Edit:] *I am not quite there yet: for me the number of code lines is 2N+1.
[Edit:] Dynamic binding adds a not-so-small amount of extra flexibility; for example, registering pointers to handler functions. My token registration is something like:
Commands.AddToken("+", pointer_to_handler_function, "Token description", "Example"); //<-in addition, the AddToken 
//function checks whether the token string has already been used,
// (are there more than one "+"'s?) 
//and makes a gracious full program crash if that is not the case.
//believe it or not, this happened more than once: quite often 
//I find myself copying+pasting syntax rules 
@tition: C++ strings are slower than C "strings" (char pointers). Personally, I prefer C's way of doing things over C++'s out-of-the-box functionality, but I prefer C++ for its classes.
@tition,
I'll take your advice. Using C shouldn't make it that much harder; I can use a stripped-down, modified version of my dictionary implementation in C.
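Something like this, maybe, as a stripped-down sketch in C (all names are hypothetical; a real version would hash instead of scanning linearly, but the shape is the same: one table, one line per token):

#include <stdio.h>
#include <string.h>

typedef struct {
    const char *text;           /* what the token looks like in source */
    const char *description;    /* usable directly in error messages   */
} token_def;

static const token_def tokens[] = {
    { "+",  "addition operator"    },
    { "-",  "subtraction operator" },
    { "==", "equality operator"    },
    /* adding a token is one new line here */
};

static int token_index(const char *text)
{
    for (size_t i = 0; i < sizeof tokens / sizeof tokens[0]; i++)
        if (strcmp(tokens[i].text, text) == 0)
            return (int)i;
    return -1;
}

int main(void)
{
    int i = token_index("+");
    printf("'+' is token %d (%s)\n", i, tokens[i].description);
    return 0;
}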

@EssGeEich,
I think it's better to have the name rather than a number, because it allows for user-defined types. It does use more data, but not very much (in most cases).

I like C because it's very simple and I have full control. It can be a bit annoying when it's being difficult, but that applies to all languages.
I think it's better to have the name rather than a number, because it allows for user-defined types.

You could "bind" a user-defined type to another ID. But well, your project, your rules, and you're also right. I'm suggesting. ^
It's not that I don't appreciate suggestions, I just think it's easier that way.
Topic archived. No new replies allowed.