Making a programming language

1. Do you have any experience in compiler theory? If not, either grab the Dragon Book, or write an expression interpreter in Yacc/Bison. The latter is a fairly good starting point, because you can get useful results pretty quickly while you get acquainted with the internal details of the compilation process, some CT terminology, and, most important of all, Backus-Naur Form.
2. Generating native code is quite a daunting task. If you take a look at modern, native, very high level language compilers, they generate C and pass the output to gcc. There's a lot of low level bureaucracy to be dealt with that isn't directly related to the main problem (the compiler itself). If you think you're up for it, do it, but I wouldn't recommend it.
Another frequent solution is to generate bytecode and then run it on a virtual machine. That avoids such problems as parsing the same statement more than once, which pure interpreters have.
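As a rough illustration of the expression interpreter from point 1, here is a minimal sketch. It is a hand-written recursive-descent evaluator, not Yacc/Bison output, and the function names (evaluate, parse_expr, and so on) are made up for this example. The grammar it follows is written in Backus-Naur Form in the comment at the top.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Grammar (BNF):
 *   <expr>   ::= <term>   | <expr> "+" <term>   | <expr> "-" <term>
 *   <term>   ::= <factor> | <term> "*" <factor> | <term> "/" <factor>
 *   <factor> ::= <number> | "(" <expr> ")"
 */

static const char *cursor;   /* current read position in the input */

static long parse_expr(void);

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cursor))
        ++cursor;
}

/* <factor>: a number or a parenthesised expression */
static long parse_factor(void)
{
    skip_spaces();
    if (*cursor == '(') {
        ++cursor;
        long inner = parse_expr();
        skip_spaces();
        if (*cursor == ')')
            ++cursor;
        return inner;
    }
    char *end;
    long value = strtol(cursor, &end, 10);
    cursor = end;
    return value;
}

/* <term>: factors joined by '*' or '/', evaluated left to right */
static long parse_term(void)
{
    long value = parse_factor();
    for (;;) {
        skip_spaces();
        if (*cursor == '*') {
            ++cursor;
            value *= parse_factor();
        } else if (*cursor == '/') {
            ++cursor;
            long divisor = parse_factor();
            /* No error reporting in this sketch: division by zero yields 0 */
            value = divisor != 0 ? value / divisor : 0;
        } else {
            return value;
        }
    }
}

/* <expr>: terms joined by '+' or '-', evaluated left to right */
static long parse_expr(void)
{
    long value = parse_term();
    for (;;) {
        skip_spaces();
        if (*cursor == '+') {
            ++cursor;
            value += parse_term();
        } else if (*cursor == '-') {
            ++cursor;
            value -= parse_term();
        } else {
            return value;
        }
    }
}

/* Entry point: evaluates an arithmetic expression given as text */
long evaluate(const char *text)
{
    cursor = text;
    return parse_expr();
}
```

Each grammar rule maps to one function, which is why operator precedence falls out of the call structure: evaluate("1 + 2 * 3") multiplies before it adds. A Bison grammar would express the same rules declaratively and generate the parser for you.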

I've not tried making a programming/scripting language yet, but I have made a very basic bytecode interpreter. They're incredibly simple (depending on your language).

The following is a skeletal example:

void runprogram(struct registers* reg, const char* buffer, size_t sz)
{
	for (; reg->ip < sz; ++(reg->ip)) {
		char instruction = buffer[reg->ip];
		/* Number of argument bytes that follow this opcode */
		int argc = get_opcode_argument_count(instruction);

		if (argc > 0) {
			/* Stop if the program ends in the middle of an instruction */
			if ((size_t)argc > sz - reg->ip - 1)
				break;

			char arguments[argc];

			/* Pre-increment so the arguments start after the opcode byte */
			for (int i = 0; i < argc; ++i)
				arguments[i] = buffer[++(reg->ip)];

			do_instruction(instruction, arguments, argc);
		} else {
			do_instruction(instruction, 0, 0);
		}
	}
}

helios wrote:
Generating native code is quite a daunting task. If you take a look at modern, native, very high level language compilers, they generate C and pass the output to gcc.

That's quite an adequate solution; it's so simple I wouldn't have thought of it myself. :P The "compiling process" would then be reduced to translating language "x" into C and letting GCC do the tough work. I'll be looking into Yacc/Bison, too.
Could you explain exactly what bytecode is and how this has a relation to virtual machines?

Chrisname, thanks for the example, but I really need a bit of an explanation here. :P

Bytecode is just compiled code that is executed by a virtual machine. In my example, "buffer" would contain bytecode.

Bytecode is analogous to native program code that's executed by a physical processor.
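To make that concrete, here is a minimal sketch of a stack-based virtual machine. The opcode numbers, the layout of struct registers, and the function name run_bytecode are all assumptions made for this example, not part of the skeleton posted earlier.

```c
#include <stddef.h>

/* Hypothetical opcodes for this sketch only */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2, OP_MUL = 3 };

struct registers {
    size_t ip;        /* index of the next byte to execute */
    int stack[16];    /* operand stack */
    size_t sp;        /* number of values currently on the stack */
};

/* Runs the bytecode in buffer and returns the value on top of the stack
   (0 if the stack is empty or the program is malformed). */
int run_bytecode(const unsigned char *buffer, size_t sz)
{
    struct registers reg = { 0 };

    while (reg.ip < sz) {
        unsigned char instruction = buffer[reg.ip++];

        switch (instruction) {
        case OP_PUSH:
            /* One argument byte follows the opcode: the value to push */
            if (reg.ip >= sz || reg.sp >= 16)
                return 0;
            reg.stack[reg.sp++] = buffer[reg.ip++];
            break;
        case OP_ADD:
        case OP_MUL: {
            /* Pop two operands, push the result */
            if (reg.sp < 2)
                return 0;
            int right = reg.stack[--reg.sp];
            int left = reg.stack[--reg.sp];
            reg.stack[reg.sp++] = instruction == OP_ADD ? left + right
                                                        : left * right;
            break;
        }
        case OP_HALT:
        default:
            /* Stop on HALT or on an unknown opcode */
            reg.ip = sz;
            break;
        }
    }
    return reg.sp > 0 ? reg.stack[reg.sp - 1] : 0;
}
```

With these opcodes, the byte sequence { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT } is the bytecode for (2 + 3) * 4. The virtual machine plays the role of the processor, and the buffer plays the role of the machine code.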

I just got a horribly good idea for the language I will "create":
eC (pronounced: Easy)
It will be C with a Basic-like syntax. Looking back on the route that brought me to where I am now, I thought of where I started: reading an old book my father gave me from his class on Basic (which was the way to go in those days). The syntax was crystal clear, and although I never laid my hands on a compiler, I could see the code come to life in front of me.
My focus will be to break programming down to "interpretable pseudocode", making it a good starting point for the absolute beginner. I want to achieve this by implementing a string parser that converts eC to C (after which the result is sent to GCC for full compilation). To make eC even easier to use, it will use language packages for localization. Since the focus is not on creating the most efficient code but on guiding people through their very first steps, it will support only variables of the types unsigned char, long int and bool. A simple example (using the English package) could be:

g = 45
loop 0 till 9 with i
present i+g
end loop

Using the Dutch lang-package, you could have the following code:

g = 45
lus 0 tot 9 met i
presenteer i+g
einde lus

Either would (when the eC "compiler" is told which package to use) compile to a C program that shows the values 45 through 54.
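As a rough idea of what the eC to C string parser could do for a single statement, here is a minimal sketch that translates the English loop line into a C for-header. The function name translate_loop is hypothetical, the keywords are hard-coded rather than read from a language package, and the upper bound is assumed to be inclusive because the example is meant to show 45 through 54.

```c
#include <stdio.h>

/* Translates "loop <from> till <to> with <var>" into a C for-statement header.
   Returns 1 on success, 0 if the line is not a loop statement. */
int translate_loop(const char *line, char *out, size_t cap)
{
    long from, to;
    char var[32];

    /* %31s leaves room for the terminating '\0' in var */
    if (sscanf(line, "loop %ld till %ld with %31s", &from, &to, var) != 3)
        return 0;

    /* "till" is assumed to be inclusive, hence <= */
    snprintf(out, cap, "for (long int %s = %ld; %s <= %ld; ++%s) {",
             var, from, var, to, var);
    return 1;
}
```

For the line "loop 0 till 9 with i" this writes "for (long int i = 0; i <= 9; ++i) {" into out. A real translator would also have to handle "end loop", expressions in the bounds, and error messages, but the principle stays the same: recognise a statement, emit the matching C.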
Since I don't want this to be a fancy language with infinite possibilities, it would be limited to standard I/O, file I/O, simple arithmetic (+, -, *, /, % and some logical operators), functions and other things.
This is all just an idea in my head (I haven't even told my project partner about it) and there are lots of things I haven't thought about yet. Just tell me what you think and what could be added.

Finally, at chrisname and helios:
Thanks, both of you, for explaining that term. One more thing: what are reg and sz in your example, chrisname?

reg is a hypothetical structure containing the register state of the virtual machine (a register is a small area of storage: 1-4 bytes on x86, 1-8 bytes on x86_64). reg->ip is the instruction pointer (it tells the CPU where in memory the current instruction is), and sz is just the number of elements in the buffer.

With regard to having a multi-lingual language: while it is easier to learn for non-English speakers, I don't think it's such a good idea. Not only does it increase the complexity of the compiler, but it also introduces a language barrier within the language! Imagine asking someone for help with your eC code all nicely written with the Icelandic language pack. Very few people speak Icelandic, and given that most programming discussion on the Internet happens in English, you would be hard-pressed to get help. I know you could solve this problem with a machine translator, but it doesn't seem worth it. You're then writing two translators: one to translate each natural language into a different natural language, and one to translate THAT code into C! And then you're invoking a third translator (gcc) to translate the C into ASM!

Disch (13742)

I know you could solve this problem with a machine translator,

This opens other problems, as language keywords might conflict with variable names.

For example, "present" would be a keyword in the English language pack, but not in the Dutch language pack. So you could use "present" as a variable name in a Dutch program. If you translate that to English, all of a sudden your variable names are confused with keywords.

Oh yeah, that complicates matters further. So yes, it's complicated. You can't make it work. With that system, the only way to be sure you never cause a conflict would be to know EVERY language that the language has been translated to!

Alternatively, there should be some way to tell keywords and identifiers apart, possibly through some form of prefix or suffix.
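A minimal sketch of the prefix idea, assuming a '$' sigil marks identifiers (the sigil, the enum, and the function name classify_word are all made up for this example):

```c
#include <ctype.h>
#include <string.h>

enum token_kind { TOKEN_KEYWORD, TOKEN_IDENTIFIER, TOKEN_INVALID };

/* Hypothetical rule: identifiers carry a '$' prefix, and every bare word
   must be a keyword of the active language pack. */
enum token_kind classify_word(const char *word,
                              const char *const keywords[],
                              size_t keyword_count)
{
    if (word[0] == '$') {
        /* At least one letter, digit or underscore must follow the prefix */
        if (word[1] == '\0')
            return TOKEN_INVALID;
        for (const char *p = word + 1; *p != '\0'; ++p)
            if (!isalnum((unsigned char)*p) && *p != '_')
                return TOKEN_INVALID;
        return TOKEN_IDENTIFIER;
    }

    /* No prefix: the word is only valid if the language pack lists it */
    for (size_t i = 0; i < keyword_count; ++i)
        if (strcmp(word, keywords[i]) == 0)
            return TOKEN_KEYWORD;
    return TOKEN_INVALID;
}
```

Under this rule "$present" is an identifier in every language pack, while a bare "present" is a keyword in English and an error in Dutch, so translating a program between packs can never turn a variable into a keyword.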

Off-topic:
Although I said it as a joke, I think the idea I mentioned the other day in a different thread, about a program being able to modify the compiler's parsing algorithm from its own code, is fairly interesting. I don't think I've ever seen it implemented, probably because it's a recipe for abuse, but it might be useful as a didactic tool for teaching compiler theory.

I think this, together with the theoretical part, is a big enough project to work on. As stated before, I have not asked my working partner for his opinion (because I don't really want to call him up this late), so it's not certain that this will be the real project. In fact, it's not certain that he'll be my working partner in the first place, because he had to redo his year. I'll have more info on that later this week, but let's discuss this matter as if it were chosen.
So, finally back on topic. :P
I do not seek to release this on a professional level like C++. I seek to create a neat little package for kids to learn with. For this reason I really want to add multi-lingual support. You are indeed right about the underlying conflicts, but I do not consider them such a great danger, given that the average eC program will not be as complex as a C/C++ equivalent would or could be (simplification is, after all, the purpose of eC).

Edit:
Didn't see your post yet, took me quite a while to type my message. :P
That would indeed be a possibility. For readability's sake, I would pick a prefix or suffix for the identifiers (to keep the sentence flow more natural, and thus more readable). And that program sounds interesting: you could make it store its own source code in a string and recursively send it back to the compiler (changing it all the time). Seeing what it comes up with after a few iterations would be... interesting.

Yes, but it does make matters difficult for you. But, if you're willing to come up with solutions to the problems (after all, that is what programmers are supposed to do) then go ahead. It will be nice to see a multilingual programming language. You could call each language a dialect of eC, so you have an English dialect, a Spanish dialect, a Dutch dialect, etc.

I will try to implement it in such a way that languages are just plain text files, so that they can be downloaded separately (English being the default built-in language). Every line of the text file could describe a single keyword (using a strict order for how the keywords are placed inside it), for example:

if
end_if
else
loop
till
with
end_loop

could be a part of the English dialect of eC.
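A minimal sketch of how such a file could be read, assuming its contents are already in memory as newline-separated text and that the line order matches the example above. The enum keyword_id, struct language_pack, and load_language_pack are hypothetical names for this illustration.

```c
#include <string.h>

/* Keyword slots in the fixed order used by the example pack file */
enum keyword_id {
    KW_IF, KW_END_IF, KW_ELSE, KW_LOOP, KW_TILL, KW_WITH, KW_END_LOOP,
    KW_COUNT
};

struct language_pack {
    char words[KW_COUNT][32];   /* words[KW_LOOP] is "loop", "lus", ... */
};

/* Fills pack from newline-separated text, one keyword per line.
   Returns 1 once all KW_COUNT slots are filled, 0 otherwise. */
int load_language_pack(const char *text, struct language_pack *pack)
{
    int slot = 0;

    while (slot < KW_COUNT && *text != '\0') {
        size_t length = strcspn(text, "\n");

        /* Reject empty lines and keywords that do not fit in a slot */
        if (length == 0 || length >= sizeof pack->words[slot])
            return 0;

        memcpy(pack->words[slot], text, length);
        pack->words[slot][length] = '\0';
        ++slot;

        text += length;
        if (*text == '\n')
            ++text;
    }
    return slot == KW_COUNT;
}
```

Because the parser would only ever look up words[KW_LOOP] and never the literal text "loop", swapping in a Dutch file changes the surface syntax without touching the rest of the translator. The strict ordering is fragile, though: a file with a missing line would silently shift every later keyword unless the loader checks the count, as it does here.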

closed account (EzwRko23)

3. Why create a new language, what are the motives to create one and what are the starting points when I create one from scratch (take that literally)?

New general-purpose languages are often created to overcome problems with existing general-purpose languages: to increase programmers' productivity, to reduce the probability of hard-to-find bugs, and to let programmers concentrate on solving the problem instead of on the machine.

1: C was created to overcome the problems with the architecture portability of assembly and also to make some patterns easier to use: loops, branches, conditions, structures etc.

2: C++ and ObjC were created to provide OOP capabilities for C programs, because people thought some time ago that OOP is a better way of programming than structured programming. The key value of OOP is encapsulation of program state, which is much easier to control than global state in structured programs.

3: Java was created to reduce the complexity of C++, especially the complexity of manual memory management. It got rid of undefined behaviour and provided a stronger type system, encapsulation, modularity and security mechanisms suitable for large-scale development. Java also introduced binary portability between platforms and architectures.

4: C# was created as a better Java. They fixed e.g. generic type erasure. They also provided better integration with the Windows desktop and more syntactic sugar. However, C# and Java are very close to each other.

5: Scala was created to bring true support for the functional paradigm to Java. Many consider that paradigm superior for large-scale software development because of the powerful abstractions it gives (better "glue" between various parts of the program). It also provides a better model for generic programming and parallel programming than C#/Java/C++/C.

Some claim this series converges to LISP. So, whatever language you create, LISP had all its features 40 years ago. :P

Where to start?
If you want to do it seriously, learn at least a few languages to get to know their weak and strong points. Then learn some compiler theory and/or try one of the existing parser generators (e.g. ANTLR, which is great for beginners because it even has its own grammar debugger and IDE).

I discussed this topic with my working partner today. He was quite enthusiastic, and we went on about how we would get this thing to work. As we talked, we came up with the following simple notes on what we are going to do or use:

Things that are for sure:
String parser written in C#
eC to C++ (via G++)
Possibility to maintain the outputted C++ code and to change it before full compilation
Automatic library addition based on the used functions at compile time
Data dictionary (auto-fill possibility)
Syntax checking via C# (actual eC code) and the log files that G++ produces
Open source and compatible (Mainly Windows and Linux)
GUI in Qt4, if not possible, we will limit ourselves to Windows
Fundamental types are: long int, double, bool and unsigned char (and Unicode strings)
Support for iostream, fstream, string and possibly sstream
NOT OOP (no custom-classes, structs, unions or enums)

Possible functionality:
Language packages

// The rest is about school-related stuff, which is not really important here

... We will also add easter eggs. :P

Syntax checking via C# (actual eC code) and the log files that G++ produces

I don't think I understand that last part.
Also, syntax checking is done by the parser, so if you already said the parser will be in C#, you don't really have much of a choice.

Fundamental types are: long int, double, bool and unsigned char (and std::string)

May I suggest using Unicode strings instead of single byte strings?

Saying "via C#" is kind of redundant, just like you noted. Also, things might get out of hand after the string parser, and it might not directly produce an error code. We will use G++ as a final compiler, so it's logical that G++ can produce errors too. These must in turn be sent to the eC IDE and displayed there.
Unicode strings are indeed better to use, thanks for the advice!

we will use G++ as a final compiler, so it's logical that G++ can produce errors too.

Only if your compiler is doing a lousy job. Every valid source program should produce valid target code.
You've never seen a C++ compiler throw undecipherable Assembly errors at you, have you?