What does it take to make a language

I have this really cool idea for a programming language, but the very idea of making a compiler for it seems far out of my league. So, I was just wondering what it would take.
You need to build a scanner (lexical analyzer, tokenizer, whatever you want to call it), then a parser (hand coded recursive descent, or use yacc/bison, or boost::spirit -- though a non-trivial parser written in spirit may choke your compiler or computer), then implement the semantic actions. Assuming you meant compiler rather than interpreter, then you need to generate either assembly code that can be run through an assembler or generate the machine code directly. Lastly, you need to build an executable that your operating system can execute, which means implementing at a minimum a library that allows you to make system calls. Then you'll want to build some libraries...

In short, it is very non-trivial. Even using all of the tools available today -- flex and bison for the lexical analysis and parser generator -- it is still a _lot_ of work.
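
To give a feel for the first stage mentioned above, here is a minimal hand-written scanner sketch. The token kinds and the `tokenize` function are made up for illustration and are not what flex would generate; it only recognizes integers, identifiers and single-character symbols, with no error handling.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for a toy language; not taken from any real tool.
enum class TokenKind { Number, Identifier, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits the input into numbers, identifiers and one-character symbols,
// skipping whitespace. Any other non-space character becomes a Symbol token.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        std::size_t start = i;
        if (std::isdigit(c)) {
            // Consume a run of digits.
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else if (std::isalpha(c) || c == '_') {
            // Consume letters, digits and underscores.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            out.push_back({TokenKind::Identifier, src.substr(start, i - start)});
        } else {
            ++i;
            out.push_back({TokenKind::Symbol, src.substr(start, 1)});
        }
    }
    return out;
}
```

Calling `tokenize("x1 = 42+y")` yields five tokens: `x1`, `=`, `42`, `+`, `y`. A real scanner would also track line numbers and handle strings, comments and multi-character operators.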
Ah, well I guess that was to be expected. Maybe I should dedicate this thread to what kind of language I was thinking about?
I suggest looking into flex/bison if you haven't already. I found them to be very useful tools, and I don't know of a better way to implement a parser for a text-based file format. Using a regular expression library such as boost::regex for this will really hurt efficiency (loading huge files will take ages). On the other hand, template-based libraries like boost::spirit are only good for trivial cases.

One thing to consider when writing your own compiler is choosing to generate code in some high-level language rather than assembler/machine code. For example, you could generate C code and then pass the output to a C compiler to make an exe for you.

Another option is to simplify even more and write some kind of preprocessor that runs on top of some compiler (C is a good choice as it is quite easy to parse). This preprocessor would process the input only partially; the rest would remain unchanged. This way you could implement extensions for an existing language (for example, support for other built-in types, such as a better string type for C).
What is your idea?
I myself have wanted to do a similar thing, but due to countless delays in a lot of things (including getting it accepted as a school research project) I eventually dropped it. I looked into a few tools myself and found ANTLR to be a great tool to work with, since you can start making prototypes practically from the start. In my opinion it has a less steep learning curve than tools like bison and yacc. You should still check those out though, as it's entirely your choice.
I encourage you to try and write an interpreter for your language. I'm currently writing one in C for a language I'm creating (mostly a subset of Python with a few changes) and it has been an enlightening experience.

Recommended reading: http://norvig.com/lispy.html , http://norvig.com/lispy2.html
Of course, parsing Lisp is a non-issue since its syntax (or absence thereof) leaves you with the parse trees ready to use, but you can see how the general structure of a simple interpreter works.

And it's worth checking out Stroustrup's calculator from PPP: http://www.stroustrup.com/Programming/calculator08buggy.cpp (it's an intentionally buggy program for an exercise, but you can get an idea of how to parse arithmetic expressions).
Well, that highly depends on the purpose of the language, filipe. If the language is not very complex and is for non-performance-sensitive applications and you don't mind requiring an interpreter to use its applications, then an interpreter is probably a good idea (and it's also easier to implement). If the language is either quite complex or for creating performance-sensitive applications, I'd suggest a compiler.

Examples of not very complex computer languages:
Perl, Python, LISP, Ruby, Mathematica, most shells, PHP, C (this one should be compiled, though).

Examples of complex computer languages:
C++, C++ variants, Objective-C, Objective-C++, Fortran, C#, (I hate to plug my own language, but...) COAL.

Examples of languages I didn't know in which category to put:
Most assemblies (because by all rights, these languages are only syntactically simple), Java, Ada, Scala, COBOL.

This is obviously a simplification of the possible cases, but I think you get the general idea.

COAL? I thought your language was POOL?
Albatross wrote:
Well, that highly depends on the purpose of the language, filipe. If the language is not very complex and is for non-performance-sensitive applications and you don't mind requiring an interpreter to use its applications, then an interpreter is probably a good idea (and it's also easier to implement). If the language is either quite complex or for creating performance-sensitive applications, I'd suggest a compiler.

While I agree with you, I think it might be best to write an interpreter as your first language implementation project because: the experience (and the lexer and the parser) can be used to write a compiler later; an interpreter is a simpler problem; and having an interactive environment to test a language is very useful, particularly when the language is subject to frequent change.

I have never written a compiler and I haven't even finished my interpreter (yet), so those are just impressions.
In my opinion, there has to be a need. When you identify a need which has yet to be fulfilled by some other language, then you can start thinking about implementing it.

I wrote a C++ syntax-highlighting script in Perl once. That's ambitious enough for me.

EDIT: I thought I had a thread about it somewhere. (may have been under an old account). Here it is in action: http://chadmoore.us/source/boost/boost_1_45_0/boost/lexical_cast.hpp
Writing interpreters/compilers is fun, IMO you don't need a need.

But sometimes it may be useful to have your program interpret simple mathematical expressions. Writing a simple parser for those takes a few minutes and can be done using only standard facilities.
(Not a full language, but still quite close.)
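
As a sketch of what such a parser might look like, here is a small recursive descent evaluator that uses only the standard library. The grammar in the comment, the `Calculator` class and the `evaluate` helper are assumptions made for this example, not code from anyone in the thread.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Grammar assumed for this sketch:
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := number | '(' expr ')' | '-' factor
class Calculator {
public:
    explicit Calculator(const std::string& text) : src_(text), pos_(0) {}

    // Parses the whole input and returns its value; throws on syntax errors.
    double parse() {
        double v = expr();
        skipSpaces();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected trailing input");
        return v;
    }

private:
    std::string src_;
    std::size_t pos_;

    void skipSpaces() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    // Consumes c if it is the next non-space character.
    bool accept(char c) {
        skipSpaces();
        if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    double expr() {
        double v = term();
        for (;;) {
            if (accept('+')) v += term();
            else if (accept('-')) v -= term();
            else return v;
        }
    }
    double term() {
        double v = factor();
        for (;;) {
            if (accept('*')) v *= factor();
            else if (accept('/')) v /= factor();
            else return v;
        }
    }
    double factor() {
        if (accept('-')) return -factor();
        if (accept('(')) {
            double v = expr();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return v;
        }
        return number();
    }
    // Reads a run of digits and dots; malformed input such as "1.2.3" is
    // not rejected here, which a real implementation should fix.
    double number() {
        skipSpaces();
        std::size_t start = pos_;
        while (pos_ < src_.size() &&
               (std::isdigit(static_cast<unsigned char>(src_[pos_])) || src_[pos_] == '.')) ++pos_;
        if (start == pos_) throw std::runtime_error("expected a number");
        return std::stod(src_.substr(start, pos_ - start));
    }
};

double evaluate(const std::string& text) { return Calculator(text).parse(); }
```

Each grammar rule becomes one function, and operator precedence falls out of which function calls which: `evaluate("1 + 2 * 3")` gives 7 while `evaluate("(1 + 2) * 3")` gives 9.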
I'm just doing it for fun. I don't expect my little toy language to go anywhere, just to be able to run and write programs in it. And improve it as I like. I also really wanted to know how to write a language implementation.
By the way, a good read is the Dragon book (official title: Compilers: Principles, Techniques, and Tools).
It's a must if you want to know what you are doing.
The way I would describe my concept language is a mix between C++ and Java, with more emphasis on OOP than Java and a few extra features.

First and foremost is the paradigm: Everything save keywords and syntax is an inheritable type, including functions, pointers, references, built-in types, and even the object's type. Some of the built-in types I've thought of so far are:

complex - 9i would return complex(0,9)
long - 8 bytes long
wchar - same as wchar_t
byte - this is one of 3 types that aren't defined in an automatically included file
any - like boost::any, except that its type is the type of the object it's holding (1 of the 3 types not defined)
array - a multi-dimensional array, 1 of the 3... (takes place of new[] and delete[]) - this type is returned by {values,...}
string - an array optimized for characters (returned by "value")
typeid - I wasn't sure what to call it, so I went with the C++ standard. This is an object's type, returned by typeof(). Also, you can use this to cast objects.
function - an array optimized for commands. It can be defined 2 ways:
function<rettype> foo(T param1,U param2,...)={/*code*/}
//or this, for familiarity
rettype foo(T param1,U param2,...){/*code*/}

va_list - like the C++ va_list, except that it stores the type and number of parameters
pointer - an optimized array of bytes that stores an address (same overloads as C++ pointer)
reference - same as pointer, but the syntax is like a C++ reference
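
For comparison, two of the types above have close counterparts in standard C++, which may be handy when prototyping. This is only a sketch of the existing library behaviour (C++14 complex literals and C++17 `std::any`), not of the proposed language; `nineI` and `holdsInt` are names made up for the example.

```cpp
#include <any>
#include <complex>
#include <string>
#include <typeinfo>

// With the standard complex literals in scope, 9i is std::complex<double>(0, 9).
std::complex<double> nineI() {
    using namespace std::complex_literals;
    return 9i;
}

// std::any remembers the type of the value it holds and exposes it via type().
bool holdsInt(const std::any& a) {
    return a.type() == typeid(int);
}
```

Here `holdsInt(std::any(5))` is true and `holdsInt(std::any(std::string("x")))` is false. Note that `std::any` itself stays a distinct type, whereas the `any` described above would take on the type of the object it holds.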

New keywords:
addressof(object) - returns the address of an object (pointer)
typeof(object/class) - returns the type of an object (typeid)
sizeof(object/class) - returns the size of an object (int)
dereference(pointer) - returns the object at the address (this is used in the pointer and reference classes)
execute(function) - runs an array of commands and stops at the end (used by function)
extend - used to extend the definition of a class:
 extend classname{
	type newmember;};

alias - takes the place of typedef
rename - renames a class or object (the old name is no longer valid)
undefine - undefines a class or class member
redefine - more of a shortcut because the same thing can be achieved with undefine and extend, but this redefines a class or class member
lexical_cast - same as boost::lexical_cast
self - self=*this

I don't think I'm going to include goto, register, union, auto, const_cast,

Changes to operators:
No new[] or delete[] (handled by array)
operator. is now overloadable

Changes to syntax:
I don't really like how case, public, protected, and private are used. If I were to make this language, instead of public: it would be public{/*members*/}

include <filename>- the same as #include, except it always appends a newline to the file being copied
link <DLLname> <Library name> - link the file to a DLL

I'm debating whether a macro preprocessor should be included

Also, I was thinking that a repeat(numtimes) control structure would be cool

So, what do you guys think?
with more emphasis on OOP than Java

So basically you have a pure OO language? That's really the only way you can be more OO than Java.

rename - renames a class or object (the old name is no longer valid)

??? This sounds like a great way to write code that is unreadable or screw up other code...

extend - used to extend the definition of a class:

Why not just work like namespaces in C++ where you can just put another one and it automatically attaches? Or classes in Ruby (maybe Python too idr).

execute(function) - runs an array of commands and stops at the end (used by function)

Does this mean it basically runs another interpreter or whatever on some strings? Sounds like you'd need to have a huge check here to prevent abuse like system().

operator. is now overloadable

So what about this:

struct T {
    void f(); //assume defined
};

struct S {
    void f(); //assume defined
    T operator .(); //assume defined
};

int main() {
    S obj;
    obj.f(); //which f() is called?
}
*Edit* Optional Reflection / runtime class creation would be a feature that should be added (imo). I can't delve into it too much though, as it's beyond what I'm capable of.
Ruby ftw, best OO language out there.
1. Yeah, I guess you're right
2. Now that you mention it, rename does sound like a bad idea
3. Like this?
class object{
    int id;
    double position;};

//user wants to extend the class
class object{
    any value;
    int operator+();};

4. execute wouldn't interpret a string, it would execute an array of opcodes, like compiled asm. Maybe I should just make function one of the types not defined in that file?
5. T::f() - this is why I included addressof(). That way if you wanted to access S::f() even if the & operator is overloaded, you could write addressof(obj)->f()
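
To illustrate point 4, here is a minimal sketch of running an array of opcodes. It swaps in an interpreted stack-machine bytecode rather than real machine code, and the opcode set (`Push`, `Add`, `Mul`, `Halt`) is invented for the example since the thread does not define one.

```cpp
#include <cassert>
#include <vector>

// Hypothetical opcodes for this sketch.
enum class Op { Push, Add, Mul, Halt };

struct Instr {
    Op op;
    int arg;  // only used by Push
};

// Runs the instructions in order on a stack machine and stops at Halt or at
// the end of the array, returning the value on top of the stack (0 if empty).
int execute(const std::vector<Instr>& code) {
    std::vector<int> stack;
    for (const Instr& in : code) {
        switch (in.op) {
        case Op::Push:
            stack.push_back(in.arg);
            break;
        case Op::Add:
        case Op::Mul: {
            assert(stack.size() >= 2 && "binary op needs two operands");
            int rhs = stack.back(); stack.pop_back();
            int lhs = stack.back(); stack.pop_back();
            stack.push_back(in.op == Op::Add ? lhs + rhs : lhs * rhs);
            break;
        }
        case Op::Halt:
            return stack.empty() ? 0 : stack.back();
        }
    }
    return stack.empty() ? 0 : stack.back();
}
```

The program `Push 2, Push 3, Add, Push 4, Mul, Halt` computes (2 + 3) * 4. Because the opcodes are data rather than a source string, nothing is parsed at run time, which is the distinction point 4 draws from something like system().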
@Bazzy: thanks. I knew about that book already, and I certainly plan on reading it soon.

@PiMaster: are you familiar with any dynamic languages? You didn't mention that your language is statically typed, and I noticed no influence from them, so I wonder. I think there's a lot one can learn from them even if planning to make a statically typed language.

On a (perhaps not quite) separate note, why remove auto? Type inference can avoid a lot of unnecessary finger typing.
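
A small C++11 sketch of the kind of typing that `auto` saves; the function name `countLongKeys` and the container are made up for the example.

```cpp
#include <map>
#include <string>
#include <vector>

// Counts the keys longer than three characters.
int countLongKeys(const std::map<std::string, std::vector<int>>& index) {
    int n = 0;
    // Without auto the iterator would have to be declared as
    // std::map<std::string, std::vector<int>>::const_iterator.
    for (auto it = index.begin(); it != index.end(); ++it) {
        if (it->first.size() > 3) ++n;
    }
    return n;
}
```

The variable is still statically typed; the compiler just deduces the type from the initializer.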