Parsing simple s-expressions

I'm trying to write a function to convert an s-expression in a string into a simple internal representation. I have a function to evaluate the internal data structures already, but I can't test it because I don't know how to parse the string.

The s-expressions are represented like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
namespace sexp_type {
	/** s-expression types. */
	enum t {
		none,    /**< No type. */
		list,    /**< A list. */
		literal, /**< A literal value (e.g. 5, 'a', "hello"). */
		symbol   /**< A symbol (e.g., +, if, car). */
	};
}
	
/** Accessible sexp_type::t. */
typedef sexp_type::t sexp_type_t;

/** An s-expression. */
struct sexp {
	sexp_type_t type; /**< The type of s-expression. */
	void*       data; /**< The data. */
	size_t    length; /**< The length of the data. */
};

'data' can point to a string (containing a quoted string, number or symbol name) or a list.

So, I need to convert a string like this: "(+ 1 (+ 1 (+ 1 2)))" into a series of nested sexps.

What's the best way of going about this?
Last edited on
It's kind of fun, actually. Give me a day to make something to work with your data type. Assuming I still have power I'll post back.


If I can't, the best way is a recursive function that uses a stack to gather nodes. Once you hit the end of the list, pop stuff off until you have everything consed together properly.

:O)


Oh, BTW, how struct do you want the syntax? |this is a symbol| and (a #|comments here?|# b) ;and here? .

'a' as a literal? or #\a ? (or both?)

() only? [] also? {} even?

#weird(stuff like vectors?)
Er, your sexp type is not optimally formatted for memory management. A few questions:

Do you want to maintain const-ness for the data?
You should use reference counting. Wanna?
Do you want null values to actually use an allocated sexp struct space?
Do you want to use malloc() or new?
(Is this C with namespaces or C++?)
This is for a shell with Scheme-like syntax, so I'm trying to keep it simple. To that effect, it'll be weakly typed (everything will be stored as a string and not converted until the last minute), only () will be used for sexps, every list will start with a symbol (be it a user-defined function, shell built-in or external command). There won't be multi-line comments, but single-line comments will start with #. For the time being, though, I'm going to make expressions fit in one line. Later I'll figure out whether the line is a full expression and if it is, I'll just save it and politely ask the user for the rest.

The language is C++ so I've been using new/delete. I sometimes like to write procedural code in C++ because it has a good balance of efficiency and high-level language and library features that C lacks. I often treat it as an easier-to-use C, only using OOP when I really want to.

Duoas wrote:
your sexp type is not optimally formatted for memory management.
Do you want to maintain const-ness for the data?
Do you want null values to actually use an allocated sexp struct space?

What do you mean by these questions?

You should use reference counting. Wanna?

Why would reference counting be a good idea?

I can post a link to the GitHub repo if you'd like to see it.
Last edited on
Though special syntax is not really necessary in scheme, it is common. An important concept is quoting.

'(if a b c) is the same as (quote (if a b c))

If you are writing a scheme interpreter, you definitely want to handle quoting. Which leads to one of scheme's quirkier syntaxes:

#\a is the character 'a'.

Now, it is pretty easy to write a parser that can tell the difference between 'a and 'a', but people accustomed to scheme will find it confusing to use 'a' instead of #\a . If you must have the c-style character syntax, do it in addition to the usual syntaxes, or make it very clear that this is not scheme. (More on that in a moment.)

This also goes to using # as a comment delimiter. Most scheme interpreters are smart enough to know that the very first line of a file may begin with something like
1
2
#! /usr/env myfoo \
-a -b -c 
and properly ignore it, but since the hash character is otherwise special, the semicolon ';' is used for end of line commentary. Again, it is entirely possible to modify the interpreter to use '#', but at this point a simple heuristic is not enough to make the two uses coexist. Either it is scheme or it is not.


every list will start with a symbol

OK, but impractical, especially as it costs nothing to drop that requirement.


... make expressions fit in one line. Later I'll figure out [how to make progressive input]

You might as well do this right at the start. It will save you grief later.


What do you mean by these questions?

An s-expression in scheme is actually a binary tree. (A list like an array, the kind of thing we C/C++ people are used to, is called a vector in Scheme.) This leads to the most important predicate possible: is a node a pair?

If it is a pair, then there are two pointers in the data, one to a 'car' node and one to a 'cdr' node, either of which may be any valid s-expression node.

If it is not a pair, then the data must be an atomic type, like a symbol or number. In your case, you are trying to split between 'literals' and 'symbols', but if you are using on-demand type conversion, there is no point in distinguishing between the two at the tree level unless you plan to encode a symbol as an index into a lookup table or something behind the scenes (your sexp struct doesn't appear to provide for this).

Hence, the list '(a b c) is actually the nested pairs '(a . (b . (c . ()))) , which is represented in memory as:

╔═══╤═══╗  ╔═══╤═══╗  ╔═══╤═══╗
║   │  ─╫─►║   │  ─╫─►║   │NUL║
╚═╪═╧═══╝  ╚═╪═╧═══╝  ╚═╪═╧═══╝
  │          │          │
  ▼          ▼          ▼
╔═══╗      ╔═══╗      ╔═══╗
║ a ║      ║ b ║      ║ c ║
╚═══╝      ╚═══╝      ╚═══╝

where
╔═══╤═══╗                     ╔═══╗
║   │   ║ is a pair node, and ║   ║ is an atom node
╚═══╧═══╝                     ╚═══╝

That NUL there could actually be represented as pointing to a canonical null node (a singleton); both representations are common.


All nodes and data in an s-expression are typically considered to be immutable. This allows the interpreter to apply some nice reasoning on the language. In terms of memory management, it also allows the interpreter to simply pass references around to data instead of deep copying everything. And once we do that, we need reference counting for proper garbage collection.


You may be reinventing the wheel. You should check out Racket, very nice dialect of Scheme that is both R5RS and R6RS compliant, and makes it very easy to write executables with embedded interpreters in it, defaulting to the dialect you want. The PLT people are also very interested in secure systems, so it is easy to modify the dialect to support your options and forbid dangerous stuff to your users.


I sometimes usually like to write procedural code in C++ because it has a good balance of efficiency and high-level language and library features that C lacks.

As you will, but you will make your task about a billion times easier for yourself if you use C++ stuff here by default. So unless you are writing something for a very restricted embedded device... in which case I again recommend you to Racket.

An s-expression is really only realizable as a procedural object, but you can manage all that with a very simple wrapper class so that your C++ users only see it as a nice, standard C++ class type.

Oh, a reference:
http://racket-lang.org/

Let me know what you want to do.
Oh, lest I be misquoted on the special syntax remark. Yes, scheme typically only works with some special forms, but I was talking about sugar. Google it. [edit]I know chrisname understands sugar. I'm talking to others happening by here.[/edit] Here's some stuff on special forms:

ftp://ftp.cs.utexas.edu/pub/garbage/cs345/schintro-v13/schintro_17.html


However, it is possible to get rid of the supid macro forms altogether. Of course, then it isn't really standard scheme anymore, but it is much cleaner.

http://mainisusuallyafunction.blogspot.com/2012/04/scheme-without-special-forms.html

Enjoy!
Last edited on
It's not Scheme, it's a shell with syntax that somewhat resembles Scheme's (I mostly said Scheme instead of just Lisp in general because Scheme is the one I'm most familiar with, and because it's very minimalist which I like). I'm not advertising it as a Scheme implementation, but as a shell that uses s-expressions. For example,
1
2
(define (script) (basename (car args)))
(echo script ": Hello, world"!)

would be equivalent to
1
2
script=$(basename $0);
echo $script ": Hello, world!"

in bash. For minor things like this it's simpler to use bash but as the complexity of the expressions increase, so does the value of using s-expressions.

Duoas wrote:
you will make your task about a billion times easier for yourself if you use C++ stuff here by default

I am using the STL and so on, I'm just not implementing the sexp as a class. I am implementing the shell's 'context' as a class so it can perform the look-up of shell variables and built-in functions.

Duoas wrote:
You may be reinventing the wheel. You should check out Racket, very nice dialect of Scheme that is both R5RS and R6RS compliant, and makes it very easy to write executables with embedded interpreters in it, defaulting to the dialect you want. The PLT people are also very interested in secure systems, so it is easy to modify the dialect to support your options and forbid dangerous stuff to your users.

That looks like a good idea, actually. I'll look into it.
Last edited on
Hey, I haven't had too much time to play with this, but if you are still interested I can finish something fun for you in a day or two.

How strict do you want to be with syntax? Does (   ) count as a null? or will you only accept () as the empty list?

Because of your requirements, what I have so far accepts progressive input, using a reentrant, non-recursive parsing algorithm.

Mutability is an issue, meaning that memory management is also an important issue. So what I wrote uses your basic structure with an added refcount item:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
	/** An s-expression node. */
	struct sexp_node {
		sexp_type_t type; /**< The type of s-expression. */
		void*       data; /**< The data. */
		size_t    length; /**< The length of the data. */
		size_t  refcount; /**< Memory management. */

	private:
		/** Base Ctor. Private so 'sexp' users can't mess with me. */
		sexp_node( sexp_type_t type, void* data, size_t length ):
		    type( type ), data( data ), length( length ), refcount( 1 )
		    { }

		/** Memory management.
		    The sexp node is reference-counted.
		    @return this
		 */
		sexp_node* inc_refcount() {
			refcount += 1;
			return this;
		}

		/** Memory management.
		    If all references to this sexp node are removed, we need
		    to delete THIS node. All referenced nodes have their
		    reference counts decremented.
		    @return this, which may be NULL
		 */
		sexp_node* dec_refcount();

	friend class sexp;
	};

AND it is all managed with the sexp wrapper class, which does all the proper memory management for you -- leaving you with the flexibility to violate or keep mutability concerns as you need, in addition to providing all the normal list handling primitives.

I would like to have made all those members private, and forced the user to work through the sexp class, but I don't know how much you need access here.

Your structure did leave some interpretation to data storage. Here are the types I inferred:

sexp_type::none
  --> null: data == NULL, length = 0

sexp_type::list
  --> pair: data == sexp_node[ length == 2 ]
  --> vector: data == sexp_node[ length > 0 ]

sexp_type::literal
  --> string: data == char[ length ]
  --> char: string of length 1
  --> numbers, etc: all strings

sexp_type::symbol
  --> symbol: data == char[ length ]

Alas, as I haven't yet the cash to recover all my lost data, I've had to build most of this from scratch and memory, but let me know if you want to see it still.

(Even if you don't, I was having fun playing with it.)
Topic archived. No new replies allowed.