5

I am looking for a way to implement an S-expression reader (to be used later on with both a Scheme interpreter and a compiler), but I've been asking myself how (if at all) I should write an AST for it.

I've been reading SICP, and this is quite straightforward from within Scheme, but I'm looking to implement the interpreter and compiler in C++, in OO fashion.

Please, keep in mind that I'm doing this only for learning purposes, so I'm not really looking for the easiest or quickest way of doing this, but rather the correct and reusable way of doing it.

I've seen in some Scheme implementations that people parse s-expressions and readily output cons cells, something like this:

  struct Sexpr
  {
  };

  struct Cons : public Sexpr
  {
    Sexpr* left;
    Sexpr* right;
  };

  struct IntAtom : Sexpr
  {
    int value;
  };

And one subclass of Sexpr for each kind of Scheme Atom, or something along those lines.

I'm not sure, but this seems like a hack to me... Shouldn't this work be done by an interpreter rather than the reader?

What I want to know is if this is considered the best (or correct) way of reading S-expressions, or is this more a job of the interpreter than the parser? Should the parser have its own AST instead of relying on cons cells?

ivanmp
  • If I'm reading this right, the question doesn't have much to do with parsing. Rather, I think you're asking: what's the most appropriate representation for the s-expression data type. Do you agree? – dyoo Feb 25 '12 at 20:22
  • @dyoo Yes and no. Yes, you're right, I am looking for the most appropriate representation for s-expressions. And no, you're wrong, this question obviously does have to do with parsing. If I were only looking for the most appropriate representation for sexpr, there would be no doubt that it'd be cons cells. However, I'm looking for the most appropriate representation of sexpr specifically **for parsing**. – ivanmp Feb 26 '12 at 12:12
  • cool. Good clarification. Then: one thing that might distinguish the parsing task is the need for source location information. Plain cons cells don't remember where they came from in the original source. During parsing, you might want to support error messages that can point to the source. What other things might we need for parsing? – dyoo Feb 26 '12 at 15:55
  • @dyoo That's not it. What I'm trying to distinguish here is that `cons cells` are a _runtime structure_. It's a semantic part of the language, not a syntactic one. Since I want to have a clean separation of compilation/interpretation phases, I don't want my parser dealing with semantic stuff. – ivanmp Feb 26 '12 at 17:21
  • @dyoo Though I should add that, for Lisps, the line that separates them seems pretty blurred, so I may be pushing it too much. – ivanmp Feb 26 '12 at 17:24

4 Answers

3

If you want to be somewhat complete in your syntax, you will need to support

sexpr ::= atom | sexpr sexpr
atom ::= nil | intatom | etc.

But that is more general than most S-expressions you will encounter. The simplest and most common form in Lisp/Scheme is a list like (a b c d), where each of a, b, c, d is an atom or a list. In pair form this is [a [b [c [d nil]]]], which means the right side of every cons is a list.

So if you are going for clean, you might just do

#include <forward_list>
#include <memory>
class sexpr {};
class atom : public sexpr {};
class s_list : public std::forward_list<std::shared_ptr<sexpr>> {};
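
To make the idea concrete, here is a minimal sketch of a reader that builds a list-based tree of this kind and never mentions cons cells. The names `Node`, `Atom`, `List`, and `read_sexpr` are my own placeholders, not from any library, and I'm assuming atoms are kept as raw token text and classified by a later phase. Quoting, strings, comments, and dotted pairs are left out.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical syntax-tree node types: the reader knows nothing about cons cells.
struct Node {
    virtual ~Node() = default;
};

struct Atom : Node {
    std::string text;  // raw token text; classifying it is left to a later phase
    explicit Atom(std::string t) : text(std::move(t)) {}
};

struct List : Node {
    std::vector<std::unique_ptr<Node>> items;
};

// Reads one expression from `src` starting at `pos`, advancing `pos` past it.
std::unique_ptr<Node> read_sexpr(const std::string& src, std::size_t& pos) {
    auto is_space = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };

    while (pos < src.size() && is_space(src[pos])) ++pos;
    if (pos >= src.size()) throw std::runtime_error("unexpected end of input");
    if (src[pos] == ')') throw std::runtime_error("unexpected ')'");

    if (src[pos] == '(') {
        ++pos;  // consume '('
        auto list = std::make_unique<List>();
        for (;;) {
            while (pos < src.size() && is_space(src[pos])) ++pos;
            if (pos >= src.size()) throw std::runtime_error("missing ')'");
            if (src[pos] == ')') {
                ++pos;  // consume ')'
                return list;
            }
            list->items.push_back(read_sexpr(src, pos));
        }
    }

    // Anything else is an atom: consume up to whitespace or a parenthesis.
    std::size_t start = pos;
    while (pos < src.size() && !is_space(src[pos]) && src[pos] != '(' && src[pos] != ')') ++pos;
    return std::make_unique<Atom>(src.substr(start, pos - start));
}
```

With this shape, turning a `List` into a chain of conses (if the interpreter wants that) becomes a separate, small conversion pass rather than part of the reader.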
Alan Baljeu
  • No problem! That is something along the lines of what I had in mind, but I have two questions: 1) does it mean that I should not worry about `cons cells` at this point (parser) and leave it as a problem to the interpreter, and use this kind of AST as you've shown instead? 2) I may have gotten something wrong, but shouldn't s_list inherit from Sexpr? – ivanmp Feb 24 '12 at 18:12
3

While one can probably argue back and forth over what the “correct” approach is, in my opinion, the approach you suggest—using the same data structures for reading, compilation, evaluation, and processing—is the one that will teach you the most about what Lisp and the “code is data” mantra are about, and in particular, what the quote operator actually means (which is quite a profound thing).

It is also, incidentally, the way most Lisps (interestingly, not including Scheme) traditionally work.

So yes, have the reader generate Lisp data: conses, symbols, Lisp numbers, strings, et cetera, the exact same stuff the user-level Lisp code will deal with. It will make the rest of the implementation both simpler and more instructive.
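
As a rough illustration of what "the reader generates Lisp data" could look like in C++, here is a small sketch. All names (`Value`, `Int`, `Symbol`, `Cons`, `read_datum`, `read_tail`) are hypothetical, and I'm assuming that a null pointer stands for nil and that a token made only of digits is an integer. Strings, negative numbers, dotted pairs, and the `'` shorthand are omitted.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical runtime value types, shared by the reader and the evaluator.
struct Value {
    virtual ~Value() = default;
};
using ValuePtr = std::shared_ptr<Value>;  // assumption: nullptr stands for nil

struct Int : Value {
    long value;
    explicit Int(long v) : value(v) {}
};
struct Symbol : Value {
    std::string name;
    explicit Symbol(std::string n) : name(std::move(n)) {}
};
struct Cons : Value {
    ValuePtr car, cdr;
    Cons(ValuePtr a, ValuePtr d) : car(std::move(a)), cdr(std::move(d)) {}
};

void skip_spaces(const std::string& s, std::size_t& pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
}

ValuePtr read_datum(const std::string& s, std::size_t& pos);

// Reads the rest of a list after '(' and builds the cons chain recursively.
ValuePtr read_tail(const std::string& s, std::size_t& pos) {
    skip_spaces(s, pos);
    if (pos >= s.size()) throw std::runtime_error("missing ')'");
    if (s[pos] == ')') {
        ++pos;
        return nullptr;  // end of list: nil
    }
    ValuePtr head = read_datum(s, pos);  // read the element first...
    ValuePtr rest = read_tail(s, pos);   // ...then everything after it
    return std::make_shared<Cons>(head, rest);
}

// Reads one datum starting at `pos`, advancing `pos` past it.
ValuePtr read_datum(const std::string& s, std::size_t& pos) {
    skip_spaces(s, pos);
    if (pos >= s.size()) throw std::runtime_error("unexpected end of input");
    if (s[pos] == ')') throw std::runtime_error("unexpected ')'");
    if (s[pos] == '(') {
        ++pos;
        return read_tail(s, pos);
    }

    std::size_t start = pos;
    while (pos < s.size() && !std::isspace(static_cast<unsigned char>(s[pos])) &&
           s[pos] != '(' && s[pos] != ')') ++pos;
    std::string token = s.substr(start, pos - start);

    // A token made only of digits becomes an Int; everything else is a Symbol.
    bool all_digits = true;
    for (char c : token) {
        if (!std::isdigit(static_cast<unsigned char>(c))) all_digits = false;
    }
    if (all_digits) return std::make_shared<Int>(std::stol(token));
    return std::make_shared<Symbol>(token);
}
```

Because the result is ordinary Lisp data, an evaluator built on these types can implement `quote` by handing back the datum it was given, with no conversion step in between.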

Matthias Benkard
  • Thank you for your answer! My primary goal is to learn the most about compilers and interpreters in general, not necessarily Lisps, that's why I'm wondering about the most correct approach to this. I chose Scheme because I thought it was a rather simple language to start with (at least some subset of it), I've worked with it and, last but not least, I like it too. :) What do you mean by your last line? – ivanmp Feb 24 '12 at 18:21
  • @ivanmp I meant that if you're trying to learn about Lisp, it's more *instructive* to do it this way. Yes, *instructive* is the word I was looking for. :) The implementation will also be simpler because you don't have to deal with different data structures for different phases, and some data can simply be “handed through” the phases without having to convert them into anything else (as in the case of `quote`). On the other hand, if you're trying to learn about compilation in general, not Lisp in particular, either way is probably fine. – Matthias Benkard Feb 24 '12 at 18:27
  • I meant this line: _It is also, incidentally, the way most Lisps (**interestingly, not including Scheme**) traditionally work._ – ivanmp Feb 24 '12 at 18:36
  • @ivanmp Ah, yes, that was indeed the last line before the edit. :) Yes, I think it's interesting to note that while most Lisps' semantics are defined on *data* (i.e., conses and atoms) with the reading step being completely separate from the compilation, Scheme's semantics is defined on the textual representation of the code. (See the R5RS, in particular the [“Syntax” section](http://www.schemers.org/Documents/Standards/R5RS/HTML/r5rs-Z-H-4.html#%_sec_1.2), which clearly distinguishes between syntax of code and syntax of data.) – Matthias Benkard Feb 25 '12 at 12:58
  • @ivanmp To illustrate the difference, contrast what Scheme understands as “code” with the [meaning of the term in Common Lisp](https://matthias.benkard.de/journal/107). – Matthias Benkard Feb 25 '12 at 13:03
3

To follow up from the Scheme/Racket side of the fence:

Racket (and some other Scheme implementations) use a richer representation for syntax objects, so that they can have properties attached to them indicating (in Racket, at least) what context they're bound in, what source location they come from, what pass of the compiler inserted them, and any other information you might want to store (cf. "syntax properties" in Racket).

This additional information enables things like error messages with pointers to source, and hygienic macros.
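
For a C++ implementation, a very rough sketch of such an annotated node might look like the following. This is not Racket's actual representation; `SrcLoc`, `Syntax`, and `error_at` are names I made up for illustration, and the string-to-string property map is just one simple assumption about how extra annotations could be stored.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical source position recorded by the reader.
struct SrcLoc {
    int line = 0;
    int column = 0;
};

// Hypothetical syntax object: a datum plus the metadata attached to it.
struct Syntax {
    bool is_list = false;
    std::string atom;                               // used when the node is an atom
    std::vector<std::shared_ptr<Syntax>> children;  // used when the node is a list
    SrcLoc loc;                                     // where the node started in the source
    std::map<std::string, std::string> properties;  // free-form extra annotations
};

// Formats an error message that points back at the source position.
std::string error_at(const Syntax& stx, const std::string& message) {
    return std::to_string(stx.loc.line) + ":" + std::to_string(stx.loc.column) + ": " + message;
}
```

A later pass can then report problems in terms of the original text, for example by calling `error_at(node, "unbound identifier")`.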

Please note that I mean "richer" here simply in the "contains more values" sense, not in any non-value-neutral way.

I should also add---before falling into the Turing Tar Pit---that you can also represent this exact same information using a table on the side; assuming you have pointer comparisons, there's no expressiveness difference between putting a value inside a structure and using a table to associate the structure with the value.
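
A small sketch of the side-table variant, assuming nodes are compared by address: `Datum`, `SrcLoc`, and `LocationTable` are hypothetical names, and the table is keyed on the node pointer so that two structurally equal data still get separate entries.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical plain datum with no room for metadata.
struct Datum {
    std::string text;
};

struct SrcLoc {
    int line;
    int column;
};

// Side table keyed by node address: identity (pointer comparison),
// not structural equality, decides which entry belongs to which node.
class LocationTable {
public:
    void record(const Datum* node, SrcLoc loc) { table_[node] = loc; }

    // Returns nullptr when nothing was recorded for this node.
    const SrcLoc* lookup(const Datum* node) const {
        auto it = table_.find(node);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<const Datum*, SrcLoc> table_;
};
```

One practical caveat with this sketch: entries have to be removed (or the table discarded) when nodes are freed, otherwise a reused address could pick up stale metadata.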

John Clements
1

You might want to take a look at this C/C++ s-expr parser library for an example of how it has been done.

It looks like the base representation is:

struct elt {
  int type;
  char *val;
  struct elt *list; 
  struct elt *next;
};

And I quote from their docs:

Since an element can be either a list or atom, the element structure has a type indicator that can be either LIST or VALUE. If the type indicator is LIST, then the structure member "list" will be a pointer to the head of the list represented by this element. If the type indicator is VALUE, then the structure member "val" will contain the atom represented by the element as a string. In both cases, the "next" pointer will point at the next element of the current s-expression.
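
To show how the `list` and `next` pointers fit together, here is a small sketch that walks such a structure and renders it back as text. The struct mirrors the one quoted above; the constants `ELT_VALUE` and `ELT_LIST` and the function `render` are my own placeholders, and their numeric values are assumptions rather than the library's definitions.

```cpp
#include <string>

// Assumed stand-ins for the VALUE and LIST type indicators described in the docs.
enum { ELT_VALUE = 0, ELT_LIST = 1 };

struct elt {
    int type;
    char* val;   // atom text when type is ELT_VALUE
    elt* list;   // head of the child list when type is ELT_LIST
    elt* next;   // next sibling in the enclosing s-expression
};

// Renders `e` and all of its following siblings, separated by spaces.
std::string render(const elt* e) {
    std::string out;
    for (; e != nullptr; e = e->next) {
        if (!out.empty()) out += ' ';
        if (e->type == ELT_LIST) {
            out += "(" + render(e->list) + ")";  // descend into the child list
        } else {
            out += e->val;                       // atoms are stored as strings
        }
    }
    return out;
}
```

Note that this layout is a first-child/next-sibling tree rather than a chain of cons cells: siblings are linked through `next`, and only list elements use `list`.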

Additionally here is a whole list of other implementations of s-expr readers in lots of languages that may be of interest.

zippy