Compiler construction: Handle references to unordered symbols

Question

I've got the Dragon Book, but it doesn't seem to cover this topic...

In most modern languages it's possible to use certain variables even if their declaration appears later in the code than their use.

Example

class Foo {
    void bar() {
        plonk = 42;
    }
    int plonk;
}

It doesn't matter that the variable plonk is declared after the function.

Question
Is there a best practice or useful pattern for implementing this? Two approaches came to my mind:

1. While parsing, add dummy symbols for symbols that haven't been seen yet. When the declaration is parsed, those dummies are replaced by the real symbols. After parsing, check whether any dummies are left and, if so, report an error.
2. Don't do any symbol handling while parsing; only build the AST. After parsing, walk the AST and add symbols depending on the node type. For a class node, for example, add the symbols of all children first and process the children afterwards. For a statement block, step through the children and add each symbol immediately before that child is processed.
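As a rough illustration of approach 1, here is a minimal Python sketch. It assumes a single flat scope, and the names `Symbol`, `FlatSymbolTable`, `reference`, `declare` and `unresolved` are made up for this example rather than taken from any real compiler.

```python
class Symbol:
    """A symbol table entry; `declared` stays False while it is only a placeholder."""
    def __init__(self, name):
        self.name = name
        self.declared = False
        self.type = None


class FlatSymbolTable:
    """Hypothetical single-scope table that back-patches placeholder entries."""
    def __init__(self):
        self.symbols = {}

    def reference(self, name):
        # Use before declaration: hand out a placeholder that may be completed later.
        return self.symbols.setdefault(name, Symbol(name))

    def declare(self, name, type_):
        # Complete an existing placeholder, or create the entry if it is new.
        sym = self.symbols.setdefault(name, Symbol(name))
        if sym.declared:
            raise NameError(f"duplicate declaration of {name!r}")
        sym.declared = True
        sym.type = type_
        return sym

    def unresolved(self):
        # Placeholders that are still undeclared after parsing are errors.
        return sorted(n for n, s in self.symbols.items() if not s.declared)


table = FlatSymbolTable()
use = table.reference("plonk")         # plonk = 42; seen inside bar()
table.reference("missing")             # used, but never declared anywhere
decl = table.declare("plonk", "int")   # int plonk; seen later

print(use is decl)         # the earlier use now points at the real symbol
print(table.unresolved())  # names to report as undeclared
```

Because the placeholder object is completed in place, every earlier reference sees the declaration without being revisited. The sketch deliberately ignores nested scopes.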

I would expect approach 1 to be easier and also more useful for things like "importing other compilation units".

Edit:
A problem I see with approach 1 is that it needs some kind of handling for ordered symbols. E.g. within a function it's not possible to use a local symbol before it is declared.

Build an AST, that's much cleaner and gives you a lot more flexibility. — , Jul 20 '13 at 22:00
@H2CO3: But that would make it necessary that a compilation unit which is imported is already compiled - so i know all symbols right? — Daniel, Jul 20 '13 at 22:05
I don't exactly follow. You don't need compilation - you only need the declarations. Compilation is not linkage! — , Jul 20 '13 at 22:16
Okay, my fault - replace compilation with parsing. Those errors are (as far as I know) normally not handled by the linker but by the compiler (libraries excluded). — Daniel, Jul 20 '13 at 22:20
Ah, I see what you mean! Yes, if you build an AST, you have to parse the entire compilation unit. And no, most compilers don't work in one pass (there are a few exceptions, e. g. Lua's compiler). They do the entire parsing, where only syntax errors are checked, then the next phase is walking the AST, and checking for more semantics-related errors (for example, undeclared variables). — , Jul 20 '13 at 22:22
Hm yes makes kind of sense. Do you know some (toy-) compiler where this strategy is implemented? There are still a lot of open questions for me. — Daniel, Jul 20 '13 at 22:33
Huh, well, not off the top of my head. However, I am currently in the process of developing a simple scripting language. The source code will be on my GitHub shortly, you might find it interesting. — , Jul 20 '13 at 22:35
Okay - it would be nice if you could post the link here once it's ready - could be useful for others too :) — Daniel, Jul 20 '13 at 22:41

rici · Accepted Answer · 2013-07-20T22:47:25.757

If you can, just build the AST and the symbol table during the parse. Then make a pass over the AST to associate symbols with symbol table entries. That's essentially your strategy #2.

The problem with strategy #1, in the general case, is that you don't necessarily know that two instances of the same name are bound to the same symbol until you have seen all the declarations. Consider, for example, a language like JavaScript, in which the binding domain for a symbol is a function block (a mistake IMHO, but tastes vary) but symbols do not need to be declared before use. In this case, we'll only consider symbols which name functions.

Pseudocode (legal JavaScript, as it turns out):

function outer() {
  return foo();

  function inner() {
    return foo();

    function foo() {
      return "inner's foo";
    }
  }

  function foo() {
    return "outer's foo";
  }
}

The two uses of foo refer to different symbols, something you can't know until you reach the last definition of foo.

The problem with strategy #2 is that it is not always possible to build an AST without knowing something about the symbols being used. For example, in C you can't really parse an expression like (x)(y) without knowing whether x is a type name or something which can be dereferenced into a function. (Also a mistake, IMHO, but who am I?) In C++, you also need to know whether a given symbol is a template or not. Often, this is described as the "kind" of a symbol, as opposed to its "type". In C++, you don't need to know what the "type" of x is to parse (x)(y); you just need to know whether or not it has one. For this reason, C++ allows use of certain symbols before declaration, but not if the declaration is a typedef.
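To make the (x)(y) ambiguity concrete, here is a small Python sketch. It is not a C parser. The function `parse_paren_pair` and its `typenames` parameter are hypothetical, and it only recognizes the exact shape `(identifier)(identifier)`. The point is that the same token sequence yields a different tree depending on what the symbol table says about x.

```python
import re


def parse_paren_pair(source, typenames):
    """Classify `(x)(y)` as a cast or a call, given the set of known type names."""
    match = re.fullmatch(r"\s*\(\s*(\w+)\s*\)\s*\(\s*(\w+)\s*\)\s*", source)
    if match is None:
        raise SyntaxError(f"expected (x)(y), got {source!r}")
    head, operand = match.groups()
    if head in typenames:
        return ("cast", head, operand)   # (T)(y): convert y to type T
    return ("call", head, operand)       # (f)(y): call f with argument y


# Same text, two different trees, depending only on the kind of `x`.
print(parse_paren_pair("(x)(y)", typenames={"x"}))
print(parse_paren_pair("(x)(y)", typenames=set()))
```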

Leaving pathological cases and macro processors aside, it is usually possible to define scopes during the parse, and attach each declaration to a scope. Normally scopes nest in a fairly simple manner, so once you've built the scope tree you can look up any symbol given the current scope node, just by walking up the tree until the symbol is found.
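The scope tree and the walk-up lookup can be sketched roughly as follows in Python. The `Scope` class and the `uses` list are assumptions made for illustration; the scopes are built by hand here to mirror the outer/inner example above, whereas a real parser would create them as it enters each function.

```python
class Scope:
    """One node of the scope tree; `parent` is None for the outermost scope."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbols = {}

    def declare(self, symbol):
        # Store a qualified name so it is visible which declaration was chosen.
        self.symbols[symbol] = f"{self.name}.{symbol}"

    def lookup(self, symbol):
        # Walk up the tree until some enclosing scope declares the symbol.
        scope = self
        while scope is not None:
            if symbol in scope.symbols:
                return scope.symbols[symbol]
            scope = scope.parent
        raise NameError(f"undeclared symbol {symbol!r}")


# Pass 1 (during the parse): build scopes, record declarations and pending uses.
globals_ = Scope("global")
globals_.declare("outer")
outer = Scope("outer", globals_)
uses = [(outer, "foo")]          # return foo(); inside outer
outer.declare("inner")
inner = Scope("inner", outer)
uses.append((inner, "foo"))      # return foo(); inside inner
inner.declare("foo")             # declared after its use
outer.declare("foo")             # declared after both uses

# Pass 2 (after the parse): every declaration is known, so resolve the uses.
resolved = [scope.lookup(name) for scope, name in uses]
print(resolved)
```

Deferring the lookups to the second pass is what lets the two uses of foo bind to different declarations, even though both declarations appear after the uses.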

In some languages (like Python), declarations are optional and implicit; in such a case, you can attach a new definition to the current scope in a second pass if the symbol is not found.
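A minimal sketch of that fallback, again with a hypothetical `Scope` class. It follows the rule exactly as stated here (define in the current scope only when the lookup fails), which is a simplification and not Python's actual binding rules.

```python
class Scope:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbols = set()

    def find(self, symbol):
        """Return the nearest enclosing scope that defines `symbol`, or None."""
        scope = self
        while scope is not None:
            if symbol in scope.symbols:
                return scope
            scope = scope.parent
        return None

    def resolve_or_define(self, symbol):
        """Second-pass lookup: fall back to an implicit definition in this scope."""
        owner = self.find(symbol)
        if owner is None:
            self.symbols.add(symbol)
            owner = self
        return owner


module = Scope("module")
module.symbols.add("count")          # explicitly defined at module level
func = Scope("func", module)

print(func.resolve_or_define("count").name)   # found in the enclosing scope
print(func.resolve_or_define("total").name)   # not found: defined implicitly here
```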

Interesting example. There shouldn't be any completely different semantics based on the symbol types (at least I'm trying to design the language that way). Do you have any compilers in mind which implement that strategy? — Daniel, Jul 20 '13 at 22:45
@Daniel: most toy languages, and many real ones, require declaration before use. Of those which don't, many (like some JavaScript implementations) do lookup on use, which is really inefficient. Compilers which allow declaration after use tend to be fairly complicated. However, any good compiler tutorial should explain how to build a symbol table, and the algorithm I suggest in my answer is pretty straightforward. Good luck. — rici, Jul 20 '13 at 22:59
