
I've been studying the grammar and AST nodes of various languages with AST Explorer. With Python, I've noticed that some form of semantic analysis takes place during the parsing process. For example:

x = 2
x = 2

Yields the following AST, consisting of a VariableDeclaration node and an ExpressionStatement node.

[Screenshot: AST Explorer output showing a VariableDeclaration node followed by an ExpressionStatement node]

So when the first x = 2 line is parsed, the parser checks a symbol table for the existence of x, registers it, and produces a VariableDeclaration node. Then, when the second x = 2 line is parsed, it finds that x is already defined and produces an ExpressionStatement node.
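To make this concrete, here is a hypothetical sketch of that heuristic; symbol_table, classify_assignment, and the node names are illustrative only, not the actual parser's internals:

    # Hypothetical sketch of the first-use-is-a-declaration heuristic
    # described above; nothing here reflects the real parser's API.
    symbol_table = set()

    def classify_assignment(name):
        if name not in symbol_table:
            symbol_table.add(name)        # first sighting: register the name
            return "VariableDeclaration"
        return "ExpressionStatement"      # already known: a plain assignment

    print(classify_assignment("x"))  # VariableDeclaration
    print(classify_assignment("x"))  # ExpressionStatement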

However, when I parse the following semantically incorrect code:

2 + "string"

The parser accepts the code and produces an ExpressionStatement node, even though it is semantically incorrect (int + string); it rightly produces an error when I attempt to execute it with a Python interpreter.

This suggests to me that semantic analysis takes place twice: once during the parsing process and once again while traversing the complete AST. Is this assumption correct? If so, why is this the case? Wouldn't it be simpler to do the entire semantic analysis phase during parsing instead of splitting it up?

Tom

1 Answer


The semantic error in the statement 2 + "string" is not detected in any semantic pass whatsoever. It is a runtime error, reported only when you attempt to execute the statement. If the statement is never executed, no error is reported, as you can see by executing the following script:

    if False:
        2 + "string"
    print("All good!")

Resolving the first use of a global variable as a declaration is more of an optimisation than anything else, and it is common for compilers to execute multiple optimisation passes.

There is always a temptation to try to combine these multiple passes, but it is a false economy: walking an AST is relatively low overhead, and code is much clearer and more maintainable when it only attempts to do one thing. Intertwining two unrelated optimisation heuristics is poor design, just as intertwining any set of unrelated procedures is.
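As a rough sketch of what single-purpose passes look like (using Python's own ast module for illustration; note that it represents both assignments as plain Assign nodes, unlike the parser in the question), each walker below does exactly one job:

    import ast

    class AssignedNames(ast.NodeVisitor):
        """Pass 1: record every name that is ever assigned to."""
        def __init__(self):
            self.names = set()
        def visit_Name(self, node):
            if isinstance(node.ctx, ast.Store):
                self.names.add(node.id)

    class ConstantAddChecker(ast.NodeVisitor):
        """Pass 2: flag literal int + str additions, which must fail at runtime."""
        def visit_BinOp(self, node):
            if (isinstance(node.op, ast.Add)
                    and isinstance(node.left, ast.Constant)
                    and isinstance(node.right, ast.Constant)
                    and isinstance(node.left.value, int)
                    and isinstance(node.right.value, str)):
                print(f"line {node.lineno}: int + str will fail at runtime")
            self.generic_visit(node)

    tree = ast.parse('x = 2\nx = 2\n2 + "string"')
    names_pass = AssignedNames()
    names_pass.visit(tree)
    print("assigned names:", names_pass.names)  # {'x'}
    ConstantAddChecker().visit(tree)            # flags line 3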

rici
  • Thanks for the response rici. So from what I understand, the bare minimum of semantic analysis should occur in the parser to determine the final AST, e.g. whether x=2 should be an ExpressionStatement or a VariableDeclaration node. Following the parsing stage, the AST should then be walked and further semantic checks should take place to determine things like whether 2 + "string" is valid? – Tom Feb 18 '21 at 19:58
  • And yes, Python was perhaps a bad example since its semantic checks occur at runtime. But, for example, the Java parser also allows 2 + "string" to be valid. – Tom Feb 18 '21 at 19:59
  • @tom: there are languages where runtime values have type information (like Python and JavaScript), making runtime checks possible (or even necessary). And there are languages (like C and C++) where type information is strictly compile-time and so the compiler must fully deduce the type of every expression. And there are languages like Java with a bit of each. It's very hard to compare these models because they do things so differently. Moreover, "semantic information" covers a lot of ground. Should a compiler attempt to catch division by zero in cases where it could? Some do, many don't. – rici Feb 18 '21 at 20:26
  • For me, variable scoping is syntactic, not semantic, because nothing that happens at runtime could change the scope or binding of a name (except in languages like Common Lisp and Perl, which have dynamic scope. Yuk.) Type analysis is in the same general area, except for languages with duck-typing, so the syntax/semantics boundary is more fluid. I don't think the discussion is very fruitful. Compilers can figure out some stuff, and to the extent that the analysis can save runtime, it's probably worth trying to do. How it's done is an internal design decision. – rici Feb 18 '21 at 20:32
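As a small illustration of the runtime-type-information point raised in these comments (plain Python, nothing specific to any particular parser):

    # Python values carry their types at runtime, which is what makes
    # the late check in 2 + "string" possible (and necessary) at all.
    x = 2
    print(type(x))        # <class 'int'>
    x = "now a string"    # the same name can be rebound to a new runtime type
    print(type(x))        # <class 'str'>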