Standard format for concrete and abstract syntax trees

Question

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.

Is there a standard format for describing either concrete or abstract syntax trees? Do any widely-used tool chains support outputting to these formats?

I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, JavaScript, or C/C++.

I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.

Thanks!

The ANTLR answer from @vs is compelling, but a standard format which skips the code generation complexity might be preferable. I'll wait a day or so before marking the answer. — Brandon Bloom, Feb 17 '09 at 10:35

Ira Baxter · Answer 1 · 2012-04-15T15:47:24.803

The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another. See http://www.gupro.de/GXL

More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees. See http://www.omg.org/spec/ASTM/1.0/Beta1/

This problem seems to get solved over and over again. There's half a dozen "tool bus" proposals made over the years that all solved it, with none ever taking over the industry. The problem is that a) it is easy to represent ASTs using any kind of nestable notation [parentheses like LISP, tags like XML, ...], so people roll their own solution easily, and b) for one tool to exchange an AST with another, they both have to agree essentially on what the AST nodes mean; but most ASTs are rather accidentally derived from the particular grammar/parsing technology used by each tool, and there's almost always disagreement about that between tools. So, I've seen very few tools that exchange ASTs meaningfully.

If you're doing a hobby thing, I'd stick with a Lisp-like encoding of trees, where each node has the following format: ( ... ) It's easy to generate and easy to read.
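As a rough illustration of such a Lisp-like encoding, here is a minimal sketch that dumps a Python AST as nested parentheses, using only the standard `ast` module. The helper name `to_sexpr` and the choice to put the node type first, followed by its children, are assumptions made for this example and not a format prescribed by the answer.

```python
import ast

def to_sexpr(node):
    """Render a node as a Lisp-like string: (NodeType child child ...).

    Simplification for this sketch: fields whose value is None are skipped,
    and list-valued fields are flattened into the parent's child list.
    """
    if isinstance(node, ast.AST):
        parts = [type(node).__name__]
        for _field_name, value in ast.iter_fields(node):
            if isinstance(value, list):
                parts.extend(to_sexpr(item) for item in value)
            elif value is not None:
                parts.append(to_sexpr(value))
        return "(" + " ".join(parts) + ")"
    # Leaves (identifiers, numbers, strings) are printed with repr().
    return repr(node)

if __name__ == "__main__":
    print(to_sexpr(ast.parse("x = 1 + 2")))
```

Because the output is just nested parentheses, it can be read back with a few lines of recursive-descent code, which is part of why this encoding is convenient for a prototype.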

I work on a professional tool to manipulate programs. If we have to print out the AST, we do the above. Mostly individual ASTs are far too complicated to look at in practice, so we hardly ever print out the entire AST, at best only a node and a few children deep. Our tool doesn't exchange ASTs with anybody (see above reasons :) but does just fine building it in memory, doing whizzy things with it for analysis reasons or transformation reasons, and then either just deleting it (no need to send it anywhere) or regenerating the original language text from the tree. [The latter means you need anti-parsing or "prettyprinting" technology]
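The build-in-memory, transform, regenerate-text cycle described above can be tried at hobby scale with Python's standard `ast` module. This is only a sketch and not how the author's tool works: the `RenameVariable` class and `rename` function are hypothetical names for this example, and `ast.unparse` requires Python 3.9 or later. Note that `ast.unparse` does not preserve comments or the original formatting, which is exactly the gap a real prettyprinter has to close.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rewrite every Name node called `old` so that it is called `new`."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Name(self, node):
        if node.id == self.old:
            replacement = ast.Name(id=self.new, ctx=node.ctx)
            return ast.copy_location(replacement, node)
        return node

def rename(source, old, new):
    tree = ast.parse(source)                     # text -> tree
    tree = RenameVariable(old, new).visit(tree)  # transform in memory
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)                     # tree -> text

if __name__ == "__main__":
    print(rename("total = price * qty", "qty", "quantity"))
```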

“This problem seems to get solved over and over again. There's half a dozen "tool bus" proposals”: what's your opinion on the OMG's ASTM in particular? Side note: the ASTM is not a proposal anymore, it's now a spec. See http://www.omg.org/spec/ASTM/ — Hibou57, Jul 16 '14 at 11:40
Yes, I saw the ASTM idea as it started into development as a standard back in 2005. They tried to define just "general abstract" syntax trees (GASTM) with abstract operators like "ADD", etc. but you soon discover that what "ADD" means in Fortran isn't the same as "ADD" in Java (can handle strings) or ADD in APL/J (generalized addition of matrices of dimension M to matrices of dimension N). So how on earth do you write a general analyzer? ... — Ira Baxter, Jul 16 '14 at 12:41
But like everybody else (the tool bus folks), they discovered (one more time) that they needed syntax trees that matched what specific parsers did ("SASTM"), because no parser produces a GASTM directly and translating between the specific syntax tree (SASTM) and the GASTM is just too hard. What I know is that I have tools that process some 40 languages including parsing, prettyprinting and transformation, including C++11, and ASTM is still not being used for very much that I can see. Can you name any tools or products based on it? — Ira Baxter, Jul 16 '14 at 12:42
“Can you name any tools or products based on it?”: I don't know of any, and I did see criticisms similar to yours on lambda‑the‑ultimate. Popularity is one thing, the reason why is another (many not-very-popular things are mostly good). As for abstract operations, maybe that's drifting from abstract syntax tree to abstract semantic graph. Thanks for your comments. — Hibou57, Jul 16 '14 at 18:53

score 3 · Accepted Answer · answered Feb 17 '09 at 09:49

In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.
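Keeping token information alongside the tree is also one way to get links between the concrete and abstract views. The sketch below is not ANTLR; it uses Python's standard `tokenize` and `ast` modules to show the idea. The function name `link_tokens_to_nodes` is hypothetical, and the code assumes ASCII source, because `ast` reports column offsets in UTF-8 bytes while `tokenize` reports them in characters.

```python
import ast
import io
import tokenize

SIGNIFICANT = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING, tokenize.OP}

def link_tokens_to_nodes(source):
    """Map each significant token to the innermost AST node whose span
    contains it, and each such node back to its list of tokens."""
    tree = ast.parse(source)
    # Only nodes that carry position information can be linked.
    nodes = [n for n in ast.walk(tree)
             if getattr(n, "end_lineno", None) is not None]
    token_to_node = {}
    node_to_tokens = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in SIGNIFICANT:
            continue
        enclosing = [n for n in nodes
                     if (n.lineno, n.col_offset) <= tok.start
                     and tok.end <= (n.end_lineno, n.end_col_offset)]
        if not enclosing:
            continue
        # Enclosing spans are nested, so the innermost one starts latest
        # and, among those, ends earliest.
        innermost = max(enclosing, key=lambda n: (
            (n.lineno, n.col_offset), (-n.end_lineno, -n.end_col_offset)))
        token_to_node[tok] = innermost
        node_to_tokens.setdefault(innermost, []).append(tok)
    return tree, token_to_node, node_to_tokens

if __name__ == "__main__":
    _tree, token_to_node, _node_to_tokens = link_tokens_to_nodes("x = 1 + 2")
    for tok, node in token_to_node.items():
        print(repr(tok.string), "->", type(node).__name__)
```

The linear scan over all nodes for every token is fine for a prototype; a larger tool would more likely record token ranges on nodes while parsing, which is roughly what retaining the parser's token stream gives you.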

This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.

I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.

You might also want to look at ANTLR's tree grammars.

ANTLR looks promising! The "Grammar List" seems like a great starting point. I'll look deeper tomorrow. My goal is the tree data structures, which I'd assume come from the runtimes. — Brandon Bloom, Feb 17 '09 at 10:34

score 1 · Answer 3 · answered Jul 16 '14 at 20:31

A standard specific to syntax trees is what one would expect, but more general-purpose standards may also be appropriate. Ira Baxter already mentioned GXL; RDF could be added too, although it would require an appropriate ontology and is oriented more toward semantics than syntax. It may still be an option worth investigating.

For specific standards, Ira Baxter already mentioned ASTM. Another one, although it targets a specific kind of programming language (logic languages), is a standard for semantic/conceptual graphs, known as ISO/IEC 24707:2007.

Not a standard on its own, but a paper about that matter: Towards Portable Source Code Representations Using XML.

I don't know of any standard that is actually in use (in this area, it's home-made cooking everywhere); I'm just interested in this topic too.
