5

I'm working on a project which involves optimizing certain constructs in a very small subset of Java, formalized in BNF.

If I were to do this in Java, I would use a combination of JTB and JavaCC, which builds an AST. Visitors are then used to manipulate the tree. But given the wide range of parsing libraries in Haskell (parsec, happy, alex, etc.), I'm a bit confused about choosing the appropriate one.

So, simply put, when a language is specified in BNF, which library offers the easiest means to build an AST? And what is the best way to go about modifying this tree in idiomatic Haskell?

Vamshi Surabhi
  • 2
    There's also a cool, more experimental possibility, using template haskell to construct parsers from BNF given in the source, called [bnfc-meta](http://wiki.portal.chalmers.se/cse/pmwiki.php/FP/EmbeddedParserGenerators). I don't know how usable it is for real-life applications, though. – phipsgabler Sep 10 '13 at 14:22
  • 1
    (In case you are new to Haskell...) There's no real need in Haskell or other modern functional languages like SML and Caml for a tree generator like JTB. **Algebraic types** are as concise as tree specifications for JTB or similar tools - and they are the standard means of defining data types in functional languages. In Haskell you can also auto derive many useful functions on data types like showing, serialization and structural equality. – stephen tetley Sep 10 '13 at 21:45
  • 1
    There are 2 advanced tools for AST manipulation and optimization: `uuagc`, a preprocessor for attribute grammars, and `hoopl`, a generic library for composable, fact-lattice-based optimization. – nponeccop Sep 11 '13 at 09:32

5 Answers

7

Well, in Haskell there are 2 main ways of parsing something: parser combinators or a parser generator. Since you already have a BNF, I'd suggest the latter.

A good pair is alex (a lexer generator) and happy (a parser generator). GHC's own lexer and parser are written with these, IIRC, so you'd be in good company.
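For comparison, here is a rough sketch of the combinator route, using only `Text.ParserCombinators.ReadP` from base. The toy grammar, the `Expr` type and the names `lexeme`, `expr`, `term`, `number` and `parseExpr` are all made up for illustration; they are not taken from the asker's Java subset. The point is only that each BNF rule maps onto one small parser.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Hypothetical toy grammar (an assumption, not the asker's language):
--   expr ::= term ('+' term)*
--   term ::= number | '(' expr ')'
data Expr = Lit Int
          | Add Expr Expr
          deriving (Show, Eq)

-- Run a parser, then skip any whitespace that follows the token.
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

-- expr ::= term ('+' term)*   (built left-associatively)
expr :: ReadP Expr
expr = chainl1 term (Add <$ lexeme (char '+'))

-- term ::= number | '(' expr ')'
term :: ReadP Expr
term = number +++ between (lexeme (char '(')) (lexeme (char ')')) expr

number :: ReadP Expr
number = lexeme (Lit . read <$> munch1 isDigit)

-- Succeed only if the whole input is consumed with exactly one parse.
parseExpr :: String -> Maybe Expr
parseExpr s = case readP_to_S (skipSpaces *> expr <* eof) s of
  [(e, "")] -> Just e
  _         -> Nothing
```

In GHCi, `parseExpr "1 + (2 + 3)"` should give back a nested `Add` value. For a grammar of any real size, parsec or a generator is likely the better fit; ReadP is used here only because it ships with GHC.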

Next you'll have a big honking stack of data declarations to parse into:

data JavaClass = JavaClass {
    className :: Name,
    interfaces :: [Name],
    contents :: [ClassContents]
    -- ...
 }
data ClassContents = M Method
                   | F Field
                   | IC InnerClass

and for expressions and whatever else you need. Finally you'll combine these into something like

data TopLevel = JC JavaClass
              | WhateverOtherForms
              | YouWillParse

Once you have this, you'll have the entire AST represented as one TopLevel or a list of them, depending on how many classes/files you parse.

How to proceed from here depends on what you want to do. There are a number of libraries, such as syb (scrap your boilerplate), that let you write very concise tree traversals and modifications. lens is also an option. At a minimum, check out Data.Traversable and Data.Foldable.
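As a small illustration of what Data.Foldable and Data.Traversable buy you, here is a sketch with a made-up expression type. It assumes the AST is parameterised over its variable type, which is what lets GHC derive the instances. `Expr`, `variables`, `rename` and `resolve` are hypothetical names.

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}

import Data.Foldable (toList)

-- Hypothetical expression type, parameterised over the variable type
-- so the derived instances know which fields to visit.
data Expr v = Var v
            | Lit Int
            | Add (Expr v) (Expr v)
            deriving (Show, Eq, Functor, Foldable, Traversable)

-- Foldable: collect every variable, left to right.
variables :: Expr v -> [v]
variables = toList

-- Functor: rename every variable without touching the tree shape.
rename :: (v -> w) -> Expr v -> Expr w
rename = fmap

-- Traversable: look up every variable; Nothing if any one is unbound.
resolve :: Eq v => [(v, Int)] -> Expr v -> Maybe (Expr Int)
resolve env = traverse (`lookup` env)
```

This only reaches the type parameter, so it suits "do something with every variable/annotation" tasks. Rewriting whole subtrees needs something like syb or a hand-written traversal.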

To modify the tree, you can do something as simple as

ignoreInnerClasses :: JavaClass -> JavaClass
ignoreInnerClasses c = c{contents = filter (not . isInnerClass) $ contents c}
 --                     ^^^ that is called a record update
    where isInnerClass (IC _) = True
          isInnerClass _      = False

and then you could potentially use something like syb to write

 everywhere (mkT ignoreInnerClasses) toplevel

which will traverse everything and apply ignoreInnerClasses to all JavaClasses (provided your AST types derive Data). This is possible to do in lens and many other libraries too, but syb is very easy to read.
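Here is a self-contained sketch of the same `everywhere`/`mkT` pattern applied to an optimisation, since that is what the question is after. The mini-AST and the names `foldConst` and `optimise` are made up for illustration, and the code assumes the syb package is installed.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}

import Data.Data (Data)
import Data.Generics (everywhere, mkT)  -- from the syb package

-- Hypothetical mini-AST; a real one would mirror your grammar.
data Expr = Lit Int
          | Var String
          | Add Expr Expr
          | Mul Expr Expr
          deriving (Show, Eq, Data)

data Stmt = Assign String Expr
          | Block [Stmt]
          deriving (Show, Eq, Data)

-- A purely local rewrite: fold one node if both operands are literals.
foldConst :: Expr -> Expr
foldConst (Add (Lit a) (Lit b)) = Lit (a + b)
foldConst (Mul (Lit a) (Lit b)) = Lit (a * b)
foldConst e                     = e

-- everywhere applies the rewrite bottom-up, so children are folded
-- before their parent is inspected. mkT makes foldConst act as the
-- identity on every node that is not an Expr.
optimise :: Stmt -> Stmt
optimise = everywhere (mkT foldConst)
```

Note that `foldConst` never recurses itself; the traversal is entirely supplied by `everywhere`, which is why adding a new statement form does not require touching the optimisation.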

daniel gratzer
  • I liked the clear answer. Even though I'm dealing with just a subset, it is too much of work to start writing data types from scratch. However, I plan to use syb for transformations. – Vamshi Surabhi Sep 12 '13 at 08:58
4

Alex + Happy.

There are many approaches to modify/investigate the parsed terms (ASTs). The keyword to search for is "datatype-generic" programming. But beware: it is a complex topic ...

http://people.cs.uu.nl/andres/Rec/MutualRec.pdf

http://www.cs.uu.nl/wiki/GenericProgramming/Multirec

A generic implementation of the zipper built on multirec is available here:

http://hackage.haskell.org/packages/archive/zipper/0.3/doc/html/Generics-MultiRec-Zipper.html
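If zippers are new to you, the idea can be sketched by hand for a single type. This is not the API of the package linked above; `Expr`, `Crumb`, `Zipper` and the navigation functions are hypothetical names, written out only to show what the generic version automates.

```haskell
-- Hypothetical expression type used only for this illustration.
data Expr = Lit Int
          | Add Expr Expr
          deriving (Show, Eq)

-- One step of context: which side we descended into, plus the sibling.
data Crumb = InLeft Expr   -- focus is the left child; holds the right sibling
           | InRight Expr  -- focus is the right child; holds the left sibling
           deriving (Show, Eq)

-- A zipper is the focused subtree plus the path back to the root.
type Zipper = (Expr, [Crumb])

fromExpr :: Expr -> Zipper
fromExpr e = (e, [])

downLeft, downRight, up :: Zipper -> Maybe Zipper
downLeft  (Add l r, cs) = Just (l, InLeft r : cs)
downLeft  _             = Nothing
downRight (Add l r, cs) = Just (r, InRight l : cs)
downRight _             = Nothing
up (e, InLeft r  : cs)  = Just (Add e r, cs)
up (e, InRight l : cs)  = Just (Add l e, cs)
up (_, [])              = Nothing

-- Edit the focused subtree without rebuilding the rest of the tree.
modify :: (Expr -> Expr) -> Zipper -> Zipper
modify f (e, cs) = (f e, cs)

-- Walk back up to the root and return the whole tree.
toExpr :: Zipper -> Expr
toExpr z = maybe (fst z) toExpr (up z)
```

A typical use is to navigate with `downRight (fromExpr t) >>= downLeft`, apply `modify`, and call `toExpr` to get the edited tree back.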

Also check out https://github.com/pascalh/Astview

Bastl
4

I've never used bnfc-meta (suggested by @phg), but I would strongly recommend you look into BNFC (on hackage: http://hackage.haskell.org/package/BNFC). The basic approach is that you write your grammar in an annotated BNF style, and it will automatically generate an AST, parser, and pretty-printer for the grammar.

How suitable BNFC is depends upon the complexity of your grammar. If it's not context-free, you'll likely have a difficult time making any progress (I did have some success hacking up context-sensitive extensions, but that code has likely bit-rotted by now). The other downside is that your AST will very directly reflect the grammar specification. But since you already have a BNF specification, adding the necessary annotations for BNFC should be rather straightforward, so it's probably the fastest way to get a usable AST. Even if you decide to go a different route, you might be able to take the generated data types as a starting point for a hand-written version.

John L
2

You might also check out the Haskell Compiler Series which is nice as an introduction to using alex and happy to parse a subset of Java: http://bjbell.wordpress.com/haskell-compiler-series/.

ccatalfo
2

Since your grammar can be expressed in BNF, it is context-free, and for a small programming-language subset it is very likely in (or easily rewritten into) the class of grammars that are efficiently parseable with a shift-reduce parser (LALR grammars). Such efficient parsers can be generated by the parser generator yacc/bison (C, C++), or its Haskell equivalent "Happy".

That's why I would use "Happy" in your case. It takes grammar rules in BNF form and generates a parser from them directly. The resulting parser will accept the language that is described by your grammar rules and, through the semantic action you attach to each rule, produce an AST (abstract syntax tree). The Happy user guide is quite nice and gets you started quickly: http://www.haskell.org/happy/doc/html/

To transform the resulting AST, generic programming is a good idea. Here is a classic explanation of how to do this in Haskell in a practical fashion, from scratch: http://research.microsoft.com/en-us/um/people/simonpj/papers/hmap/
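The core idea can be sketched for one concrete type without any library: a function that maps over the immediate children of a node, and a recursive combinator built on top of it. The names `Expr`, `mapChildren`, `bottomUp` and `simplify` are made up for this illustration. The generic-programming approach replaces the hand-written `mapChildren` with one that works for every data type.

```haskell
-- Hypothetical mini-AST for illustration.
data Expr = Lit Int
          | Var String
          | Neg Expr
          | Add Expr Expr
          deriving (Show, Eq)

-- Apply a function to each immediate child, leaving the node itself alone.
-- This is the only function that has to know every constructor.
mapChildren :: (Expr -> Expr) -> Expr -> Expr
mapChildren f (Neg e)   = Neg (f e)
mapChildren f (Add l r) = Add (f l) (f r)
mapChildren _ e         = e

-- Bottom-up traversal: rewrite the children first, then the node.
bottomUp :: (Expr -> Expr) -> Expr -> Expr
bottomUp f = f . mapChildren (bottomUp f)

-- Example local rewrites: x + 0 = x, 0 + x = x, -(-x) = x.
simplify :: Expr -> Expr
simplify (Add e (Lit 0)) = e
simplify (Add (Lit 0) e) = e
simplify (Neg (Neg e))   = e
simplify e               = e
```

Each optimisation is then just a small non-recursive function like `simplify`, run over the whole tree with `bottomUp simplify`.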

I have used exactly this to build a compiler for a small domain specific language, and it was a simple and concise solution.

linse