
I'm hard-coding a recursive descent parser, mostly for learning purposes, and I've run into some trouble.

I'll use this short excerpt from the CSS3 grammar as an example:

simple_selector = type_selector | universal;
type_selector = [ namespace_prefix ]? element_name;
namespace_prefix = [ IDENT | '*' ]? '|';
element_name = IDENT;
universal = [ namespace_prefix ]? '*';

At first, I didn't realize that namespace_prefix is an optional part of both type_selector and universal. As a result, type_selector always failed on input like *|*, because it was blindly chosen for any input that matched the namespace_prefix production.

Recursive descent is straightforward enough, but my understanding is that I need to do a lot of (for lack of a better word) exploratory recursion before settling on a production. So I changed the signature of my productions to return Boolean values. This way I can easily tell whether a specific production succeeded or not.

I use a linked list data structure to support arbitrary look-ahead, and can easily slice this list to attempt a production and then return to my starting point if the production doesn't succeed. However, while trying out a production, I'm passing along mutable state, trying to construct a document object model. This isn't really working out because I have no way of knowing whether the production will be successful or not. And if the production isn't successful, I need to somehow undo any changes made.

My question is this. Should I use an abstract syntax tree as an intermediate representation and then go from there? Is this something you would commonly do to work around this problem? Because the issue seems to be primarily with the document object model not being a suitable tree data structure for recursion.

hippietrail
John Leidegren

1 Answer


I'm not intimately familiar with CSS, but in general what you would do is refactor the grammar to eliminate ambiguities as much as you can. In your case here, the namespace_prefix production that can be at the beginning of both type_selector and universal can be pulled out in front as a separate optional production:

simple_selector = [ namespace_prefix ]? (type_selector | universal);
type_selector = element_name;
namespace_prefix = [ IDENT | '*' ]? '|';
element_name = IDENT;
universal = '*';
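
As a rough illustration of why this refactoring helps, here is a minimal Python sketch of a parser for the refactored grammar that never backtracks. It assumes the input has already been tokenized into (kind, text) pairs with the kinds "IDENT", "*" and "|". That token shape, and the names kind_at and parse_simple_selector, are hypothetical and not part of the CSS3 grammar. Note that deciding whether a leading IDENT or '*' belongs to a namespace_prefix still takes one extra token of look-ahead.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (kind, text); kind is "IDENT", "*" or "|" (assumed shape)


def kind_at(tokens: List[Token], i: int) -> Optional[str]:
    """Kind of the token at index i, or None past the end of input."""
    return tokens[i][0] if i < len(tokens) else None


def parse_simple_selector(tokens: List[Token], pos: int = 0):
    """Parse [ namespace_prefix ]? (type_selector | universal).

    Returns ((namespace, name), next_pos). namespace is None when there is
    no prefix and "" for the empty prefix as in "|name".
    """
    namespace = None
    if kind_at(tokens, pos) == "|":
        # namespace_prefix with nothing in front of the '|'
        namespace = ""
        pos += 1
    elif kind_at(tokens, pos) in ("IDENT", "*") and kind_at(tokens, pos + 1) == "|":
        # The second token decides: IDENT or '*' followed by '|' is a prefix.
        namespace = tokens[pos][1]
        pos += 2
    # What remains is a single token: IDENT (type_selector) or '*' (universal).
    if kind_at(tokens, pos) in ("IDENT", "*"):
        return (namespace, tokens[pos][1]), pos + 1
    raise SyntaxError(f"expected IDENT or '*' at token {pos}")
```

Because the choice is made before anything is consumed, there is nothing to undo when building the document object model.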

Not all grammars can be simplified for simple look-ahead like this, though, and for those you can use more complicated shift-reduce parsers, or - as you suggest - backtracking. For backtracking, you usually just attempt to parse productions and record the path through the grammar. Once you have a production that matches the input, you use the recorded path to actually perform the semantic actions for that production.
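
One way to read "record the path" is sketched below in Python, against the original (unrefactored) grammar from the question. This is only an illustration under assumptions: the Parser class, its attempt helper, the (kind, text) token shape and the dict standing in for a DOM node are all hypothetical. Productions still return Boolean values, but instead of mutating the document they append deferred actions to a list. A failed attempt restores the input position and drops whatever it recorded, so nothing ever has to be undone in the document itself.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]            # (kind, text); kind is "IDENT", "*" or "|"
Action = Callable[[dict], None]    # deferred semantic action on the result node


class Parser:
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0
        self.actions: List[Action] = []  # recorded path, replayed only on success

    def accept(self, kind: str) -> bool:
        """Consume one token of the given kind if it is next."""
        if self.pos < len(self.tokens) and self.tokens[self.pos][0] == kind:
            self.pos += 1
            return True
        return False

    def attempt(self, production: Callable[[], bool]) -> bool:
        """Try a production; on failure rewind and discard its recorded actions."""
        saved_pos, saved_len = self.pos, len(self.actions)
        if production():
            return True
        self.pos = saved_pos
        del self.actions[saved_len:]
        return False

    def namespace_prefix(self) -> bool:
        start = self.pos
        if not self.accept("IDENT"):
            self.accept("*")
        if not self.accept("|"):
            return False
        # Two tokens consumed means a name or '*' preceded the '|'.
        text = self.tokens[start][1] if self.pos - start == 2 else ""
        self.actions.append(lambda node: node.update(namespace=text))
        return True

    def type_selector(self) -> bool:
        self.attempt(self.namespace_prefix)  # optional
        if not self.accept("IDENT"):
            return False
        name = self.tokens[self.pos - 1][1]
        self.actions.append(lambda node: node.update(kind="type", name=name))
        return True

    def universal(self) -> bool:
        self.attempt(self.namespace_prefix)  # optional
        if not self.accept("*"):
            return False
        self.actions.append(lambda node: node.update(kind="universal", name="*"))
        return True

    def simple_selector(self) -> bool:
        return self.attempt(self.type_selector) or self.attempt(self.universal)


def parse(tokens: List[Token]) -> dict:
    parser = Parser(tokens)
    if not parser.simple_selector() or parser.pos != len(tokens):
        raise SyntaxError("not a simple_selector")
    node: dict = {}
    for action in parser.actions:  # replay the recorded path
        action(node)
    return node
```

For the input *|*, type_selector first records the namespace action and then fails on the final '*'. The attempt helper discards that action, and universal records its own, so only the actions of the successful path reach the node. The same shape would work with productions that return AST nodes instead of recording closures.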

  • I've considered this, but it doesn't really change anything. The grammar isn't any more or less ambiguous because of it; the productions are still there. And I really enjoy the nature of recursive descent parsing. I'm mostly interested in how to incorporate an AST to simplify the recursive descent code. – John Leidegren Mar 29 '11 at 19:34
  • Of course you cannot be infinitely expressive with just any grammar; you need to take great care when designing your language. But in this case, it's a simple matter of choosing productions using look-ahead. – John Leidegren Mar 30 '11 at 05:13