
My grammar file test.ebnf looks like this:

start = identifier ;

identifier =
  /[a-z]*/ rest;

rest = /[0-9]*/ ;

When I run this grammar on the input "test1234", I want it to yield "test1234" as a single lexeme, but instead the AST looks like this:

AST:
['test', '1234']

I've tried running with the nameguard feature set to False, with no luck. How can I get this behaviour without writing the rule as identifier = /[a-z]*[0-9]*/?

Charles

1 Answer


Grako will always return a list with one item per element on a rule's right-hand side, except when there is only one element. Even with named elements, multiple matches under the same name are returned as a list. Simply concatenating the elements would not be a reasonable default, because their ASTs may be objects as complex as the project requires.
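To see why a blind join cannot be the default, note that an AST element may itself be a nested structure rather than a plain string. A small sketch in plain Python (the AST values here are illustrative, not produced by Grako):

```python
# A flat AST of strings joins cleanly:
flat_ast = ['test', '1234']
assert ''.join(flat_ast) == 'test1234'

# But an element may itself be a list (e.g. the AST of a
# sub-rule), and then a blind join fails:
nested_ast = ['test', ['12', '34']]
try:
    ''.join(nested_ast)
    join_failed = False
except TypeError:
    # str.join() accepts only strings, not arbitrary AST nodes
    join_failed = True
assert join_failed
```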

In your case, you can use a semantic action to join the identifier parts:

def identifier(self, ast):
    # ast is the default AST for the rule: a list of the
    # right-hand-side matches, e.g. ['test', '1234']
    return ''.join(ast)
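In Grako, semantic actions are methods on a semantics object, named after the rules they handle; the method's return value replaces the rule's default AST. A minimal sketch (the class name is illustrative, and the call below simulates what the parser does rather than invoking a generated parser):

```python
class IdentifierSemantics:
    """Semantics object: the parser calls the method named after
    each rule with that rule's AST, and the return value becomes
    the AST for that rule."""

    def identifier(self, ast):
        # Join the parts of the identifier into a single lexeme
        return ''.join(ast)

# Simulating the parser's call for the identifier rule:
semantics = IdentifierSemantics()
result = semantics.identifier(['test', '1234'])
# result == 'test1234'
```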

Or redefine the identifier rule to have a single element:

identifier
    =
    /[a-z]+[0-9]*|[a-z]*[0-9]+/
    ;

(Note the changes to the regular expression, so that it never matches the empty string.)
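You can verify the combined pattern's behaviour with Python's `re` module, independently of Grako (`re.fullmatch` anchors the pattern to the whole string):

```python
import re

pattern = re.compile(r'[a-z]+[0-9]*|[a-z]*[0-9]+')

# Matches letters followed by optional digits, or digits alone...
assert pattern.fullmatch('test1234')
assert pattern.fullmatch('test')
assert pattern.fullmatch('1234')

# ...but, unlike /[a-z]*[0-9]*/, never the empty string:
assert pattern.fullmatch('') is None
```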

Apalala
  • Thanks. I decided to use semantic actions here, begrudgingly though. I don't think this logic belongs in a semantic action; it's a surprise to look at the lexical rules and not realise they're being lexically munged by a semantic action in another file. When rules are added in the future by working from existing examples, it won't be clear how to get the same behaviour. Is there not a way to name subregexes, so I can break complicated lexical elements down without having to add hacks to the semantic actions? I had this feature in Lex! – Charles Jan 03 '15 at 12:37
  • The result of a rule that parses is entirely the responsibility of semantics/semantic actions. It just happens that there are default semantics, which are quite well specified. The default result is a [parse tree](http://en.wikipedia.org/wiki/Parse_tree) in which sequences in right hand sides are represented as lists. Naming and semantic actions will create an [abstract syntax tree](http://en.wikipedia.org/wiki/Abstract_syntax_tree). This is all pretty much standard in parser generators, but **Grako** has the policy of trying not to contaminate grammars with semantics. – Apalala Jan 03 '15 at 19:44
  • Is there no way to name parts of a regular expression for reuse? It would be really helpful if I could say rule = /[a-z]SOME_RE[0-9]/ where SOME_RE is a regular expression, as I can in Flex. – Charles Jan 03 '15 at 20:34
  • You can use rule includes to reuse rule content. – Apalala Jan 13 '15 at 11:44
  • Sorry, I appreciate your help, but I don't follow how that solves my problem. I'm going to use a different library, I'm struggling too much with this one. – Charles Jan 14 '15 at 21:59
  • In Pascal, identifier is defined as `identifier: Letter {Letter | Digit}` and `Letter` and `Digit` are defined quite nicely. If we would be able to join stuff as @dune.rocks requested, there would be no pain at all. But we cannot, and it is very inconvenient. I cannot see how we define pointer, for example, as `PNTR: identifier '^'`, and so on. One cannot propagate regexes upward forever, it is just ugly. – dmitry_romanov Jun 15 '17 at 16:20
  • @dmitry_romanov The grammar syntax for Pascal you quote is one of many possible grammar syntaxes. Most syntaxes require that lexical elements be defined separately (for example ANTLR). Most of the syntaxes in the parser generators I know use regular expressions for lexical elements. The best definition for a particular language varies with the grammar syntax and the preferences of the designer. – Apalala Jun 15 '17 at 22:22
  • I agree, thank you. My point was that one can write `fragment rule: ..` instead of `rule: ...` (in ANTLR), and the fragments _are glued together_. I didn't check it for the case described by @dune.rocks, though. I do not see how to glue stuff together without resorting to semantic actions in TatSu or Grako. – dmitry_romanov Jun 16 '17 at 08:33