
For a parser I am creating, I use this regular expression as the definition of an ID:

ID: /[a-z_][a-z0-9]*/i

(For anyone who is not familiar with the syntax of the particular parser I'm using, the "i" flag simply means case-insensitive.)

I also have a number of keywords, like this:

CALL_KW: "call"
PRINT_KW: "print"

The problem is that, due to some ambiguities in the grammar, keywords are sometimes treated as IDs, which I really don't want. So I was wondering whether I could rewrite the regular expression for ID in such a way that keywords are not matched by it at all. Is such a thing possible?

To give some more context, I'm using the Lark parser library for Python. The Earley parser Lark provides (together with the dynamic lexer) is quite flexible and powerful in handling ambiguous grammars, but it sometimes does weird things like this (and non-deterministically, at that!). So I'm trying to give the parser some help here, by making keywords never match the ID rule.
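
To illustrate the kind of ambiguity I mean, here is a simplified sketch of a grammar (the rules are made up for this example; they are not my actual grammar). With rules like these, a line consisting of just "else" can be read either as the else keyword or as a call to a routine named "else":

start: stmt+
stmt: call_stmt | else_stmt
call_stmt: CALL_KW ID | ID    // a bare ID also counts as a routine call
else_stmt: ELSE_KW
CALL_KW: "call"
ELSE_KW: "else"
ID: /[a-z_][a-z0-9]*/i
%import common.WS
%ignore WS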

Elektito
  • Not really sure what you mean by sometimes keywords are treated as ID's, but you might set boundaries on the left and the right side of your pattern. Perhaps try a word boundary `\b` to prevent ID being part of a larger word, or lookarounds if they are supported. – The fourth bird May 12 '19 at 10:52
  • Do you mean the parser patterns or lexer patterns? These are lexer patterns and obviously there are boundaries in the parser rules. The grammar is ambiguous however, and there are multiple choices. For example, the single line "else" can either be interpreted as calling a routine "else" or the keyword else. I want the lexer to never decide on the routine call, since this is a keyword. Also, I can't change the language itself, just how I parse and interpret it. – Elektito May 12 '19 at 11:37
  • Can't you force Lark to produce a non-contextual lexer using `lexer='standard'`, as indicated in the [docs](https://lark-parser.readthedocs.io/en/latest/parsers/)? Or do you depend on this feature elsewhere in the grammar? – rici May 12 '19 at 14:31
  • I can use a less intelligent lexer, but then "n-1" will be parsed as two tokens: "n" and "-1". So yeah, I'm otherwise dependent on the dynamic lexer. – Elektito May 12 '19 at 14:38
  • I suppose you want `n-1` to be parsed as three tokens, as usual. Conventional wisdom is that there is really nothing to be gained by allowing numeric literals to be tokenised with a sign character. It's almost always better to consider `-1` to be two tokens; you can (and should) constant fold after the parse. Essentially, it is surprising if `-1` and `- 1` turn out to be syntactically or semantically different in obscure contexts (see the sketch after these comments). – rici May 12 '19 at 14:45
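
To make the tokenisation point in the comments concrete: with Lark's standard lexer and `-` as its own terminal, "n-1" lexes as three tokens. A minimal sketch (the grammar here is illustrative, not the actual one):

from lark import Lark

# Illustrative grammar where '-' is a separate terminal, so numeric
# literals are never tokenised with a sign character.
grammar = r'''
start: expr
expr: expr MINUS atom | atom
atom: ID | NUMBER
MINUS: "-"
ID: /[a-z_][a-z0-9]*/i
NUMBER: /[0-9]+/
%import common.WS
%ignore WS
'''

parser = Lark(grammar, parser='earley', lexer='standard')
print([(t.type, t.value) for t in parser.lex("n-1")])
# [('ID', 'n'), ('MINUS', '-'), ('NUMBER', '1')]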

2 Answers


I believe Lark uses ordinary Python regular expressions, so you can use a negative lookahead assertion to exclude keywords. But you must take care not to reject names that merely start with a keyword:

ID: /(?!(else|call)\b)[a-z_][a-z0-9]*/i

This regular expression certainly works in Python 3:

>>> import re
>>> # Test with just the word
>>> for test_string in ["x", "xelse", "elsex", "else"]:
...   m = re.match(r"(?!(else|call)\b)[a-z_][a-z0-9]*", test_string)
...   if m: print("%s: Matched %s" % (test_string, m.group(0)))
...   else: print("%s: No match" % test_string)
... 
x: Matched x
xelse: Matched xelse
elsex: Matched elsex
else: No match

>>> # Test with the word as the first word in a string
>>> for test_string in [word + " and more stuff" for word in ["x", "xelse", "elsex", "else"]]:
...   m = re.match(r"(?!(else|call)\b)[a-z_][a-z0-9]*", test_string)
...   if m: print("%s: Matched %s" % (test_string, m.group(0)))
...   else: print("%s: No match" % test_string)
... 
x and more stuff: Matched x
xelse and more stuff: Matched xelse
elsex and more stuff: Matched elsex
else and more stuff: No match
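
If the keyword list grows, you can build the alternation programmatically rather than writing it out by hand. A sketch (the keyword list here is just an example):

>>> import re
>>> keywords = ["else", "call", "print"]
>>> id_pattern = r"(?!(?:%s)\b)[a-z_][a-z0-9]*" % "|".join(map(re.escape, keywords))
>>> re.match(id_pattern, "printer", re.I).group(0)
'printer'
>>> print(re.match(id_pattern, "print", re.I))
None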
rici
  • Not sure I get this. What is the `\b`? With it there, this doesn't work at all. Without it, "else" doesn't match and "xelse" does, which is good, but "elsex" doesn't match, while I need it to. – Elektito May 12 '19 at 17:08
  • @elektito: it's a word boundary match. I have no idea why it doesn't work but I'll see if I can figure it out when I get home. It should fix the problem with "elsex". – rici May 12 '19 at 19:03
  • Is that something that needs to be in the input text? Because, I have no control over that. – Elektito May 12 '19 at 20:58
  • @elektito: it matches the boundary between a word and a non-word, which is of length zero but is certainly "something in the text" in the sense that it matches a feature of the text. That is, it matches the boundary between a word character (`[a-zA-Z0-9_]`) and a non-word character (anything else or the end of the string). See https://docs.python.org/3.7/howto/regex.html#more-metacharacters for a longer explanation. – rici May 13 '19 at 00:33
  • @elektito: also the regular expression certainly works in Python, and I can't help you with the problem you have using it with Lark unless you show me the code which fails. – rici May 13 '19 at 00:33
  • Well, I tried your code snippet, and it certainly works. Thanks. – Elektito May 13 '19 at 07:09

There are several ways to keep values like your keywords from being matched as IDs.

RegEx 1

For instance, you could use a capturing group in your expression, maybe something similar to:

    ([a-z]+_[a-z0-9]+)
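
For example, this matches names that contain an underscore, but not bare words such as `call` (a quick check; the names tested here are hypothetical):

>>> import re
>>> re.fullmatch(r"([a-z]+_[a-z0-9]+)", "my_var").group(0)
'my_var'
>>> print(re.fullmatch(r"([a-z]+_[a-z0-9]+)", "call"))
None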



RegEx 2

Another way would be to bound your expression on the right using the `:`; then you could use an expression similar to:

(\w+):


or your original expression with an i flag:

([a-z0-9_]+):

You can add more boundaries to it, if you wish.
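
For example, run against a line of the grammar file itself (the line here is illustrative):

>>> import re
>>> re.match(r"([a-z0-9_]+):", 'CALL_KW: "call"', re.I).group(1)
'CALL_KW'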

Emma
  • Unfortunately, I don't think I understand your approach. Could you please provide a regular expression that matches all of those `[a-z_][a-z0-9]*` matches, but not a few keywords like `call` and `print`? – Elektito May 12 '19 at 19:27