Resolve ambiguity between strings and numbers in lark

Question

I'm writing a grammar for a YAML-like serialization format using a LALR parser, and I've hit a roadblock when parsing scalars. A scalar can be a string or a number (to keep it simple, only decimals or floats). Here's what I have so far, trimmed to the relevant parts:

pair: pair_key ":" _value
_value: scalar | collection

scalar : (string | number) _NL+ 
string : WORD+
number : DECIMAL | FLOAT
DECIMAL : /0|[1-9]\d*/i
FLOAT: /((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))/i
WORD:  /[^-:#()\[\]{}\n\s]+/

// NEWLINE
_NL: /(\r?\n[\t ]*)+/

%import common.WS_INLINE
%ignore WS_INLINE

A string is one or more words. A WORD can contain any character except the ones in the negated set of the WORD regex. I want my strings to be able to contain numbers and still be parsed as strings, which is why the negated set for WORD contains no digits. The problem arises when a string begins with a number, like this:

test_strings = """
a : 28 should be parsed as string
b : 28
"""

The parser can't decide between parsing a number or word when it sees 28 at the beginning.

Here's what I get:

top_map
  pair
    pair_key
      string    a
    scalar
      string
        28
        should
        be
        parsed
        as
        string
  pair
    pair_key
      string    b
    scalar
      string    28

Expected:

top_map
  pair
    pair_key
      string    a
    scalar
      string
        28
        should
        be
        parsed
        as
        string
  pair
    pair_key
      string    b
    scalar
      number    28

How do I go about resolving this ambiguity? Is there a way to do this using only the grammar? Note that I don't want to surround my strings with quotes or other symbols in order to identify them.

Edit

I've solved the problem by giving my number rule a higher priority, like this:

string : number WORD+ | WORD+
number.2 : DECIMAL | FLOAT
DECIMAL.2 : /0|[1-9]\d*/i
FLOAT.2: /((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))/i
WORD:  /[^-:#()\[\]{}\n\s]+/

That way a number will be parsed as a number rather than a WORD. And strings that begin with numbers must have WORDs that come after. So there is no string that's just a number in this modified version.

score 0 · Accepted Answer · answered May 29 '20 at 07:48


It sounds to me like you should keep the grammar as-is, and convert the strings to numbers, when valid, after the parse is done.

You could still use the explicit number rule where it could affect the context of the parse, but here the ambiguity is something that can be resolved afterwards, and that would be the simplest solution.
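A minimal sketch of that post-parse conversion, in plain Python. The helper name `convert_scalar` and the two compiled patterns are hypothetical; the patterns simply mirror the DECIMAL and FLOAT terminals from the question. It assumes the children of a `string` node arrive as a list of WORD texts, for example inside a Transformer callback for the `string` rule.

```python
import re

# Patterns mirroring the DECIMAL and FLOAT terminals from the question.
DECIMAL_RE = re.compile(r"0|[1-9]\d*")
FLOAT_RE = re.compile(r"(\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+e[-+]?\d+", re.IGNORECASE)


def convert_scalar(words):
    """Turn the WORDs of a parsed string into an int, a float, or a str.

    `words` is assumed to be the list of WORD texts under a `string` node.
    Only a single word that fully matches a number pattern becomes a number;
    anything else is joined back into one string.
    """
    if len(words) == 1:
        text = words[0]
        if DECIMAL_RE.fullmatch(text):
            return int(text)
        if FLOAT_RE.fullmatch(text):
            return float(text)
    return " ".join(words)


print(convert_scalar(["28"]))                    # an int
print(convert_scalar(["28", "should", "be"]))    # stays a string
```

Because the decision is made on the complete scalar, a leading number followed by more words is never mistaken for a number, which is the case the parser cannot see in advance.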

Another solution, just for completeness, would be to make the entire string a single regexp (i.e. it will also include the whitespace), and to make sure while writing it that it has to match more than just digits.

Something like:

CHAR: /[^-:#()\[\]{}\n]/
CHAR_ND: /[^-:#()\[\]{}\n\d]/
STRING:  CHAR_ND CHAR* | CHAR* CHAR_ND
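The behaviour of that single-regexp idea can be checked outside the grammar with Python's `re` module. This is only a sketch: the names `STRING_RE` and `RELAXED_RE` are hypothetical, and the relaxed variant is an assumption about the intent ("more than just digits"), not part of the answer above.

```python
import re

# Character classes copied from the terminals above.
CHAR = r"[^-:#()\[\]{}\n]"
CHAR_ND = r"[^-:#()\[\]{}\n\d]"

# STRING as written: the first OR the last character must be a non-digit.
STRING_RE = re.compile(CHAR_ND + CHAR + "*|" + CHAR + "*" + CHAR_ND)

# Hypothetical relaxed variant: a non-digit anywhere in the text is enough.
RELAXED_RE = re.compile(CHAR + "*" + CHAR_ND + CHAR + "*")

for text in ["28", "28 should be parsed as string", "route 66", "28 is 28"]:
    print(
        repr(text),
        bool(STRING_RE.fullmatch(text)),
        bool(RELAXED_RE.fullmatch(text)),
    )
```

With the pattern as written, a bare `28` is rejected and `28 should be parsed as string` is accepted, but a string that both starts and ends with a digit, such as `28 is 28`, is rejected too. The relaxed variant accepts that case while still rejecting digit-only text.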


Erez


Thank you for your answer. If I understand correctly I should let go of my explicit number rule, then when visiting a string node, I should convert it to a number? If conversion worked then it must be a number, otherwise it's just a string? – d34n May 30 '20 at 07:53
Instead what do you think of my solution by increasing the priorities on my number rule? – d34n May 30 '20 at 08:08
Yes, you should disambiguate post-parse. Increasing the priority on the number will cause strings to be misidentified as numbers, because the parser can't see what comes after. – Erez May 31 '20 at 09:46
This is probably what I'll have to do one way or another. You're right, increasing priorities on numbers will cause words that are numbers inside a string to be identified as numbers. And I will have to convert that number to a string post-parse. I'll see what I can do. Thank you for your answer and for your library :) – d34n May 31 '20 at 19:10
