I'm writing a grammar for a YAML-like serialization format, using an LALR parser, and I've hit a roadblock when parsing scalars. A scalar can be a string or a number (to keep it simple, only decimals or floats). Here's what I have so far, trimmed down to the relevant parts:
pair: pair_key ":" _value
_value: scalar | collection
scalar : (string | number) _NL+
string : WORD+
number : DECIMAL | FLOAT
DECIMAL : /0|[1-9]\d*/i
FLOAT: /((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))/i
WORD: /[^-:#()\[\]{}\n\s]+/
// NEWLINE
_NL: /(\r?\n[\t ]*)+/
%import common.WS_INLINE
%ignore WS_INLINE
A string is one or more words. A WORD can contain any character except the ones in the negated set of the WORD regex. I want my strings to be able to contain numbers and still be parsed as strings, which is why the negated set for WORD contains no digits. The problem arises when a string begins with a number, like this:
test_strings = """
a : 28 should be parsed as string
b : 28
"""
When the parser sees 28 at the beginning of a value, it can't decide between a number and a word, because 28 matches both DECIMAL and WORD.
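To make the overlap concrete, here is a small sketch in plain Python that checks which terminal patterns match a whole token. The regexes are copied from the grammar above; the helper name `matching_terminals` is made up for this illustration, and the snippet only compares the patterns, it does not reproduce how the parser's lexer actually picks a terminal.

```python
import re

# Terminal patterns copied from the grammar above.
DECIMAL = re.compile(r"0|[1-9]\d*", re.I)
FLOAT = re.compile(r"((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))", re.I)
WORD = re.compile(r"[^-:#()\[\]{}\n\s]+")

def matching_terminals(text):
    """Hypothetical helper: names of the terminals whose pattern matches all of `text`."""
    patterns = {"DECIMAL": DECIMAL, "FLOAT": FLOAT, "WORD": WORD}
    return [name for name, pattern in patterns.items() if pattern.fullmatch(text)]

for sample in ["28", "3.14", "should"]:
    print(sample, matching_terminals(sample))
```

Every number token is also a valid WORD, so the patterns alone cannot tell the two apart.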
Here's what I get:
top_map
  pair
    pair_key
      string a
    scalar
      string
        28
        should
        be
        parsed
        as
        string
  pair
    pair_key
      string b
    scalar
      string 28
Expected:
top_map
  pair
    pair_key
      string a
    scalar
      string
        28
        should
        be
        parsed
        as
        string
  pair
    pair_key
      string b
    scalar
      number 28
How do I go about resolving this ambiguity? Is there a way to do it using only the grammar? Note that I don't want to surround my strings with quotes or other symbols just to be able to identify them.
Edit
I've solved the problem by giving my number rule and its terminals a higher priority, and by letting a string start with a number:
string : number WORD+ | WORD+
number.2 : DECIMAL | FLOAT
DECIMAL.2 : /0|[1-9]\d*/i
FLOAT.2: /((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))/i
WORD: /[^-:#()\[\]{}\n\s]+/
That way, a number is parsed as a number rather than as a WORD, and a string that begins with a number must have at least one WORD after it. So in this modified version there is no string that consists of just a number.
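The rule this edit is aiming for can be sketched in plain Python: a scalar that is a single token matching DECIMAL or FLOAT is a number, and anything else made up of WORD tokens is a string. This is only an illustration of the intended outcome, not of how the parser's lexer applies priorities. The helpers `is_number` and `classify_scalar` are hypothetical names, and splitting on whitespace is a simplifying assumption standing in for the ignored WS_INLINE.

```python
import re

# Patterns copied from the edited grammar; WORD is unchanged.
DECIMAL = re.compile(r"0|[1-9]\d*", re.I)
FLOAT = re.compile(r"((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))", re.I)
WORD = re.compile(r"[^-:#()\[\]{}\n\s]+")

def is_number(token):
    """True if the whole token matches DECIMAL or FLOAT."""
    return bool(DECIMAL.fullmatch(token) or FLOAT.fullmatch(token))

def classify_scalar(text):
    """Hypothetical helper: classify the text after ':' as 'number' or 'string'.

    Returns None when a token contains a character excluded from WORD.
    """
    tokens = text.split()  # assumption: whitespace separates tokens
    if len(tokens) == 1 and is_number(tokens[0]):
        return "number"  # a lone number is never a string
    if tokens and all(WORD.fullmatch(t) for t in tokens):
        return "string"  # may start with a number, followed by more words
    return None

for value in ["28 should be parsed as string", "28"]:
    print(repr(value), "->", classify_scalar(value))
```

For the two test lines from the question, this gives a string for a and a number for b, which matches the expected tree.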