
I have an atom rule that first tries to parse everything as either a number or a quoted string; if that fails, it treats the input as a plain string.

Everything parses fine except for one particular case, this very specific string:

DUD 123abc

It fails to parse with the error `Expected " ", "." or [0-9] but "a" found.`

What I expect: it should parse successfully and return "123abc" as a string atom. You can see several of my unsuccessful attempts commented out in the grammar below.

Any help/tips/pointers/suggestions appreciated!


You can try the grammar on the online PEG.js version. I'm using Node v0.8.23 and pegjs 0.7.0.

Numbers that parse correctly:

  • `123`
  • `0`
  • `0.`
  • `1.`
  • `.23`
  • `0.23`
  • `1.23`
  • `0.000`
  • `.` (parsed as a string, not as a number and not as an error)

I want 123abc to be parsed as a string, is this possible?


This is my entire grammar file:

start = lines:line+ { return lines; }

// --------------------- LINE STRUCTURE
line = command:command eol { return command; }

command = action:atom args:(sep atom)*
{
  var i = 0, len = 0;

  for (var i = 0, len = args.length; i < len; i++) {
    // discard parsed separator tokens
    args[i] = args[i][1];
  }

  return [action, args];
}

sep = ' '+
// "\r\n" goes first: PEG choice is ordered, so a leading "\r" alternative would split CRLF
eol = "\r\n" / "\r" / "\n"

atom = num:number { return num; }
     / str:string_quoted { return str; }
     / str:string { return str; }

// --------------------- COMMANDS

// TODO:

// --------------------- STRINGS
string = chars:([^" \r\n]+) { return chars.join(''); }

string_quoted = '"' chars:quoted_chars* '"' { return chars.join(''); }
quoted_chars = '\\"' { return '"'; }
             / char:[^"\r\n] { return char; }

// --------------------- NUMBERS
number = integral:('0' / [1-9][0-9]*) fraction:("." [0-9]*)?
{
  if (fraction && fraction.length) {
    fraction = fraction[0] + fraction[1].join('');
  } else {
    fraction = '';
  }

  integral = integral instanceof Array ?
    integral[0] + integral[1].join('') :
    '0';

  return parseFloat(integral + fraction);
}
        / ("." / "0.") fraction:[0-9]+
{
  return parseFloat("0." + fraction.join(''));
}

/*
float = integral:integer? fraction:fraction { return integral + fraction; }

fraction = '.' digits:[0-9]* { return parseFloat('0.' + digits.join('')); }

integer = digits:('0' / [1-9][0-9]*)
{
  if (digits === '0') return 0;
  return parseInt(digits[0] + digits[1].join(''), 10);
}

*/
chakrit

2 Answers


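Solved this by adding a negative look-ahead, `!([0-9\.]+[^0-9\. \r\n])`, in front of the number rule.

I know that the atom rule will still match through its string alternative, so what the look-ahead effectively does is make the number rule fail a bit sooner, before it consumes the leading digits. The excluded class also lists space and the line-break characters, because otherwise a number followed by a separator or end of line would be rejected too and come back as a string. Hopefully this helps someone with ambiguous cases in the future.

So the number rule now becomes:

number = !([0-9\.]+[^0-9\. \r\n]) integral:('0' / [1-9][0-9]*) fraction:("." [0-9]*)?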
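
To make the effect of such a guard visible outside PEG.js, the ordered choice in `atom` can be imitated in plain JavaScript. This is only a rough sketch under assumptions, not generated parser code: the names `parseAtom`, `matchAt` and `NUMBER_GUARD` are made up for illustration, sticky regular expressions stand in for the grammar rules, and the guard used here stops at space and line breaks so that a number followed by a separator still counts as a number.

```javascript
// Sticky regexes stand in for the grammar rules (illustrative names only).
const NUMBER = /(?:0|[1-9][0-9]*)(?:\.[0-9]*)?|\.[0-9]+/y;
const QUOTED = /"((?:\\"|[^"\r\n])*)"/y;
const STRING = /[^" \r\n]+/y;
// Imitates a guard of the form !([0-9.]+[^0-9. \r\n]):
// a run of digits/dots immediately followed by a non-separator character.
const NUMBER_GUARD = /[0-9.]+[^0-9. \r\n]/y;

// Try to match `re` exactly at `pos`; returns the match array or null.
function matchAt(re, input, pos) {
  re.lastIndex = pos;
  return re.exec(input);
}

// Ordered choice: the first alternative that matches wins, as in a PEG.
function parseAtom(input, pos, useGuard) {
  const blocked = useGuard && matchAt(NUMBER_GUARD, input, pos) !== null;
  if (!blocked) {
    const num = matchAt(NUMBER, input, pos);
    if (num) return { value: parseFloat(num[0]), end: pos + num[0].length };
  }
  const quoted = matchAt(QUOTED, input, pos);
  if (quoted) {
    return { value: quoted[1].replace(/\\"/g, '"'), end: pos + quoted[0].length };
  }
  const str = matchAt(STRING, input, pos);
  if (str) return { value: str[0], end: pos + str[0].length };
  return null;
}
```

Without the guard, `parseAtom("123abc", 0, false)` returns the number 123 and stops at index 3, leaving `abc` for the surrounding rule to reject. With the guard enabled, the number alternative is skipped and the whole token comes back as the string "123abc", while `parseAtom("123 456", 0, true)` still yields the number 123.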

chakrit
  • I think that checking that the character trailing ``number`` is a number separator (not an alphanum) would have also worked, and more cheaply. – Apalala Apr 18 '13 at 18:45
  • @Apalala ah, that is a good idea. Would upvote it if you added it as answer tho. – chakrit Apr 19 '13 at 03:57
  • In [Grako](https://bitbucket.org/apalala/grako) I added an (optional) automatic check for alphanums after every token that is alphanumeric. It avoids matching "ID" when the stream says "IDENTIFICATION". I haven't had to switch it off so far. – Apalala Apr 19 '13 at 12:51

I think that checking that the character trailing number is a number-separator (not an alphanum) would have also worked, and more cheaply.

number = integral:('0' / [1-9][0-9]*) fraction:("." [0-9]*)? !([0-9A-Za-z])
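
The trailing check can be tried out in plain JavaScript as well. The sketch below is an approximation under assumptions rather than PEG.js output: `numberOrString` and the constant names are hypothetical, quoted strings are left out for brevity, and the check is applied to the number rule as a whole. The number is matched greedily first and the next character is inspected separately, so the pattern never backtracks into a shorter number, which is how a PEG behaves.

```javascript
// Illustrative stand-ins for the number and bare-string rules.
const NUM = /(?:0|[1-9][0-9]*)(?:\.[0-9]*)?|\.[0-9]+/y;
const BARE = /[^" \r\n]+/y;
const ALNUM = /[0-9A-Za-z]/;

// Returns { value, end } for the token starting at `pos`, or null.
function numberOrString(input, pos) {
  NUM.lastIndex = pos;
  const num = NUM.exec(input);
  if (num) {
    const end = pos + num[0].length;
    // Mirrors !([0-9A-Za-z]): accept the number only when no letter or
    // digit follows it. At end of input charAt() gives "" and the check passes.
    if (!ALNUM.test(input.charAt(end))) {
      return { value: parseFloat(num[0]), end: end };
    }
  }
  // The number was rejected, so fall through to the plain string rule.
  BARE.lastIndex = pos;
  const str = BARE.exec(input);
  return str ? { value: str[0], end: pos + str[0].length } : null;
}
```

With this, `numberOrString("123abc", 0)` gives the string "123abc" and `numberOrString("123 abc", 0)` gives the number 123. Only one character after the number is inspected, which is what makes this variant cheaper than scanning the digits twice.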
Apalala