ANTLR4 - How to turn off "longest-match-wins" and use a first-match rule?

Question

Original question:

My code to parse: N100G1M4. What I expected: N100 G1 M4. But ANTLR cannot identify this, because ANTLR always matches the longest substring. How do I handle this case?

Update

What I am going to do:

I am trying to parse CNC G-code text and extract keywords from a file stream. G-code is usually used to control a machine and drive its motors.

The G-code grammar is:

// Define a grammar called GCode
grammar GCode;

script  : blocks+ EOF;

blocks: 
      assign_stat
    | ncblock 
    | NEWLINE
    ;

ncblock : 
     ncelements  NEWLINE  // 
    ;
ncelements :
        ncelement+
    ;

ncelement 
    :   
        LINENUMEXPR    // linenumber N100 
    |   GCODEEXPR   // G10 G54.1
    |   MCODEEXPR   // M30
    |   coordexpr   // X100 Y100 Z[A+b*c]
    |   FeedExpr    // F10.12
    |   AccExpr     // E2.0
    // |   callSubroutine 
    ;

assign_stat: 
        VARNAME '=' expression NEWLINE
    ;

expression: 
       multiplyingExpression  (('+' | '-') multiplyingExpression)*
    ;

multiplyingExpression
   : powExpression (('*' | '/') powExpression)*
   ;

powExpression
   : signedAtom ('^' signedAtom)*
   ;

signedAtom
   : '+' signedAtom
   | '-' signedAtom
   | atom
   ;

atom
   : scientific
   | variable
   | '(' expression ')'
   ;

LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr: 
        CoordExpr
    |   ParameterKeyword getValueExpr
    ;

getValueExpr: 
        '[' expression ']'
    ;

CoordExpr 
        : 
         ParameterKeyword SCIENTIFIC_NUMBER
        ;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;



fragment
GPOSTFIX
    : Digit+ ('.' Digit+)*
    ;

variable
   : VARNAME
   ;

scientific
   : SCIENTIFIC_NUMBER
   ;

SCIENTIFIC_NUMBER
   : SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
   ;

fragment NUMBER
   : ('0' .. '9') + ('.' ('0' .. '9') +)?
   ;

HEX_INTEGER
 : '0' [xX] HEX_DIGIT+
 ;

fragment HEX_DIGIT
 : [0-9a-fA-F]
 ;
 
INT : Digit+;

fragment
Digit : [0-9];

fragment 
SIGN
   : ('+' | '-')
   ;

VARNAME
    : [a-zA-Z_][a-zA-Z_0-9]*
    ;

NEWLINE 
    : '\r'? '\n'
    ;

WS : [ \t]+ -> skip ; // skip spaces and tabs (newlines are NEWLINE tokens)

Sample program (it works well except for the last line):

N200 G54.1
a = 100
b = 10
c = a + b 
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20

N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line

I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 into

-> N100 G1 X100 Z[VAR1+100]
-> or, even better, split the node X100 into two subnodes, X and 100.

I am trying to use ANTLR, but ANTLR always applies the "longest match wins" rule, so N100G1X100 is recognized as a single word.

Additional question: What is the best tool for this task?

First: Your rule is incomplete, expr is defined but not used (I assume that keyword derives to `N|...|H expr`). Second: Perhaps you just put the `keyword+` rule into the lexer instead of the parser. Your parsing example seems to do other things than you describe above. I think you should clarify these points and perhaps post the real antlr grammar you tried. — CoronA, Mar 12 '22 at 05:46
When I google gcode, I always see examples where the words/ids are separated by spaces. Looks like the stream you get your gcode sources from shouldn't do that, and should just provide valid gcode with spaces. That way, you won't have any issue parsing it. If the input without spaces is valid, then please add a link to the specification of gcode that explains the language. — Bart Kiers, Mar 12 '22 at 06:51
I think there's an easy solution for your problem, but we need to see your grammar attempt first, to be sure. — Mike Lischke, Mar 12 '22 at 10:01
Hello @MikeLischke, I have updated the question with the grammar file I am working on. Thanks, I look forward to your feedback. — zhhui, Mar 13 '22 at 14:36
Hello @BartKiers, you are totally right that most CNC manufacturers define G-code with words and ids separated by spaces. The reason I have to support this feature is that I must stay compatible with our last-generation CNC system, which supports a programming style like ``G1X100Y200``. Unfortunately, I can't find any links that explain this. — zhhui, Mar 13 '22 at 14:42
Is this valid in your flavor of GCode: `N1000 = 42` and then later somewhere you have `Z[N1000+100]`? If that is valid, how can the lexer (or parser) make a distinction between `N1000 = ...` and the N1000 in `N100G1X100...`? — Bart Kiers, Mar 13 '22 at 15:05
Hello @BartKiers, sure, it's invalid to declare a variable ``N1000 = 42``. How to distinguish between ``gcode standard`` words and an ``ambiguous variable name`` (or, later, a function name) is the most important problem for the interpreter I want to build. I'm still trying to find a rule that solves this. Honestly, I don't know how to do this yet. — zhhui, Mar 14 '22 at 00:58
Let's break it down to a super simple question. If you can answer that, we may have a solution, otherwise it's impossible to do what you want. Take the input `N100` only. Is this a variable name or a line number expression and how to distinguish both? — Mike Lischke, Mar 14 '22 at 07:58
For ``N100``, it's a G-code line number. We call a sample like ``N100 X100Y100`` an ``NC block``. An ``NC block`` must start with ``NC elements`` (``N100``, ``X10.5``, ...) and end with a ``\n``. The word after ``N100`` must be ``X100``, ``Y100`` and so on. If the user programs ``N100 = 10``, we have to report an error. But if the user programs ``NNVAR100 = 20``, that is a valid assignment statement. — zhhui, Mar 14 '22 at 08:36

score 1 · Answer 1 · answered Mar 11 '22 at 10:07

ANTLR has a strict separation between parser and lexer, and therefore the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
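
To make the effect concrete, here is a minimal Python sketch of a longest-match lexer. It is not ANTLR itself: the four regular expressions are simplified stand-ins for LINENUMEXPR, GCODEEXPR, MCODEEXPR and VARNAME from the question's grammar, and `tokenize` is a hypothetical helper written only for this illustration.

```python
import re

# Simplified stand-ins for four lexer rules from the question's grammar.
# Rule order only matters for ties: the rule listed first wins.
RULES = [
    ("LINENUMEXPR", re.compile(r"N[0-9]+")),
    ("GCODEEXPR", re.compile(r"G[0-9]+(?:\.[0-9]+)*")),
    ("MCODEEXPR", re.compile(r"M[0-9]+")),
    ("VARNAME", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
]


def tokenize(text, rules=RULES):
    """Maximal-munch lexing: at each position, the longest match wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("N100G1M4"))             # one VARNAME token covering everything
print(tokenize("N100G1M4", RULES[:3]))  # N100, G1, M4 once VARNAME is gone
```

With all four rules, `N100G1M4` comes back as a single VARNAME token, because the identifier pattern matches 8 characters while the line-number pattern matches only 4. `N100` on its own is still a LINENUMEXPR, since the two rules tie in length and the earlier one wins.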

How to handle the case?

The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.

Scannerless Parser Generators
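
As a rough illustration of what "no rule that matches N100G1M4 as one token" can look like, here is a small hand-written Python sketch that splits a single NC block directly on its address letters. It is not a parser generator and not the asker's method. The letter set (including `H`, which appears in the sample input but not in the grammar), the names `NC_WORD` and `split_nc_block`, and the restriction to non-nested brackets are all assumptions made for this example.

```python
import re

# One NC word: an address letter followed by either a bracketed expression
# or a plain number. The letter set is an assumption based on the question.
NC_WORD = re.compile(
    r"\s*([NGMXYZABCUVWIJKRFEH])(\[[^\]]*\]|[+-]?[0-9]+(?:\.[0-9]+)?)"
)


def split_nc_block(line):
    """Split one NC block into (address, value) words.

    There is no generic identifier rule here, so 'N100G1' can never be
    swallowed as a single name. Nested brackets are not handled.
    """
    words, pos = [], 0
    line = line.rstrip()
    while pos < len(line):
        m = NC_WORD.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected input at column {pos}: {line[pos:]!r}")
        words.append((m.group(1), m.group(2)))
        pos = m.end()
    return words


print(split_nc_block("N100G1X100.5Z[VAR1+100]M3H3"))
```

Each word also arrives already split into its letter and its value, which is the `X` / `100` split the question asks about. A line such as `N100 = 10` raises an error at the `=`, matching the behaviour described in the comments. Assignment lines and the bracketed expressions would still need their own handling.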

answered Mar 11 '22 at 10:07

Bart Kiers

Thanks a lot for your answer. What I actually want to do is parse the word `(G NUM)[whitespace]+?M[whitespace]+?`. It looks like I cannot use ANTLR. Maybe I should switch to flex && bison, but they are too complex for me. Are there other tools that are easier than flex && yacc? I am hoping for something as easy to use as ANTLR. – zhhui Mar 11 '22 at 10:29
AFAIK, you'll also have this issue with tools like flex/yacc. Sure, you can use ANTLR or similar tools, but then you'd need to resort to (many) embedded actions. If you update your question and explain in detail what it is you're trying to do, perhaps someone can help. The fact that my answer (and this comment) contains more text than your original question is telling about the lack of details in your question. – Bart Kiers Mar 11 '22 at 10:38
Hello. I have updated the question. I hope it's clearer now. – zhhui Mar 12 '22 at 00:58
