ANTLR4 Hexadecimal Parsing

Question

I'm having trouble debugging an ANTLR grammar I'm writing for Game Boy assembly. It mostly works, but for some reason it cannot handle 0x notation for hexadecimal in certain edge cases.

If my input string is "JR 0x10", ANTLR fails with a 'no viable alternative at input' error. As I understand it, this means either that I have no rule to parse the token stream or that '0x' is not being recognized properly. If I use "JR $10" (one of the alternate notations I support) it works perfectly. But '0x' and '$' are expressed in the same rule.

Here is my g4 file:

grammar GBASM;

eval : exp EOF;
exp : exp op | exp sys | op | sys;

sys : include | section | label | data;
op : monad | biad arg | triad arg SEPARATOR arg;

monad : NOP|RLCA|RRCA|STOP|RLA|RRA|DAA|CPL|SCF|CCF|HALT|RETI|DI|EI|RST|RET;

biad : INC|DEC|SUB|AND|XOR|OR|CP|POP|PUSH|RLC|RRC|RL|RR|SLA|SRA|SWAP|SRL|JP|JR;

triad : RET|JR|JP|CALL|LD|LDD|LDI|LDH|ADD|ADC|SBC|BIT|RES|SET;

arg : (register|value|negvalue|flag|offset|jump|memory);

memory : MEMSTART (register|value|jump) MEMEND;

offset : register Plus value | register negvalue;

register : A|B|C|D|E|F|H|L|AF|BC|DE|HL|SP|HLPLUS|HLMINUS;

flag : NZ | NC | Z | C;

data : DB db;
db : string_data | value | string_data SEPARATOR db | value SEPARATOR db;
include : INCLUDE string_data;
section : SECTION string_data SEPARATOR HOME '[' value ']';

string_data: STRINGLITERAL;
jump : LIMSTRING;
label : LIMSTRING ':';

Z : 'Z';

A : 'A';
B : 'B';
C : 'C';
D : 'D';
E : 'E';
F : 'F';
H : 'H';
L : 'L';
AF : 'AF';
BC : 'BC';
DE : 'DE';
HL : 'HL';
SP : 'SP';
NZ : 'NZ';
NC : 'NC';

value : HexInteger | Integer;
negvalue : (Neg Integer) | (Neg HexInteger);


Neg : '-';
Plus : '+';
HexInteger : (HexPrefix HexDigit+) | (HexDigit+ HexPostfix);

Integer : Digit+;

fragment Digit : ('0'..'9');

HLPLUS : 'HL+' | 'HLI';
HLMINUS : 'HL-' | 'HLD';
MEMSTART : '(';
MEMEND : ')';

LD : 'LD' | 'ld';
JR : 'JR' | 'jr';
JP : 'JP' | 'jp';
OR : 'OR' | 'or';
CP : 'CP' | 'cp';
RL : 'RL' | 'rl';
RR : 'RR' | 'rr';
DI : 'DI' | 'di';
EI : 'EI' | 'ei';

DB : 'DB';

LDD : 'LDD' | 'ldd';
LDI : 'LDI' | 'ldi';
ADD: 'ADD' | 'add';
ADC : 'ADC' | 'adc';
SBC : 'SBC' | 'sbc';
BIT : 'BIT' | 'bit';
RES : 'RES' | 'res';
SET : 'SET' | 'set';
RET: 'RET' | 'ret';
INC : 'INC' | 'inc';
DEC : 'DEC' | 'dec';
SUB : 'SUB' | 'sub';
AND : 'AND' | 'and';
XOR : 'XOR' | 'xor';
RLC : 'RLC' | 'rlc';
RRC : 'RRC' | 'rrc';
POP: 'POP' | 'pop';

SLA : 'SLA' | 'sla';
SRA : 'SRA' | 'sra';

SRL : 'SRL' | 'srl';
NOP : 'NOP' | 'nop';
RLA : 'RLA' | 'rla';
RRA : 'RRA' | 'rra';
DAA : 'DAA' | 'daa';
CPL : 'CPL' | 'cpl';
SCF : 'SCF' | 'scf';
CCF : 'CCF' | 'ccf';
LDH : 'LDH' | 'ldh';
RST : 'RST' | 'rst';
CALL : 'CALL' | 'call';

PUSH : 'PUSH' | 'push';

SWAP : 'SWAP' | 'swap';
RLCA : 'RLCA' | 'rlca';
RRCA : 'RRCA' | 'rrca';
STOP : 'STOP 0' | 'STOP' | 'stop 0' | 'stop';
HALT: 'HALT' | 'halt';
RETI: 'RETI' | 'reti';

HOME: 'HOME';
SECTION: 'SECTION';
INCLUDE: 'INCLUDE';

fragment HexPrefix : ('0x' | '$');
fragment HexPostfix : ('h' | 'H');
fragment HexDigit : ('0'..'9'|'a'..'f'|'A'..'F');
STRINGLITERAL : '"' ~["\r\n]* '"';
LIMSTRING : ('_'|'a'..'z'|'A'..'Z'|'0'..'9')+;
SEPARATOR : ',';
WS : (' '|'\t'|'\n'|'\r') ->channel(HIDDEN);
COMMENT : ';' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);

In the failing case it looks like the parse terminates at 'op'; in the passing case it correctly drills down to 'value' and my parser picks up the information. Is there some quirk of ANTLR4 grammars that I'm missing?

I'm generating a C# parser in case that's relevant.

The lexer is probably returning different tokens than you expect. I recommend printing out the token list to see what the lexer finds. That should quickly lead you to the rule that is eating your 0x... part. - Mike Lischke, Nov 30 '15 at 07:55
Yeah, that helped. It turns out I was using a stale grammar. - plasmarobo, Nov 30 '15 at 21:19

score 1 · Answer 1 · answered Nov 30 '15 at 21:19

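It turns out the problem was the order of my hexadecimal lexer rules. When several lexer rules match the same longest text, ANTLR picks the one defined first, and '0x10' can be matched by both HexInteger and LIMSTRING, whereas '$10' can only be matched by HexInteger.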
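
To see how rule order can change the token stream, here is a minimal Python sketch that imitates a lexer's longest-match behavior with regular expressions. It is not ANTLR and not the grammar that actually failed: the regexes only approximate HexInteger, Integer, LIMSTRING and JR from the grammar above, `tokenize` is a hypothetical helper, and the second ordering (LIMSTRING ahead of HexInteger) is an assumed example rather than the stale grammar, which is not shown in the question.

```python
import re

# Regex approximations of four lexer rules from the grammar (assumptions,
# not generated by ANTLR).
JR = ("JR", re.compile(r"JR|jr"))
HEX_INTEGER = ("HexInteger", re.compile(r"(?:0x|\$)[0-9a-fA-F]+|[0-9a-fA-F]+[hH]"))
INTEGER = ("Integer", re.compile(r"[0-9]+"))
LIMSTRING = ("LIMSTRING", re.compile(r"[_a-zA-Z0-9]+"))

# Relative order as posted in the question: HexInteger before LIMSTRING.
POSTED_ORDER = [JR, HEX_INTEGER, INTEGER, LIMSTRING]
# Hypothetical alternative order: LIMSTRING before HexInteger.
SWAPPED_ORDER = [JR, LIMSTRING, HEX_INTEGER, INTEGER]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace is skipped in this sketch
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for source in ("JR 0x10", "JR $10"):
    print(source, "posted: ", tokenize(source, POSTED_ORDER))
    print(source, "swapped:", tokenize(source, SWAPPED_ORDER))
```

Printing the token types this way mirrors the suggestion in the comments: in this sketch '$10' comes out as HexInteger under both orderings, since LIMSTRING cannot match '$', while the type of '0x10' depends on which of the two rules is listed first.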

I didn't see anything change because Visual Studio was looking at an old copy of my grammar (Microsoft's file-path system is somewhat... alternative).

My modified grammar works perfectly.

Thanks!
