I am new to this, so I would appreciate your help. I am trying to parse the Wikipedia dump, and my first step is to map each of its markup rules into an ANTLR grammar. Unfortunately, I have hit my first barrier:

line 1:8 extraneous input ''''' expecting '\'\''

I do not understand what is going on. Any help would be appreciated.

My code:

grammar Test;

options {
    language = Java;
}

parse
    :  term+ EOF
    ;

term 
    :  IDENT
    |  '[[' term ']]'
    |  '\'\'' term '\'\''
    |  '\'\'\'' term '\'\'\''
    ;    

IDENT
    :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*
    ;

Input: '''''Hello World'''''

1 Answer

A lexer rule must always match at least 1 character. Your rule:

IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*;

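can match the empty string, and there are infinitely many empty strings at any position in the input, so the lexer could keep producing tokens without ever consuming a character. Change the * to a +: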

IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
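To see why the `*` version is a problem, here is a small sketch that uses `java.util.regex` as a stand-in for the lexer rule. This is only an analogy, not how ANTLR matches internally; the class name `EmptyMatchDemo` and the helper `matchLength` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {

    // Regex equivalents of the two IDENT variants (an approximation of the lexer rule).
    static final String STAR = "[a-zA-Z0-9=#\" ]*";
    static final String PLUS = "[a-zA-Z0-9=#\" ]+";

    // Returns how many characters the pattern consumes at the start of the input,
    // or -1 if it does not match there at all.
    static int matchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "'''''Hello World'''''";
        // The * variant "succeeds" on the leading quote without consuming anything.
        System.out.println(matchLength(STAR, input)); // prints 0
        // The + variant simply fails here, leaving the quotes to other rules.
        System.out.println(matchLength(PLUS, input)); // prints -1
    }
}
```

A match of length 0 is a token that never advances the input position, which is why the rule has to require at least one character.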

EDIT

Input '''''Hello World'''''

Although you put literal tokens inside parser rules ('\'\'\'', '\'\'', etc.), you must understand that they are not created at the behest of the parser. The lexer follows strict rules to create tokens:

  1. it tries to match as many characters as possible
  2. if two different lexer rules match the same number of characters, the one defined first takes precedence

Let's give your literal tokens a name:

BRACKET_OPEN  : '[[';
BRACKET_CLOSE : ']]';
Q3            : '\'\'\'';
Q2            : '\'\'';
IDENT         :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;

Now, because of rule #1 (match as much as possible), the input '''''Hello World''''' will be tokenized as follows:

  • Q3
  • Q2
  • IDENT
  • Q3 (yes, a Q3!)
  • Q2

But your parser rule term requires the closing quotes to mirror the opening ones: after Q3 Q2 IDENT it will only accept Q2 Q3. So it is correct that your input fails to parse.
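The tokenization can be reproduced with a tiny hand-written scanner. This is a simplified model of the two lexer rules listed earlier (longest match first, then definition order), not ANTLR's actual implementation; the class `MaximalMunchDemo` and its methods are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class MaximalMunchDemo {

    // Literal token rules, in the same order as the named lexer rules.
    private static final String[] NAMES    = {"BRACKET_OPEN", "BRACKET_CLOSE", "Q3", "Q2"};
    private static final String[] LITERALS = {"[[", "]]", "'''", "''"};

    // Character set accepted by the IDENT rule.
    private static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '=' || c == '#' || c == '"' || c == ' ';
    }

    // Returns the token names produced for the input, left to right.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            // Rule 1: keep the longest match. Rule 2: on a tie, the earlier rule
            // wins, which the strict '>' comparison guarantees.
            for (int i = 0; i < LITERALS.length; i++) {
                if (input.startsWith(LITERALS[i], pos) && LITERALS[i].length() > bestLen) {
                    bestName = NAMES[i];
                    bestLen = LITERALS[i].length();
                }
            }
            // IDENT is defined last: one or more identifier characters.
            int end = pos;
            while (end < input.length() && isIdentChar(input.charAt(end))) {
                end++;
            }
            if (end - pos > bestLen) {
                bestName = "IDENT";
                bestLen = end - pos;
            }
            if (bestLen == 0) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestName);
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Q3, Q2, IDENT, Q3, Q2]
        System.out.println(tokenize("'''''Hello World'''''"));
    }
}
```

The scanner never looks at what the parser expects, so the five closing quotes are split as Q3 Q2 exactly like the five opening ones.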

Also, I recommend you not use the interpreter: it's rather buggy. The debugger works like a charm though!

Bart Kiers
  • I missed that, thanks. I changed the rule to use plus (+), but unfortunately that does not solve the issue :( I also changed parse : term* EOF // now I can enter no characters at all – user3216500 Jan 20 '14 at 20:32
  • @user3216500, then I need more info. Did you post your entire grammar? Can you post the input you're parsing? How are you testing this: in ANTLRWorks' interpreter or its debugger? – Bart Kiers Jan 20 '14 at 21:10
  • I added the input '''''Hello World'''''. I am using the interpreter, and yes, it is the entire grammar :) – user3216500 Jan 20 '14 at 21:17
  • @user3216500 The interpreter feature in ANTLRWorks 1.x is frequently wildly inaccurate. You should not trust its results to be at all representative of what will actually happen when you run your grammar. – Sam Harwell Jan 20 '14 at 21:30
  • @BartKiers, thanks for your feedback, it was very useful. So what I gather is that ANTLR will not handle my case and I will need to find another solution. Or do you know of a workaround or another approach? Running the Java code on the same input with your changes, I get: line 1:8 extraneous input ''''' expecting Q2 Hello World Ok! ---- And in the debugger I do see what you describe: it tries to match as much as possible :( – user3216500 Jan 20 '14 at 21:56
  • @user3216500, you might want to look into PEGs instead (lexer-less parsing). I've heard good things about Parboiled: https://github.com/sirthias/parboiled/wiki – Bart Kiers Jan 20 '14 at 22:20