Not Able to Recognize Strings and Characters in ANTLR

Question

My ANTLR lexer should be able to recognize strings, characters, hexadecimal numbers, etc.

However, in my code, when I test it like this:

grun A1_lexer tokens -tokens test.txt

With test.txt containing a simple string, such as "pineapple", the lexer is unable to recognize the tokens.

In my lexer, I define the following helper tokens:

fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9'] ;
fragment Digit: ['0'-'9'] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
fragment Hex_digit: Digit | [a-fA-F] ;

And I define the following tokens:

Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Id: Alpha Alpha_num* ;

I run it like this:

grun A1_lexer tokens -tokens test.txt

And it outputs this:

line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[@0,5:5='a',<Id>,1:5]
[@1,12:11='<EOF>',<EOF>,2:0]

I am really wondering what the problem is and how I could fix it. Thanks.

UPDATE 1:

fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;

I have updated the code and removed the unnecessary single quotes from my Char and Digit character classes. However, I get the same output as before.

UPDATE 2:

Even after making the suggested changes, I still get the same errors. I thought the problem might be that I was not recompiling, but I am. These are the steps I take to recompile:

antlr4 A1_lexer.g4 
javac A1_lexer*.java
chmod a+x build.sh
./build.sh
grun A1_lexer tokens -tokens test.txt

With my build.sh file looking like this:

#!/bin/bash
FILE="A1_lexer"
ANTLR=$(echo $CLASSPATH | tr ':' '\n' | grep -m 1 "antlr-4.7.1-complete.jar")
java -jar $ANTLR $FILE.g4
javac $FILE*.java

Even after recompiling, my ANTLR lexer is still unable to recognize the tokens.

My code now looks like this:

fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;

Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;

UPDATE 3:

Grammar:

program
:'class Program {'field_decl* method_decl*'}'

field_decl
: type (id | id'['int_literal']') ( ',' id | id'['int_literal']')*';'
| type id '=' literal ';'

method_decl
: (type | 'void') id'('( (type id) ( ','type id)*)? ')'block

block
: '{'var_decl* statement*'}'

var_decl
: type id(','id)* ';'

type
: 'int'
| 'boolean'

statement
: location assign_op expr';'
| method_call';'
| 'if ('expr')' block ('else' block  )?
| 'switch' expr '{'('case' literal ':' statement*)+'}'
| 'while (' expr ')' statement
| 'return' ( expr )? ';'
| 'break ;'
| 'continue ;'
| block

assign_op
: '='
| '+='
| '-='

method_call
: method_name '(' (expr ( ',' expr )*)? ')'
| 'callout (' string_literal ( ',' callout_arg )* ')'

method_name
: id

location
: id
| id '[' expr ']'

expr
: location
| method_call
| literal
| expr bin_op expr
| '-' expr
| '!' expr
| '(' expr ')'

callout_arg
: expr
| string_literal

bin_op
: arith_op
| rel_op
| eq_op
| cond_op

arith_op
: '+'
| '-'
| '*'
| '/'
| '%'

rel_op
: '<'
| '>'
| '<='
| '>='

eq_op
: '=='
| '!='

cond_op
: '&&'
| '||'

literal
: int_literal
| char_literal
| bool_literal

id
: alpha alpha_num*

alpha
: ['a'-'z''A'-'Z''_']

alpha_num
: alpha
| digit 

digit
: ['0'-'9']

hex_digit
: digit
| ['a'-'f''A'-'F']

int_literal
: decimal_literal
| hex_literal

decimal_literal
: digit+

hex_literal
: '0x' hex_digit+

bool_literal
: 'true'
| 'false'

char_literal
: '‘'char'’'

string_literal
: '“'char*'”'

test.txt :

"pineapple"

A1_lexer:

fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;

Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;

What I Write in Terminal:

grun A1_lexer tokens -tokens test.txt

Output in Terminal:

line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[@0,5:5='a',<Id>,1:5]
[@1,12:11='<EOF>',<EOF>,2:0]

I am really not sure why this is happening.

sepp2k · Answer 1 · 2018-09-29T21:22:46.950

1

fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9']

['a'-'z'] doesn't mean "a to z". It means "a single quote, or a, or the range from a single quote to a single quote, or z, or a single quote", which simplifies to "a single quote, a, or z". What you want is just [a-z] without the quotes. The same applies to the other character classes, except that they also contain spaces, so ['A' - 'Z'] is "single quote, A, single quote, the range from space to space, single quote, Z, or single quote", and so on. You also don't need to "or" character classes together; you can write everything in one character class, like this: [a-zA-Z0-9] (as you already did for the Alpha rule).
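
As a rough illustration, Python's `re` module reads the contents of this particular bracketed class the same way. This is an analogy, not ANTLR itself: quotes inside the brackets are ordinary set members, and `'-'` is a range from a quote to a quote. The sketch below lists which characters of the test input each class accepts.

```python
import re

# Quotes inside [...] are ordinary members of the set, not delimiters.
# The set below is: ' , a , the range '-' (just the quote itself), z , '
quoted = re.compile(r"['a'-'z']")
# The intended class: the range a..z
plain = re.compile(r"[a-z]")

sample = '"pineapple"'

accepted_by_quoted = [c for c in sample if quoted.fullmatch(c)]
accepted_by_plain = [c for c in sample if plain.fullmatch(c)]

print(accepted_by_quoted)          # ['a']
print("".join(accepted_by_plain))  # pineapple
```

Only `a` survives the quoted class, which is consistent with `a` being the only character of "pineapple" that was not reported as a token recognition error.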

The same applies to the Digit rule as well.

Note that it's a bit unusual to only allow these specific characters inside quotes. Usually you'd allow everything that isn't an unescaped quote or an invalid escape sequence. But of course that all depends on the language you're parsing.
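
A minimal sketch of that more usual definition, written as a Python regular expression rather than in ANTLR syntax. The escape rule used here (a backslash followed by any single character other than a newline) is an assumption chosen for illustration; the language in the question may define escapes differently or not at all.

```python
import re

# A double-quoted string: opening quote, then any mix of
#   - an escape: backslash followed by one character (assumed escape rule)
#   - any character that is not a quote, a backslash, or a newline
# and finally the closing quote.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\\n])*"')

def is_string_literal(text: str) -> bool:
    """Return True if the whole text is exactly one string literal."""
    return STRING_LITERAL.fullmatch(text) is not None

print(is_string_literal('"pineapple"'))       # True
print(is_string_literal('"pine apple, 2!"'))  # True
print(is_string_literal(r'"say \"hi\""'))     # True
print(is_string_literal('"unterminated'))     # False
print(is_string_literal('"bad " quote"'))     # False
```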

edited Sep 29 '18 at 21:22

answered Sep 29 '18 at 21:08

sepp2k


Unfortunately, even when I make those changes, the errors remain. – SeePlusPlus Sep 29 '18 at 21:13
My lexer is unable to recognize " as a token and it can't recognize the characters inside either. – SeePlusPlus Sep 29 '18 at 21:13
Also I am not sure why it considers the letter 'a' as an identifier. – SeePlusPlus Sep 29 '18 at 21:14
@SeePlusPlus Can you post your updated code? It considers `a` to be an identifier because `a` is one of the few characters actually included in "a single quote, a, or z". So it discards everything else as unknown and is left with a single character that matches the Id rule. – sepp2k Sep 29 '18 at 21:19
@SeePlusPlus Do you really get the same errors? Because those should be gone now. What I get when I run your grammar is a warning that `\"` is an invalid escape sequence (remove the `\`), and then an infinite loop at runtime. After removing the `\`, it works fine. If you're still getting the old errors, maybe you forgot to re-run ANTLR and/or to recompile? If that isn't it, please post your entire current grammar, your input file and the exact current error messages, so I'm running the same code as you when I'm testing it. – sepp2k Sep 29 '18 at 21:36
I'm sorry, I'm not sure I understand what you mean when you say you are removing something? What exactly are you removing and from where? – SeePlusPlus Sep 29 '18 at 21:44
@SeePlusPlus Sorry, markdown mix up. I meant "removing the backslash". I.e. change `'\"'` to `'"'`. – sepp2k Sep 29 '18 at 21:46
I have edited my question, please see the changes above. – SeePlusPlus Sep 29 '18 at 21:54
@SeePlusPlus What does your input file look like? And what are your error messages? Are they really exactly the same still? If I run your exact code (plus a `grammar`-line and an empty parser rule that does nothing) on the input `"lala"'l'lu`, I get the tokens `"lala"`, `'l'` and `lu` - no errors. – sepp2k Sep 29 '18 at 22:08
Please see the changes above. – SeePlusPlus Sep 29 '18 at 22:16
Note: I have posted the grammar that we must base our lexer on. I have more definitions in my lexer, but the ones that I posted are the only ones that I have problems with. So, I don't really think it's necessary to post all of my lexer definitions. – SeePlusPlus Sep 29 '18 at 22:21
@SeePlusPlus At this point, the only other thing I can think of is that you have two copies of your code and you're editing one, but compiling the other. Given the code as you have it now, you should not be getting those errors anymore. With your exact grammar and input, I don't get any errors. – sepp2k Sep 29 '18 at 22:27
Okay, thank you very much for your help. Also, one more question. I have this token definition: `Hex_literal: '0x' Hex_digit+ ;` . But when I test a hexadecimal number, it doesn't seem to be recognized. Do you know why this might be occurring? – SeePlusPlus Sep 29 '18 at 22:33
And just to confirm, you were correct. I was editing one A1_lexer, while compiling another. – SeePlusPlus Sep 29 '18 at 22:33
@SeePlusPlus You should post that as a separate question including the input you're trying and the error messages (or wrong output) you're getting. From what I can tell what you have should work fine (unless you have some other rule that conflicts with it perhaps). – sepp2k Sep 29 '18 at 22:39
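
The stale-build situation uncovered in the comments (editing one copy of the grammar while compiling another, or forgetting to re-run ANTLR) can be checked for before running grun. The sketch below is only one possible approach: it prints the absolute path of the grammar it found and flags generated .java files that are older than it. The function names are hypothetical, and the `A1_lexer*.java` pattern is taken from the javac command in the question.

```python
from pathlib import Path

def generated_is_stale(grammar: Path, generated: Path) -> bool:
    """Return True if `generated` is missing or older than `grammar`."""
    if not generated.exists():
        return True
    return generated.stat().st_mtime < grammar.stat().st_mtime

def report(directory: Path, stem: str = "A1_lexer") -> list[str]:
    """Describe the grammar and each generated .java file in `directory`."""
    grammar = directory / f"{stem}.g4"
    if not grammar.exists():
        return [f"no grammar at {grammar.resolve()}"]
    # Showing the resolved path makes it obvious which copy is being built.
    lines = [f"grammar: {grammar.resolve()}"]
    generated_files = sorted(directory.glob(f"{stem}*.java"))
    if not generated_files:
        lines.append("no generated .java files; run ANTLR first")
    for java_file in generated_files:
        state = "STALE" if generated_is_stale(grammar, java_file) else "ok"
        lines.append(f"{state}: {java_file.name}")
    return lines

if __name__ == "__main__":
    print("\n".join(report(Path("."))))
```

Run from the directory where grun is invoked; if the printed grammar path is not the file open in the editor, two copies are in play.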
