
EDIT: I've been asked if I can provide the full grammar. I cannot and here is the reason why:

I cannot provide my full grammar code because it is homework and I am not allowed to disclose my solution; I will, sadly, understand if my question cannot be answered because of this. I am just hoping this is a simple thing that I am failing to understand from the documentation, and that this will be enough for someone who knows ANTLR4 to know the answer.

This was originally posted further down in the question, but to prevent frustration for potential helpers I have now promoted it to the top of the post. Disclaimer: this is homework related.

I am trying to tokenize a piece of text for homework, and almost everything works as expected, except the following:

TIME                    : '<time>';

This rule used to be in my grammar. When tokenizing the piece of text, I would not see a TIME token; instead I would see a '<time>' token (which I guess ANTLR created for me somehow). But when I moved the string literal to a fragment rule and made the TIME rule point to it, like so:

fragment TIME_TAG       : '<time>';
.
.
.
TIME                    : TIME_TAG;

Then I see the TIME token as expected. I've been searching the internet for several hours and couldn't find an answer.

Another thing that happens involves the ATHLETE rule, which is defined as:

ATHLETE                 : WHITESPACE* '<athlete>' WHITESPACE*;

It is also recognized properly and I see the ATHLETE token, but it wasn't recognized when I didn't allow the WHITESPACE* before and after the tag string.

Here is my piece of text:

World Record World Record
[1] <time> 9.86 <athlete> "Carl Lewis" <country> "United
States" <date> 25 August 1991
[2] <time> 9.69 <athlete> "Tyson Gay" <country> "United
States" <date> 20 September 2009
[3] <time> 9.82 <athlete> "Donovan Baily" <country>
"Canada" <date> 27 July 1996
[4] <time> 9.58
 <athlete> "Usain Bolt"
 <country> "Jamaica" <date> 16 August 2009

[5] <time> 9.79 <athlete> "Maurice Greene" <country>
"United State" <date> 16 June 1999

My task is simply to tokenize it. I am not given the token definitions; I am supposed to decide those myself. I think '<sometag>' tags are pretty obvious, as are '"'-wrapped strings, numbers, dates, and square-bracket-surrounded enumerations.

Thanks in advance for any help or any useful knowledge.

  • As I clearly stated in my question, no. This is a homework assignment and I am forbidden to disclose my solution. Sorry. – יחזקאל הירשהורן Oct 26 '21 at 06:57
  • So one problem I see with your grammar is related to the need to include WHITESPACE in the ATHLETE rule. You shouldn't have to do that. The lexer should shunt whitespace to the HIDDEN channel thus `WHITESPACE: [ \t\n\r]+ -> channel(HIDDEN);`. If you don't want to see it at all as a HIDDEN token, then use `skip` instead of `channel(HIDDEN)`. The problem statement should say explicitly how whitespace is handled. – kaby76 Oct 26 '21 at 07:39
  • "As I clearly stated in my question, no. This is a homework assignment and I am forbidden to disclose my solution." fine, then include as much of it so that the original problem still occurs. It's a bit hard to help you if others cannot reproduce what you're describing. – Bart Kiers Oct 26 '21 at 07:45
  • @kaby76 true, I didn't want to add whitespace to the tags, and I didn't. This was just an attempt to find out what was causing the tags not to be recognized. My solution is to use the fragments, but I don't understand why it solved the problem or why the problem exists in the first place. – יחזקאל הירשהורן Oct 26 '21 at 10:49
  • @BartKiers I am not asking for your help, nor did I say you were unwilling to give it. I understand it may be difficult to help, and I accept it if no one will be able to. You refuse to believe me, but this really is all I can post. The entire thing is not much more than this; any more would be too much disclosure. Thus, I say again, if it is not possible to help me I accept that. But no, you actually get upset and think I want to hide my code from you for some weird reason. No, I don't. I wish I could post more. I do realize this can be crucial if the problem is not trivial. I was hoping it is. – יחזקאל הירשהורן Oct 26 '21 at 12:19
  • Adding `WHITESPACE*` around the string literal `'<athlete>'` to get the lexer to recognize that input likely means you have defined WHITESPACE wrong. Do you have a rule for WHITESPACE? What is it exactly? Otherwise we are stuck for lack of information. This should go to SO Chat, BTW. `fragment` rules are rules that don't produce a token by themselves. – kaby76 Oct 26 '21 at 12:22
  • @kaby76 WHITESPACE really isn't the problem here. It did not help me to recognize '<athlete>' *properly*; it actually recognized something like " <athlete>" (perhaps even with some space after the tag). I didn't mention it because I did solve the problem eventually, without understanding why or how. Thanks for your efforts to help me, though it is difficult with such limited code - I know this. – יחזקאל הירשהורן Oct 26 '21 at 12:31

2 Answers

(This will be something of a challenge, without just doing your homework, but maybe a few comments will set you on your way)

The TIME : '<time>'; rule should work just fine. ANTLR only creates implicit tokens for you from string literals used in parser rules. (Parser rules begin with lowercase letters and Lexer rules with uppercase letters, so that wouldn't have been the case with this exact example; perhaps you had a rule name that began with a lowercase letter?)
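
For illustration, here is a minimal combined-grammar sketch of that difference (the grammar and rule names here are hypothetical, not taken from your grammar):

grammar Implicit;

// '<date>' is a bare literal in a parser rule with no matching lexer rule,
// so ANTLR silently creates an implicit token type for it.
dateEntry : '<date>' Number;

// '<time>' does have a matching lexer rule, so this literal simply refers
// to the TIME token.
timeEntry : '<time>' Number;

TIME   : '<time>';
Number : [0-9]+ ('.' [0-9]+)?;
WS     : [ \t\r\n]+ -> skip;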

Note: If you dump your tokens, you'll see the TIME token represented like so:

[@3,5:10='<time>',<'<time>'>,2:4]

This means that ANTLR has recognized it as the TIME token. (I suspect this may be the source of the confusion; because the rule body is a single string literal, ANTLR displays the token type as that literal rather than as the name TIME.)

As @kaby76 mentions, we usually skip whitespace or throw it into a hidden channel, as we don't want to be explicit in parser rules about everywhere we allow whitespace. Either of those options causes whitespace tokens to be ignored by the parser. A very common whitespace rule is:

WS: [ \t\r\n]+;

Since you're only tokenizing, you won't need to worry about parser rules.

Adding this Lexer rule will tokenize whitespace into separate tokens for you so you don't need to account for it in rules like ATHLETE.
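
As a small sketch of what that buys you (assuming the rest of your lexer stays as it is), ATHLETE no longer needs to mention whitespace at all:

ATHLETE : '<athlete>';   // no WHITESPACE* needed any more
WS      : [ \t\r\n]+;    // or add -> skip / -> channel(HIDDEN) to hide it from the parser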

You'll need to work out Lexer rules for your content, but perhaps this will help you move forward.

Mike Cargal
  • Thanks! There are a couple of new things here for me. First: I double checked now, no lower-case rules in my grammar. I have some fragment rules (all capitals) and some all-capital lexer rules. And I have the "lexer grammar" declaration up top. I am printing the tokens based on token type (taken from the lexer) using the q2.tokens file. Is it possible that my TIME rule ended up having a '<time>' entry in that file? – יחזקאל הירשהורן Oct 26 '21 at 13:51
  • Another question: I used this pattern to try to skip WS: SKIPPED : [ \t\r\n] -> skip; Now, I know what the + does, it means one or more of the previous. What is \f though? Google calls it form feed, I don't know what that is. Perhaps this broke my skipping and thus my TIME token. I will try this. Thanks. Also, is the `. in the end meaningful or is it a typo? – יחזקאל הירשהורן Oct 26 '21 at 13:55
  • The \f character was the culprit after all. Thank you very much! you did the improbable, perhaps the impossible. – יחזקאל הירשהורן Oct 26 '21 at 13:58
  • If you look in your q2.tokens file, you'll find more than one entry for your number. In the file I got `TIME=1` on the first line and `'<time>'=1` further down; both refer to the same token type, so use the first (named) one. – Mike Cargal Oct 26 '21 at 13:59
  • `\f` was a typo. I've corrected it. It means form feed, but it's unlikely you'll encounter it. I meant `\n` – Mike Cargal Oct 26 '21 at 14:00
  • Ok, I've removed the \f and it still works. I guess something else is different from when I posted this question. I have no idea what, and I probably never will. I did learn, so it's not in vain... – יחזקאל הירשהורן Oct 26 '21 at 15:41
  • You'll probably want to change the `\f` to `\n` (unless you're handling linefeeds in another Lexer rule, or have no linefeeds???). – Mike Cargal Oct 26 '21 at 15:47
  • I had an \n before. Just didn't have a \f. But I've removed it again. I'm back with the usual whitespace characters. – יחזקאל הירשהורן Oct 26 '21 at 15:51
  • Ok, now I see you said having two entries with the same type in the q2.tokens file is okay and that I should use the first one. This basically means my "problem" is not a problem at all, just a misunderstanding of the format of the .tokens file. Thanks, this could be my solution after all. – יחזקאל הירשהורן Oct 26 '21 at 15:54
  • Back in the days when program output was printed, rather than just being ephemerally painted on a screen, "form feed" (`\f`) was used to tell the printer to start a new page. So it is white space, just like `\n`, which starts a new line. There's also `\v`, which is officially "vertical tab" (the vertical analog of horizontal tab), but which was sometimes interpreted as "go up a line" (the vertical analog of backspace.) – rici Oct 27 '21 at 01:44

The following implementation is a split lexer/parser grammar that "tokenizes" your input file. You can combine the two if you like. I generally split my grammars because of constraints with Antlr lexer grammars, such as when you want to "superClass" the lexer.
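
For example, a superClass declaration in a lexer grammar looks roughly like this (a minimal sketch; FooLexerBase is a hypothetical hand-written base class, not something used in the grammar below):

lexer grammar FooLexer;
options { superClass = FooLexerBase; }   // generated lexer extends FooLexerBase instead of Lexer
// ... lexer rules as usual ...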

But, without a clear problem statement, this implementation may not tokenize the input as required. All software must begin with requirements. If none were given in the assignment, then I would state exactly which token types are to be recognized.

In most languages, whitespace is not included in the set of token types consumed by a parser. Thus, I implemented it with "-> skip", which tells the lexer to not produce a token for the recognized input.

It's also not clear whether input such as "[1]" is to be tokenized as one token or separately. In the following implementation, I produce separate tokens for '[', '1', and ']'.
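
If a single token per index were wanted instead, a rule along these lines would do it (a hypothetical alternative, not part of the grammar below); because the lexer prefers the longest match, it would win over the separate OB, CB and Number rules for input like "[1]":

Enumeration : '[' [0-9]+ ']';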

The use of "fragment" rules is likely unnecessary so I don't include any use of the feature. "fragment" rules cannot be used to produce a token in itself, and the symbol cannot be used in a parser rule. They are useful for reuse of a common RHS. You can read more about it here.

FooLexer.g4:

lexer grammar FooLexer;
Athlete : '<athlete>';
Date : '<date>';
Time : '<time>';
Country : '<country>';
StringLiteral : '"' .*? '"';
Stray : [a-zA-Z]+;
OB : '[';
CB : ']';
Number : [0-9.]+;
Ws : [ \t\r\n]+ -> skip;

FooParser.g4:

parser grammar FooParser;
options { tokenVocab = FooLexer; }
start: .* EOF;

Tokens:

$ trparse input.txt | trtokens
Time to parse: 00:00:00.0574154
# tokens per sec = 1219.1850966813781
[@0,0:4='World',<6>,1:0]
[@1,6:11='Record',<6>,1:6]
[@2,13:17='World',<6>,1:13]
[@3,19:24='Record',<6>,1:19]
[@4,27:27='[',<7>,2:0]
[@5,28:28='1',<9>,2:1]
[@6,29:29=']',<8>,2:2]
[@7,31:36='<time>',<3>,2:4]
[@8,38:41='9.86',<9>,2:11]
[@9,43:51='<athlete>',<1>,2:16]
[@10,53:64='"Carl Lewis"',<5>,2:26]
[@11,66:74='<country>',<4>,2:39]
[@12,76:91='"United\r\nStates"',<5>,2:49]
[@13,93:98='<date>',<2>,3:8]
[@14,100:101='25',<9>,3:15]
[@15,103:108='August',<6>,3:18]
[@16,110:113='1991',<9>,3:25]
[@17,116:116='[',<7>,4:0]
[@18,117:117='2',<9>,4:1]
[@19,118:118=']',<8>,4:2]
[@20,120:125='<time>',<3>,4:4]
[@21,127:130='9.69',<9>,4:11]
[@22,132:140='<athlete>',<1>,4:16]
[@23,142:152='"Tyson Gay"',<5>,4:26]
[@24,154:162='<country>',<4>,4:38]
[@25,164:179='"United\r\nStates"',<5>,4:48]
[@26,181:186='<date>',<2>,5:8]
[@27,188:189='20',<9>,5:15]
[@28,191:199='September',<6>,5:18]
[@29,201:204='2009',<9>,5:28]
[@30,207:207='[',<7>,6:0]
[@31,208:208='3',<9>,6:1]
[@32,209:209=']',<8>,6:2]
[@33,211:216='<time>',<3>,6:4]
[@34,218:221='9.82',<9>,6:11]
[@35,223:231='<athlete>',<1>,6:16]
[@36,233:247='"Donovan Baily"',<5>,6:26]
[@37,249:257='<country>',<4>,6:42]
[@38,260:267='"Canada"',<5>,7:0]
[@39,269:274='<date>',<2>,7:9]
[@40,276:277='27',<9>,7:16]
[@41,279:282='July',<6>,7:19]
[@42,284:287='1996',<9>,7:24]
[@43,290:290='[',<7>,8:0]
[@44,291:291='4',<9>,8:1]
[@45,292:292=']',<8>,8:2]
[@46,294:299='<time>',<3>,8:4]
[@47,301:304='9.58',<9>,8:11]
[@48,308:316='<athlete>',<1>,9:1]
[@49,318:329='"Usain Bolt"',<5>,9:11]
[@50,333:341='<country>',<4>,10:1]
[@51,343:351='"Jamaica"',<5>,10:11]
[@52,353:358='<date>',<2>,10:21]
[@53,360:361='16',<9>,10:28]
[@54,363:368='August',<6>,10:31]
[@55,370:373='2009',<9>,10:38]
[@56,378:378='[',<7>,12:0]
[@57,379:379='5',<9>,12:1]
[@58,380:380=']',<8>,12:2]
[@59,382:387='<time>',<3>,12:4]
[@60,389:392='9.79',<9>,12:11]
[@61,394:402='<athlete>',<1>,12:16]
[@62,404:419='"Maurice Greene"',<5>,12:26]
[@63,421:429='<country>',<4>,12:43]
[@64,432:445='"United State"',<5>,13:0]
[@65,447:452='<date>',<2>,13:15]
[@66,454:455='16',<9>,13:22]
[@67,457:460='June',<6>,13:25]
[@68,462:465='1999',<9>,13:30]
[@69,466:465='',<-1>,13:34]
kaby76
  • Thanks. I wish I could upvote your answer. Indeed no boss would give a task without requirements, but this is a compilation course, so I guess the lecturer wants us to experience having to actually design the language - starting with choosing the tokens. You mentioned "superclassing" the lexer; where can I read about it? I was looking for that. – יחזקאל הירשהורן Oct 26 '21 at 15:47
  • @יחזקאלהירשהורן https://github.com/antlr/antlr4/blob/master/doc/options.md#grammar-options . There are many examples of Antlr in [grammar-v4](https://github.com/antlr/grammars-v4), which contain a wealth of information. An example superClass of the lexer is [here](https://github.com/antlr/grammars-v4/tree/master/java/java9) (unoptimized i.e. slow version of Java). – kaby76 Oct 26 '21 at 16:56