ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Question

Is there any way to express this in ANTLR4:

Any string as long as it doesn't contain the asterisk immediately followed by a forward slash?

This doesn't work: (~'*/')* as ANTLR throws this error: multi-character literals are not allowed in lexer sets: '*/'

This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.

Can you provide a little more details on what you are trying to achieve ? Must the string-without-*/ absolutely be recognized as a single lexer token ? — Marc Q., Apr 16 '15 at 10:17
Hi Marc. Yes, a single lexer token: the lexer rule should return a single string as long as it doesn't contain */ — Roger Costello, Apr 16 '15 at 10:41

Radosław Kotkiewicz · Answer 1 · 2015-08-28T09:21:55.227

6

I had a similar problem; my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*
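To sanity-check that expression outside ANTLR, here is a minimal sketch that transcribes it into java.util.regex. The class name StarSlashFree and the method name matches are made up for illustration, and the regex is only an approximation of the lexer rule, not ANTLR's own matching engine.

```java
import java.util.regex.Pattern;

class StarSlashFree {
    // Hypothetical transcription of ( ~'*' | ( '*'+ ~[/*] ) )* '*'*
    // [^*]        any character except '*'
    // \*+[^/*]    a run of '*' followed by something that is neither '/' nor '*'
    // \**         optional trailing run of '*'
    private static final Pattern NO_STAR_SLASH =
        Pattern.compile("(?:[^*]|\\*+[^/*])*\\**");

    // True when the whole string matches, i.e. it contains no "*/".
    static boolean matches(String s) {
        return NO_STAR_SLASH.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a*b/c", "a**", "a*/b", "a**/b"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

Lone '*' and '/' characters are accepted, as are trailing stars; only a '*' directly followed by '/' makes the match fail.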

edited Aug 28 '15 at 09:21

answered Aug 28 '15 at 09:11

Radosław Kotkiewicz


If you've coded more than one lexer for languages with /*....*/ comments, you should already know this trick. – Ira Baxter Jun 25 '16 at 11:04

score 2 · Accepted Answer · answered Apr 16 '15 at 11:26

The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.

The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.

grammar myTest;

@header
{
    import java.util.*;
}

@parser::members
{
    // Despite its name, this returns true when the text is clean,
    // i.e. when it does NOT contain "*/", so the predicate passes.
    boolean hasEvilCharacters(String input)
    {
        return !input.contains("*/");
    }
}

// Mimics a very simple sentence, such as: 
//   I am clean.
//   I have evil char*/acters.
myTest
    : { hasEvilCharacters(_input.LT(1).getText()) }? String 
      (Space { hasEvilCharacters(_input.LT(1).getText()) }? String)* 
      Period EOF
    ;

String
    : ('A'..'Z' | 'a'..'z')+      
    ;

Space
    : ' '
    ;

Period
    : '.'
    ;

Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.

Thanks James! Wow, that is a huge amount of work to solve such a simple problem. — Roger Costello, Apr 16 '15 at 13:38
Can somebody explain how this works? I don't see where it captures a token that contains a "*". (I — Ira Baxter, Apr 17 '15 at 07:33
Look in the parse method `hasEvilCharacters()`, which is called by the semantic predicate in the `myTest` parser rule. For more info, read chapter 10 in The Definitive ANTLR 4 Reference. — james.garriss, Apr 17 '15 at 11:03

CoronA · Answer 3 · 2015-04-17T07:02:04.953

1

If the disallowed sequences are few, there is a solution without parser/lexer actions:

grammar NotParser;

program
    : (starslash | notstarslash)+
    ; 

notstarslash
    : NOT_STAR_SLASH
    ;

starslash
    : STAR_SLASH
    ;

STAR_SLASH
    : '*'+ '/'
    ;

NOT_STAR_SLASH
    : (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
    ;

fragment F_NOT_STAR_SLASH
    : ~('*'|'/')
    ;

fragment F_STAR_NOT_SLASH
    : '*'+ ~('*'|'/')
    | '*'+ EOF
    | '/'
    ;

The idea is to compose the token from

all characters that are neither '*' nor '/'
all runs of '*' that are not followed by '/', plus any single '/'

There are some rules that deal with special situations (multiple '*' followed by '/', or trailing '*').
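The way the two lexer rules carve up an input can be approximated in plain Java. The sketch below is an assumption-laden imitation using java.util.regex, not ANTLR itself: the class NotParserSketch, the method tokenize, and the "KIND:text" output format are all invented for illustration, and ANTLR's longest-match rule is not reproduced in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotParserSketch {
    // starSlash     ~ STAR_SLASH:     '*'+ '/'
    // notStarSlash  ~ NOT_STAR_SLASH: one or more of
    //                 a character that is neither '*' nor '/',
    //                 a run of '*' followed by such a character,
    //                 a run of '*' at the very end of the input,
    //                 a single '/'
    private static final Pattern TOKEN = Pattern.compile(
        "(?<starSlash>\\*+/)|(?<notStarSlash>(?:[^*/]|\\*+[^*/]|\\*+\\z|/)+)");

    // Splits the input into labelled chunks, e.g. "STAR_SLASH:*/".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String kind = m.group("starSlash") != null ? "STAR_SLASH" : "NOT_STAR_SLASH";
            tokens.add(kind + ":" + m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab*/cd"));
        System.out.println(tokenize("a**b/c*"));
    }
}
```

In this approximation, "ab*/cd" splits into three chunks around the "*/", while "a**b/c*" stays a single NOT_STAR_SLASH chunk because none of its stars is directly followed by '/'.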

edited Apr 17 '15 at 07:02

answered Apr 17 '15 at 03:07

CoronA


... what happens with ANTLR if the last character in a file is an "*"? What does the ~'/' test do or accept? – Ira Baxter Apr 17 '15 at 06:31
Should do now with the special cases (trailing '*', single '/', multiple '*'). Maybe the parse tree does not match the expectations ... but that is hard to tune without knowing the application. – CoronA Apr 17 '15 at 07:07
The complexity and non-scalability of your answer and mine indicate a weakness--or is it an opportunity?--in ANTLR 4. – james.garriss Apr 17 '15 at 11:07
