
Specifically, I am trying to implement a RegExp parser in ANTLR.

Here are the relevant parts of my grammar:

grammar JavaScriptRegExp;
options {
    language = 'CSharp3';
}

tokens {
    /* snip */
    QUESTION = '?';
    STAR = '*';
    PLUS = '+';
    L_CURLY = '{';
    R_CURLY = '}';
    COMMA = ',';
}

/* snip */

quantifier returns [Quantifier value]
    :   q=quantifierPrefix QUESTION?
        {
            var quant = $q.value;
            quant.Eager = $QUESTION == null;
            return quant;
        }
    ;

quantifierPrefix returns [Quantifier value]
    :   STAR { return new Quantifier { Min = 0 }; }
    |   PLUS { return new Quantifier { Min = 1 }; }
    |   QUESTION { return new Quantifier { Min = 0, Max = 1 }; }
    |   L_CURLY min=DEC_DIGITS (COMMA max=DEC_DIGITS?)? R_CURLY
        {
            var minValue = int.Parse($min.Text);
            if ($COMMA == null)
            {
                return new Quantifier { Min = minValue, Max = minValue };
            }
            else if ($max == null)
            {
                return new Quantifier { Min = minValue, Max = null };
            }
            else
            {
                var maxValue = int.Parse($max.Text);
                return new Quantifier { Min = minValue, Max = maxValue };
            }
        }
    ;

DEC_DIGITS
    :   ('0'..'9')+
    ;

/* snip */

CHAR
    :   ~('^' | '$' | '\\' | '.' | '*' | '+' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|')
    ;

Now, INSIDE of the curly braces, I would like to tokenize ',' as COMMA, but OUTSIDE, I would like to tokenize it as CHAR.

Is this possible?

This is not the only case where this is happening. I will have many other instances where this is a problem (decimal digits, hyphens in character classes, etc.)

EDIT:

I now realize that this is called context-sensitive lexing. Is this possible with ANTLR?

John Gietzen

2 Answers


this is called context-sensitive lexing. Is this possible with ANTLR?

No, the parser cannot "tell" the lexer that it needs to treat, say, a digit differently at a certain point during parsing. Some context-sensitive lexing is possible inside the lexer alone, but the parser cannot influence the lexer.

However, it can be easily solved with some extra parser rules. For example, when matching a character class ([ ... ]), you use a parser rule that matches whatever is valid inside a character class:

char_class
 : LBRACK char_class_char+ RBRACK
 ;

// ...

char_class_char
 : LBRACK // the '[' is not special inside a character class!
 | LBRACE // the '{' is not special inside a character class!
 | RBRACE // the '}' is not special inside a character class!
 | PLUS   // the '+' is not special inside a character class!
 | STAR   // the '*' is not special inside a character class!
 | QMARK  // the '?' is not special inside a character class!
 | COMMA
 | DIGIT
 | OTHER
 ;
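
To show the idea outside of ANTLR, here is a minimal Python sketch. It is only an illustration, not generated parser code: the names `TOKEN_TYPES`, `lex` and `parse_char_class` are made up for this example. The lexer is context-free (a ',' is always COMMA), and the character-class rule simply accepts every token type except RBRACK as a literal.

```python
# Hypothetical token names, loosely mirroring the grammar above.
TOKEN_TYPES = {
    "[": "LBRACK", "]": "RBRACK", "{": "LBRACE", "}": "RBRACE",
    "+": "PLUS", "*": "STAR", "?": "QMARK", ",": "COMMA",
}


def lex(text):
    """Context-free lexing: a character always gets the same token type."""
    tokens = []
    for ch in text:
        if ch in TOKEN_TYPES:
            tokens.append((TOKEN_TYPES[ch], ch))
        elif ch in "0123456789":
            tokens.append(("DIGIT", ch))
        else:
            tokens.append(("OTHER", ch))
    return tokens


def parse_char_class(tokens, pos):
    """char_class : LBRACK char_class_char+ RBRACK ; returns (chars, next_pos)."""
    if tokens[pos][0] != "LBRACK":
        raise SyntaxError("expected '['")
    pos += 1
    chars = []
    # Every token type except RBRACK is accepted here as a plain literal.
    while pos < len(tokens) and tokens[pos][0] != "RBRACK":
        chars.append(tokens[pos][1])
        pos += 1
    if not chars or pos == len(tokens):
        raise SyntaxError("empty or unterminated character class")
    return chars, pos + 1
```

The lexer never needs to know where it is. The context lives entirely in which parser rule consumes the token.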

A small demo:

grammar T;

parse
 : atom* EOF
 ;

atom
 : unit quantifier?
 ;

unit
 : char_class
 | single_char
 ;

quantifier
 : greedy (PLUS | QMARK)?
 ;

greedy
 : PLUS
 | STAR
 | QMARK
 | LBRACE (number (COMMA number?)?) RBRACE
 ;

char_class
 : LBRACK char_class_char+ RBRACK
 ;

number
 : DIGIT+
 ;

single_char
 : DIGIT
 | COMMA
 | RBRACE
 | RBRACK // this is only special inside a character class
 | OTHER
 ;

char_class_char
 : LBRACK
 | LBRACE
 | RBRACE
 | PLUS
 | STAR
 | QMARK
 | COMMA
 | DIGIT
 | OTHER
 ;

LBRACK : '[';
RBRACK : ']';
LBRACE : '{';
RBRACE : '}';
PLUS   : '+';
STAR   : '*';
QMARK  : '?';
COMMA  : ',';
DIGIT  : '0'..'9';
OTHER  : . ;

which would parse the input "[+*]{5,20}?A*+" as follows:

[image: parse tree for the input above]

A more complete PCRE grammar can be found in the PCREParser project: https://github.com/bkiers/PCREParser

EDIT

That is, I would prefer to tokenize "," as COMMA inside of the curly braces, but tokenize it as CHAR outside. I will use the workaround for now, but is that possible?

No, like I said: the lexer cannot be influenced by the parser. If you want this, you should go for a PEG instead of ANTLR. With ANTLR, there simply is a strict separation between lexing and parsing: you cannot do anything about that.

However, you could just change the type of the token that is matched in a parser rule. Every parser rule has a $start and $stop token denoting the first and last token it matches. Since char_class_char (and single_char) always match exactly one token, you can change the type of that token in the @after block of the rule like this:

single_char
@after{$start.setType(CHAR);}
 : DIGIT
 | COMMA
 | RBRACE
 | RBRACK // this is only special inside a character class
 | CHAR
 ;

char_class_char
@after{$start.setType(CHAR);}
 : LBRACK
 | LBRACE
 | RBRACE
 | PLUS
 | STAR
 | QMARK
 | COMMA
 | DIGIT
 | CHAR
 ;

// ...

CHAR : . ;

resulting in the behavior you're after (I guess).
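
As a rough illustration of that retyping step outside of ANTLR, here is a small Python sketch. The `Token` class and the `lex` and `single_char` functions are hypothetical names invented for this example. The lexer still labels every ',' as COMMA, and the rule that matches a literal character relabels the one token it consumed, which is the counterpart of `$start.setType(CHAR)`.

```python
from dataclasses import dataclass


@dataclass
class Token:
    type: str  # mutable, so a parser rule can relabel it later
    text: str


# Hypothetical subset of the token types used in the grammar above.
SPECIAL = {"{": "LBRACE", "}": "RBRACE", ",": "COMMA"}


def lex(text):
    """Context-free lexing: ',' is always COMMA at this stage."""
    tokens = []
    for ch in text:
        if ch in SPECIAL:
            tokens.append(Token(SPECIAL[ch], ch))
        elif ch in "0123456789":
            tokens.append(Token("DIGIT", ch))
        else:
            tokens.append(Token("CHAR", ch))
    return tokens


def single_char(tokens, pos):
    """single_char : DIGIT | COMMA | RBRACE | CHAR ; returns the next position."""
    tok = tokens[pos]
    if tok.type not in {"DIGIT", "COMMA", "RBRACE", "CHAR"}:
        raise SyntaxError(f"unexpected {tok.type}")
    tok.type = "CHAR"  # relabel the single token this rule matched
    return pos + 1
```

Tokens that are consumed by other rules (for example the COMMA inside a `{2,3}` quantifier) are never passed through `single_char`, so they keep their original type.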

HTH

Bart Kiers
  • I guess the trick is just to make the parser accept those tokens in both places. I would prefer that the tokens be semantically correct, however. That is, I would prefer to tokenize "," as COMMA inside of the curly braces, but tokenize it as CHAR outside. I will use the workaround for now, but is that possible? – John Gietzen Jul 15 '12 at 15:37

It is possible to do this using gated semantic predicates in the lexer. In the code below, ',' matches the COMMA rule only while isComma is true. Otherwise it matches CHAR, provided CHAR appears after COMMA in the grammar. Here setComma() and clearComma() stand for helper methods that set and clear the isComma flag. I don't know CSharp, so I can't give a complete example.

L_CURLY : '{' {setComma();};
R_CURLY : '}' {clearComma();};
COMMA : {isComma}? => ',';

Obviously if curly braces are used in different contexts, this may not work. I recommend avoiding using the lexer this way unless it really makes a mess of the parser.
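
The flag-based idea can be sketched in plain Python, independent of ANTLR. This is only an assumption-laden illustration: the function name `lex_with_flag` is made up, `in_braces` plays the role of the isComma flag, and everything that is not a brace or a gated comma is lumped into CHAR (digits included).

```python
def lex_with_flag(text):
    """Return (type, char) pairs; ',' is COMMA only between '{' and '}'."""
    in_braces = False  # plays the role of the isComma flag
    tokens = []
    for ch in text:
        if ch == "{":
            in_braces = True  # like setComma()
            tokens.append(("L_CURLY", ch))
        elif ch == "}":
            in_braces = False  # like clearComma()
            tokens.append(("R_CURLY", ch))
        elif ch == "," and in_braces:
            tokens.append(("COMMA", ch))  # the gated alternative
        else:
            tokens.append(("CHAR", ch))  # fallback when the gate is closed
    return tokens
```

The sketch also shows the weakness mentioned above: a '{' that does not start a quantifier still flips the flag, so any ',' that follows is labeled COMMA until the next '}'.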

  • That is true, however, since regex has a lot of characters that will need to be handled differently in certain contexts (the `^` inside character classes, most normal meta chars that are not special inside character classes, all characters inside `\Q` and `\E`, and many many more...), I would not recommend this. – Bart Kiers Jul 16 '12 at 15:28
  • This is almost exactly what I was looking for. However, I intend to switch to a PEG style parser, rather than ANTLR. – John Gietzen Jul 16 '12 at 16:16
  • @Bart: CARAT is special both inside and outside the character class. The only *really* strange character is the hyphen. Everything else is pretty much OK. – John Gietzen Jul 18 '12 at 15:23
  • @JohnGietzen, well, `[^a]` matches any char other than the literal `'a'` while `[a^]` matches either the literal `'a'` or the literal `'^'`. And inside a character class the hyphen does not have any special meaning when placed at the start or end of the class, or when placed directly after a range or shorthand character class: i.e. the hyphen in `[-abc]`, `[abc-]`, `[\d-\w]`, ... all match the literal `'-'`. – Bart Kiers Jul 18 '12 at 15:36
  • @Bart: Fair enough. I think I need to use a PEG or Recursive Descent parser. – John Gietzen Jul 18 '12 at 18:14
  • @JohnGietzen, ANTLR *produces* LL based recursive descent parsers. But going for a PEG (or some other scanner-less technique) could make parsing PCRE less difficult (assuming you can find a decent PEG for your target language). Note that all the corner cases I outline in my previous comment (and more!) are all handled properly by the ANTLR grammar I posted a link to in my answer. – Bart Kiers Jul 18 '12 at 18:22