
I'm trying to implement a rule along the lines of "all characters in the Letter and Symbol Unicode categories except a few reserved characters." From the lexer rules, I know I can use \p{___} to match against Unicode categories, but I am unsure of how to handle excluding certain characters.

Looking at example grammars, I am led in a few different directions. For example, the Java 9 grammar seems to use predicates to call Java's built-in Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart() directly, while others manually define every valid character.

How can I achieve this functionality?

Panda

1 Answer


Without target-specific code, you have to define the character ranges yourself and leave the characters you want to exclude out of those ranges. A lexer rule cannot match \p{...} and then subtract certain characters from it.

With target-specific code, you can use a predicate, as the Java 9 grammar does:

@lexer::members {
  boolean aCustomMethod(int character) {
    // Your logic to see if 'character' is valid. You're sure
    // that it's at least a char from \p{Letter} or \p{Symbol}
    return true;
  }
}

TOKEN
 : [\p{Letter}\p{Symbol}] {aCustomMethod(_input.LA(-1))}?
 ;
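
To make the placeholder logic above more concrete, here is a minimal sketch of what the check could look like for the Java target. The class name IdentifierCharFilter, the method name isAllowed, and the reserved characters '$', '+' and '=' are all made up for illustration; substitute your own list. The category check repeats what the character set in the rule already guarantees, and is only there so the method can be tested on its own.

```java
import java.util.Set;

// Hypothetical helper, not part of ANTLR or of the Java 9 grammar.
final class IdentifierCharFilter {
    // Assumed reserved characters; replace with the ones your language reserves.
    private static final Set<Integer> RESERVED =
            Set.of((int) '$', (int) '+', (int) '=');

    // Returns true if codePoint is in a Letter or Symbol general category
    // and is not one of the reserved characters.
    static boolean isAllowed(int codePoint) {
        if (RESERVED.contains(codePoint)) {
            return false;
        }
        switch (Character.getType(codePoint)) {
            // Letter categories: Lu, Ll, Lt, Lm, Lo
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            // Symbol categories: Sm, Sc, Sk, So
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return true;
            default:
                return false;
        }
    }
}
```

Assuming this class is on the lexer's classpath, the predicate in the rule could then call IdentifierCharFilter.isAllowed(_input.LA(-1)) in place of aCustomMethod(...). Keep in mind that this ties the grammar to the Java target.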
Bart Kiers