Exclude some characters in Unicode category

Question

I'm trying to implement a rule along the lines of "all characters in the Letter and Symbol Unicode categories except a few reserved characters." From the lexer rules, I know I can use \p{___} to match against Unicode categories, but I am unsure of how to handle excluding certain characters.

Looking at example grammars, I am led in a few different directions. For example, the Java 9 grammar seems to use predicates in order to directly call Java's built-in Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(), while others manually define every valid character.

How can I achieve this functionality?

score 0 · Accepted Answer · answered Mar 06 '18 at 07:09

Without target-specific code, you will have to define the character ranges yourself so that the characters you want to exclude are not part of them. You cannot use \p{...} and then exclude certain characters from it.

With target-specific code, you can use a semantic predicate, as the Java 9 grammar does:

@lexer::members {
  boolean aCustomMethod(int character) {
    // Your logic to decide whether 'character' is valid. The lexer has
    // already matched it against [\p{Letter}\p{Symbol}], so only the
    // reserved characters need to be rejected here.
    return true;
  }
}

TOKEN
 : [\p{Letter}\p{Symbol}] {aCustomMethod(_input.LA(-1))}?
 ;
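
As a rough sketch of what the logic behind aCustomMethod could look like, here is a small standalone Java helper. The class name ReservedFilter, the method names, and the reserved characters "$=+" are all hypothetical placeholders chosen for illustration; substitute whatever your language actually reserves. The category check mirrors \p{Letter} and \p{Symbol} only so that the helper can be tried outside the lexer. Inside the predicate the lexer has already done that part, so the reserved-character test alone should be enough.

```java
final class ReservedFilter {
    // Hypothetical reserved characters; replace with your own.
    private static final String RESERVED = "$=+";

    // True if the code point is in a Unicode Letter (L*) or Symbol (S*) category.
    static boolean isLetterOrSymbol(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return true;
            default:
                return false;
        }
    }

    // True if the code point is a letter or symbol that is not reserved.
    static boolean isAllowed(int codePoint) {
        // String.indexOf(int) accepts a code point, so this also works
        // for characters outside the Basic Multilingual Plane.
        return isLetterOrSymbol(codePoint) && RESERVED.indexOf(codePoint) < 0;
    }
}
```

With these placeholder values, ReservedFilter.isAllowed('a') and ReservedFilter.isAllowed('<') return true, while ReservedFilter.isAllowed('$') and ReservedFilter.isAllowed('1') return false. In the grammar, the body of aCustomMethod could then shrink to the reserved-character test on its own.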
