What do I do in ANTLR if I want to parse something which is extremely configurable?

Question

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.

Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.

What is the best way to do this in ANTLR?

What I'm currently considering is something like this:

lexer grammar ExpressionLexer;

options {
    superClass = AbstractLexer;
}

DIGIT: . {isDigit(_input.LA(-1))}?;
// ... and so on for other tokens ...

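import org.antlr.v4.runtime.CharStream
import org.antlr.v4.runtime.Lexer

abstract class AbstractLexer(input: CharStream, private val symbols: Symbols) : Lexer(input) {
    // Delegates to the locale-specific symbol table
    fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
    // ... and so on for other tokens ...
}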

Alternative approaches I am considering:

(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.

(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)

(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)

score 1 · Answer 1 · answered Mar 09 '22 at 07:55

I recommend doing that in two steps:

1. Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
2. Do a semantic check against the user's locale to verify that the digits given actually match what's allowed. Do the same for the separator.

This way you don't need all kinds of hacks, platform-specific code and so on.
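
As a rough illustration of the second step, here is a minimal Kotlin sketch. It assumes the lexer has already matched a run of candidate digit and separator characters and handed over the token text. The helper name isValidForLocale is hypothetical, and it relies on the JDK's java.text.DecimalFormatSymbols for the locale's zero digit and decimal separator. Grouping separators, signs and exponents are left out.

```kotlin
import java.text.DecimalFormatSymbols
import java.util.Locale

// Hypothetical helper: checks a literal that a permissive lexer rule already matched.
fun isValidForLocale(literal: String, locale: Locale): Boolean {
    val symbols = DecimalFormatSymbols.getInstance(locale)
    // Assumption: the locale's ten digits are contiguous, starting at its zero digit.
    val zero = symbols.zeroDigit.code
    val separator = symbols.decimalSeparator.code
    var seenSeparator = false
    var seenDigit = false
    var i = 0
    while (i < literal.length) {
        val cp = literal.codePointAt(i)
        when {
            cp in zero..zero + 9 -> seenDigit = true
            // At most one decimal separator is accepted.
            cp == separator && !seenSeparator -> seenSeparator = true
            else -> return false
        }
        i += Character.charCount(cp)
    }
    // A literal with no digits at all is rejected.
    return seenDigit
}
```

With this sketch, "3,14" would be accepted for Locale.GERMANY and rejected for Locale.US, while the grammar itself stays the same for both.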

For your side question: programming usually follows English conventions, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.

So basically option (b). And yes, for programming, obviously only English formatted numbers are used, but sometimes pure maths involves functions with two parameters too, so I always wondered. — Hakanai, Mar 09 '22 at 08:49

score 1 · Answer 2 · answered Mar 09 '22 at 08:29

Note that from ANTLR v4.7 onwards, lexer grammars have better Unicode support, including Unicode property escapes: https://github.com/antlr/antlr4/blob/master/doc/unicode.md

So you could define a lexer rule like this:

DIGIT
 : [\p{Digit}]
 ;

which will match both ٣ and 3.
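
Once such a token has been matched, its text still has to be turned into numeric values. A minimal Kotlin sketch of that follow-up step, assuming a JVM target and using the JDK's Character.isDigit and Character.digit (the helper name digitValues is hypothetical):

```kotlin
// Hypothetical helper: maps each decimal digit in the text to its value 0..9.
fun digitValues(text: String): List<Int> =
    text.codePoints().toArray().map { cp ->
        // Character.isDigit is true for code points in general category Nd.
        require(Character.isDigit(cp)) { "not a decimal digit: U+%04X".format(cp) }
        Character.digit(cp, 10)
    }
```

For example, digitValues("٣3") should yield [3, 3], since both characters are decimal digits with the value three.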

Bart Kiers

Indeed. Whether it also matches 100% of digits though is another matter... I'd have to read through Unicode to see which characters have the tag. – Hakanai Mar 09 '22 at 08:47