Relationship between Alnum and IsAlphabetic character classes in Java RegEx patterns

Question

Looking at the Javadoc for java.util.regex.Pattern

\p{Alnum} An alphanumeric character:[\p{IsAlphabetic}\p{IsDigit}]

it appears that every character that matches \p{IsAlphabetic} should also match \p{Alnum}.

However, it does not seem to be the case when the character has an accent. For example, the following assertion fails:

assertEquals("é".matches("\\p{IsAlphabetic}+"),"é".matches("\\p{Alnum}+"));

The same thing happens for other accented characters such as ą, ó, ł, ź, and ż. All of them match \p{IsAlphabetic}+ but not \p{Alnum}+.

Am I misinterpreting the Javadoc? Or is this a bug in the documentation or the implementation?

`\p{Alnum}` only matches ASCII letters and digits. Use `Pattern.UNICODE_CHARACTER_CLASS` option to make it fully Unicode aware. — Wiktor Stribiżew, Apr 18 '19 at 08:30
If you read the documentation page you referenced, you will see that `\p{Alnum}` = `[\p{Alpha}\p{Digit}]` and `\p{Alpha}` = `[\p{Lower}\p{Upper}]` and `\p{Lower}` = `[a-z]` and `\p{Upper}` = `[A-Z]` — Wiktor Stribiżew, Apr 18 '19 at 08:35

score 3 · Accepted Answer · answered Apr 18 '19 at 08:33


By default, \p{Alnum} is treated as a POSIX character class, which means it will only ever match ASCII characters. It will match a and 1 but not ä or ١.

The passage you quote only applies when the UNICODE_CHARACTER_CLASS flag is used.

Slightly oversimplified, this flag will turn the "old" POSIX style character classes into their equivalent Unicode character classes.
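To make the difference concrete, here is a minimal sketch that compiles the same pattern with and without the flag. The class and method names (AlnumFlagDemo, matchesAscii, matchesUnicode) are made up for illustration; only Pattern.compile and Pattern.UNICODE_CHARACTER_CLASS come from the JDK. The accented characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class AlnumFlagDemo {
    // Default compilation: \p{Alnum} is the POSIX class (US-ASCII only).
    static final Pattern ASCII_ALNUM = Pattern.compile("\\p{Alnum}+");

    // With the flag, \p{Alnum} follows the Unicode definition
    // [\p{IsAlphabetic}\p{IsDigit}] quoted in the question.
    static final Pattern UNICODE_ALNUM =
            Pattern.compile("\\p{Alnum}+", Pattern.UNICODE_CHARACTER_CLASS);

    public static boolean matchesAscii(String s) {
        return ASCII_ALNUM.matcher(s).matches();
    }

    public static boolean matchesUnicode(String s) {
        return UNICODE_ALNUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // \u00e9 is "e with acute", \u0142 is "l with stroke".
        String[] samples = {"a", "1", "\u00e9", "\u0142"};
        for (String s : samples) {
            System.out.println(s + " ascii=" + matchesAscii(s)
                    + " unicode=" + matchesUnicode(s));
        }
    }
}
```

For the ASCII inputs both patterns should agree; for the accented letters only the pattern compiled with the flag should match.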


Joachim Sauer



It's actually right there in the docs, but the way it's worded, I didn't get it on the first read. Cheers! – toniedzwiedz Apr 18 '19 at 08:38

score 3 · Answer 2 · answered Apr 18 '19 at 08:40

Your quote from the documentation is accurate, but you missed the line just before that table:

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

If you read the documentation page you referenced, you will see that \p{Alnum} = [\p{Alpha}\p{Digit}] and \p{Alpha} = [\p{Lower}\p{Upper}] and \p{Lower} = [a-z] and \p{Upper} = [A-Z].

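So, \p{Alnum} only matches ASCII letters and digits when the UNICODE_CHARACTER_CLASS flag is not set, while \p{IsAlphabetic} matches all Unicode alphabetic characters by default (no flag is necessary). Note that \p{L} is closely related but not identical: it covers the Letter general category, whereas \p{IsAlphabetic} is the somewhat broader Alphabetic binary property.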
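If you want to keep using String.matches as in the question, the flag can also be switched on inside the pattern with the embedded flag expression (?U). A small sketch, with hypothetical class and method names (EmbeddedFlagDemo and its helpers):

```java
public class EmbeddedFlagDemo {
    // (?U) is the embedded form of Pattern.UNICODE_CHARACTER_CLASS.
    public static boolean alnumUnicode(String s) {
        return s.matches("(?U)\\p{Alnum}+");
    }

    // No flag: POSIX behaviour, ASCII letters and digits only.
    public static boolean alnumDefault(String s) {
        return s.matches("\\p{Alnum}+");
    }

    // \p{IsAlphabetic} is Unicode-aware without any flag.
    public static boolean alphabetic(String s) {
        return s.matches("\\p{IsAlphabetic}+");
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // "e with acute", escaped to avoid encoding issues
        System.out.println("IsAlphabetic:  " + alphabetic(eAcute));
        System.out.println("Alnum default: " + alnumDefault(eAcute));
        System.out.println("Alnum (?U):    " + alnumUnicode(eAcute));
    }
}
```

With (?U) in place, the two sides of the comparison in the question's assertion should agree for accented letters.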
