3

Looking at the Javadoc for java.util.regex.Pattern

\p{Alnum} An alphanumeric character:[\p{IsAlphabetic}\p{IsDigit}]

it appears that every character that matches \p{IsAlphabetic} should also match \p{Alnum}.

However, that does not seem to be the case when the character has an accent. For example, the following assertion fails:

assertEquals("é".matches("\\p{IsAlphabetic}+"),"é".matches("\\p{Alnum}+"));

The same thing happens for other accented characters such as ą, ó, ł, ź, and ż. All of them match \p{IsAlphabetic}+ but not \p{Alnum}+.

Am I misinterpreting the Javadoc? Or is this a bug in the documentation or the implementation?

toniedzwiedz

2 Answers

3

By default, \p{Alnum} is treated as a POSIX character class, which means it only ever matches ASCII characters: it will match a and 1 but not ä or ١.

The passage you quote only applies when the UNICODE_CHARACTER_CLASS flag is used.

Slightly oversimplified, this flag will turn the "old" POSIX style character classes into their equivalent Unicode character classes.
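As a minimal sketch of the difference (the class and method names here are illustrative, not from the Javadoc), the same \p{Alnum}+ pattern can be compiled with and without the flag. The flag can be passed to Pattern.compile or switched on inside the expression with the embedded (?U) form:

```java
import java.util.regex.Pattern;

public class AlnumFlagDemo {
    // Default mode: \p{Alnum} is the POSIX class and is limited to US-ASCII.
    static boolean matchesAscii(String s) {
        return Pattern.compile("\\p{Alnum}+").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \p{Alnum} becomes [\p{IsAlphabetic}\p{IsDigit}].
    static boolean matchesUnicode(String s) {
        return Pattern.compile("\\p{Alnum}+", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s)
                .matches();
    }

    // The embedded flag (?U) enables the same behaviour from inside the regex,
    // which is handy with String.matches where no flags argument exists.
    static boolean matchesEmbedded(String s) {
        return s.matches("(?U)\\p{Alnum}+");
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";      // é
        String arabicOne = "\u0661";   // ١ (ARABIC-INDIC DIGIT ONE)

        System.out.println(matchesAscii("a1"));         // true
        System.out.println(matchesAscii(eAcute));       // false
        System.out.println(matchesAscii(arabicOne));    // false
        System.out.println(matchesUnicode(eAcute));     // true
        System.out.println(matchesUnicode(arabicOne));  // true
        System.out.println(matchesEmbedded(eAcute));    // true
    }
}
```

The non-ASCII characters are written as \u escapes so the result does not depend on the source file encoding.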

Joachim Sauer
3

Your quote from the documentation is fine, but you missed the line before that table:

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

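If you read the POSIX character classes section of the documentation page you referenced, you will see that \p{Alnum} = [\p{Alpha}\p{Digit}], \p{Alpha} = [\p{Lower}\p{Upper}], \p{Lower} = [a-z], \p{Upper} = [A-Z], and \p{Digit} = [0-9].

So, when the UNICODE_CHARACTER_CLASS flag is not set, \p{Alnum} only matches ASCII letters and digits, while \p{L} and \p{IsAlphabetic} match Unicode letters by default (no flag is necessary). The two are closely related but not strictly identical: \p{IsAlphabetic} is the Unicode Alphabetic property, which covers every \p{L} letter plus some other characters such as letter numbers.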
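The expansion above can be checked with a short sketch (the class and helper names are hypothetical, chosen only for this illustration). Without any flag, \p{Alnum}+ behaves like the plain ASCII range [a-zA-Z0-9]+, while \p{L}+ and \p{IsAlphabetic}+ accept the accented letters from the question:

```java
public class AsciiExpansionDemo {
    // POSIX class in default mode.
    static boolean alnum(String s) {
        return s.matches("\\p{Alnum}+");
    }

    // The ASCII range that \p{Alnum} expands to in default mode.
    static boolean asciiRange(String s) {
        return s.matches("[a-zA-Z0-9]+");
    }

    // Unicode general category "Letter"; needs no flag.
    static boolean letter(String s) {
        return s.matches("\\p{L}+");
    }

    // Unicode binary property Alphabetic; needs no flag.
    static boolean alphabetic(String s) {
        return s.matches("\\p{IsAlphabetic}+");
    }

    public static void main(String[] args) {
        // "abc123" plus the accented letters é, ą, ó, ł, ź, ż from the question.
        String[] samples = {
            "abc123", "\u00E9", "\u0105", "\u00F3", "\u0142", "\u017A", "\u017C"
        };
        for (String s : samples) {
            System.out.printf("%s alnum=%b ascii=%b L=%b alphabetic=%b%n",
                    s, alnum(s), asciiRange(s), letter(s), alphabetic(s));
        }
    }
}
```

For every sample, the alnum and ascii columns should agree, and the accented letters should report false for both while reporting true for the two Unicode classes.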

Wiktor Stribiżew