The following Python code (version 3.11.0) gives an unexpected result:
import re
import sys
s = ''.join(map(chr, range(sys.maxunicode + 1)))
matches = ''.join(re.findall('[a-z]', s, re.IGNORECASE))
print(matches)
It prints four extra non-ASCII characters, 'İıſK', after the 52 ASCII letters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK
This is actually documented, but without any explanation as to why it behaves like this:
Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). If the ASCII flag is used, only letters ‘a’ to ‘z’ and ‘A’ to ‘Z’ are matched.
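To double-check that note without scanning the whole Unicode range, here is a minimal sketch that tests only the four characters it names. The variable names (`extras`, `unicode_hits`, `ascii_hits`, `plain_hits`) are my own, and I am assuming the behavior is the same as on 3.11.0:

```python
import re
import unicodedata

# The four non-ASCII characters named in the documentation note.
extras = ['\u0130', '\u0131', '\u017f', '\u212a']

# Print each code point with its official Unicode name.
for ch in extras:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Default (Unicode) matching with IGNORECASE: all four match [a-z].
unicode_hits = [ch for ch in extras
                if re.fullmatch('[a-z]', ch, re.IGNORECASE)]

# Adding the ASCII flag restricts case-insensitive matching to ASCII letters.
ascii_hits = [ch for ch in extras
              if re.fullmatch('[a-z]', ch, re.IGNORECASE | re.ASCII)]

# Without IGNORECASE, none of them fall inside the range a-z.
plain_hits = [ch for ch in extras if re.fullmatch('[a-z]', ch)]

print(unicode_hits, ascii_hits, plain_hits)
```

So the extra matches appear only when IGNORECASE is combined with the default Unicode matching, which is consistent with the note.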
I could maybe understand matching against the Kelvin sign, but the others make no sense to me. Is this just a bug or is there a deeper reason why it should behave like this?