Why does the regex "[a-z]" match against the non-ASCII characters "İıſK" when the case-insensitive flag is used?

Question

The following Python code (version 3.11.0) gives an unexpected result:

import re
import sys

s = ''.join(map(chr, range(sys.maxunicode + 1)))
matches = ''.join(re.findall('[a-z]', s, re.IGNORECASE))
print(matches)

It prints the extra 4 non-ASCII characters 'İıſK':

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK

This is actually documented, but without any explanation as to why it behaves like this:

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). If the ASCII flag is used, only letters ‘a’ to ‘z’ and ‘A’ to ‘Z’ are matched.

I could maybe understand matching against the Kelvin sign, but the others make no sense to me. Is this just a bug or is there a deeper reason why it should behave like this?

Joachim Sauer · Accepted Answer · 2022-11-11T08:45:43.890

Those characters are considered (at least in some situations/locales) to be lower-/upper-case variants of the "traditional" ASCII a-z characters:

U+0130 Latin Capital Letter I with Dot Above İ is an upper-case variant of i
Similarly, U+0131 Latin Small Letter Dotless I ı is considered a lower-case variant of I
U+017F Latin Small Letter Long S ſ is a lower-case variant of S
U+212A Kelvin Sign K is an upper-case variant of k

(See the "Uppercase Character" and "Lowercase Character" entries on these pages, which are directly taken from the Unicode data set).
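These mappings can also be checked from Python itself. The following is a minimal sketch using only the standard library; the variable names are illustrative. Note that str.lower() applies Unicode's full case mapping, so İ lowercases to "i" followed by U+0307 (combining dot above) rather than to a bare i. The regex engine presumably relies on the simple one-to-one mapping instead, which is an assumption here, not something the documentation note spells out.

```python
import unicodedata

# The four non-ASCII characters from the documentation note.
extras = ["\u0130", "\u0131", "\u017f", "\u212a"]

for ch in extras:
    # Print the code point, the official Unicode name, and the result of
    # Python's case conversions as lists of code points.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "upper:", [f"U+{ord(c):04X}" for c in ch.upper()],
        "lower:", [f"U+{ord(c):04X}" for c in ch.lower()],
    )

dotless_i_upper = "\u0131".upper()  # ASCII "I"
long_s_upper = "\u017f".upper()     # ASCII "S"
kelvin_lower = "\u212a".lower()     # ASCII "k"
dotted_i_lower = "\u0130".lower()   # "i" followed by U+0307
```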

Why are these "non-default" characters marked this way? Because in some sense, or in some locales, they actually are valid relatives. For example, because the dotless I exists, both dotted and dotless variants of I exist in upper and in lower case, and this causes frequent problems in software. Similarly, if you had a text that contained a ſ and you wanted to convert it to upper case, then S would be the most appropriate candidate.
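If this behaviour is unwanted, the documentation note quoted in the question points to the ASCII flag. A small sketch of the difference, restricted to the four characters in question (the variable names are made up for illustration):

```python
import re

extras = "\u0130\u0131\u017f\u212a"  # İ, ı, ſ, K (Kelvin sign)

# Unicode-aware case-insensitive matching (the default for str patterns).
unicode_hits = re.findall("[a-z]", extras, re.IGNORECASE)

# Adding re.ASCII restricts case-insensitive matching to ASCII letters.
ascii_hits = re.findall("[a-z]", extras, re.IGNORECASE | re.ASCII)

print(unicode_hits)  # all four characters match
print(ascii_hits)    # no matches
```

With re.ASCII set, [a-z] combined with IGNORECASE only matches the 52 ASCII letters, as the documentation states.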
