The following Python code (version 3.11.0) gives an unexpected result:

import re
import sys

s = ''.join(map(chr, range(sys.maxunicode + 1)))
matches = ''.join(re.findall('[a-z]', s, re.IGNORECASE))
print(matches)

It prints four extra non-ASCII characters, 'İıſK', after the ASCII letters:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK

This behavior is documented, but the documentation does not explain why it happens:

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). If the ASCII flag is used, only letters ‘a’ to ‘z’ and ‘A’ to ‘Z’ are matched.
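The difference the note describes can be checked directly. This is a minimal sketch that runs the same character class against only the four extra characters, once with IGNORECASE alone and once combined with re.ASCII. The variable names are illustrative and not part of the original code.

```python
import re

# The four extra characters named in the documentation note:
# U+0130 (I with dot above), U+0131 (dotless i),
# U+017F (long s), U+212A (Kelvin sign).
extras = '\u0130\u0131\u017f\u212a'

# Unicode matching (the default for str patterns) with IGNORECASE.
unicode_hits = re.findall('[a-z]', extras, re.IGNORECASE)

# Same pattern, but restricted to ASCII-only case folding.
ascii_hits = re.findall('[a-z]', extras, re.IGNORECASE | re.ASCII)

print(unicode_hits)
print(ascii_hits)
```

Per the documentation quoted above, the first call should match all four characters and the second should match none of them.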

I could maybe understand matching against the Kelvin sign, but the others make no sense to me. Is this just a bug or is there a deeper reason why it should behave like this?

jsbueno
Wood

1 Answer

Those characters are considered (at least in some situations/locales) to be lower-/upper-case variants of the "traditional" ASCII a-z characters:

(See the "Uppercase Character" and "Lowercase Character" entries on these pages, which are directly taken from the Unicode data set).

Why are these "non-default" characters marked this way? Because in some sense, or in some locales, they are valid case variants of the ASCII letters. For example, because of the existence of the dotless I, both dotted and dotless variants of I exist in upper and lower case, which causes frequent problems in software. Similarly, if you had a text that contained a ſ and wanted to convert it to upper case, S would be the most appropriate candidate.
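The case relationships described above can be observed with Python's built-in string methods. This is only a sketch: str.upper() and str.lower() apply Unicode case mappings, which is related to, but not necessarily identical to, what the re module does internally for IGNORECASE. The variable names are illustrative.

```python
# Uppercasing the dotless i (U+0131) gives the ASCII capital I.
dotless_i_upper = '\u0131'.upper()

# Uppercasing the long s (U+017F) gives the ASCII capital S.
long_s_upper = '\u017f'.upper()

# Lowercasing the Kelvin sign (U+212A) gives the ASCII small k.
kelvin_lower = '\u212a'.lower()

# Lowercasing the dotted capital I (U+0130) gives an ASCII 'i'
# followed by a combining dot above (U+0307).
dotted_i_lower = '\u0130'.lower()

print(dotless_i_upper, long_s_upper, kelvin_lower)
print([hex(ord(c)) for c in dotted_i_lower])
```

Each of the four characters maps to an ASCII letter under some case conversion, which is consistent with a case-insensitive [a-z] treating them as matches.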

Joachim Sauer