Regex for prohibiting surrogates in Python

Question

I'm writing a regex to match the following condition:

shall not specify a character whose short identifier is less than 00A0 other than 0024 ( $ ), 0040 ( @ ), or 0060 ( ` ), nor one in the range D800 through DFFF inclusive.

I wrote the following regex:

PATTERN = r"([\u0024\u0040\u0060]|(?![\u0000-\u00A0])|(?![\u8000-\udfff]))"

and use it to search as follows:

import re

text = "some str"  # placeholder input; avoids shadowing the built-in str
search = re.search(PATTERN, text, re.UNICODE)

The thing that I'm confused by is that \u8000-\udfff are surrogates.

DEMO.

But running this regex in my script seems to work fine. Is this a correct way to use a regex to filter out such characters?

score 2 · Accepted Answer · answered Oct 06 '19 at 06:11

After some digging, I found this answer: https://stackoverflow.com/a/32574077/12167858

In short: lone surrogates do not occur in well-formed Unicode text, and a Python 3 string decoded from valid input will not contain them. Your regex runs without error because no such characters are in the input to begin with. Python seems to accept the range and move on, whereas regex101 marks it as an error even though the pattern runs fine.
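A quick way to check this is sketched below, assuming Python 3. The names SURROGATES and has_surrogate are made up for illustration. Strictly speaking, a Python 3 str can hold a lone surrogate if one is written explicitly with an escape, but such a string cannot be encoded as strict UTF-8, which is why these code points rarely show up in real input.

```python
import re

# Hypothetical names for illustration; the range is D800..DFFF from the quoted rule.
SURROGATES = re.compile(r"[\ud800-\udfff]")

def has_surrogate(text):
    """Return True if text contains a code point in D800..DFFF."""
    return SURROGATES.search(text) is not None

ordinary = "price: $5 @ caf\u00e9"
lone = "abc\udc80def"  # a lone surrogate written explicitly with an escape

print(has_surrogate(ordinary))
print(has_surrogate(lone))

# A lone surrogate cannot be encoded with the strict UTF-8 codec.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```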

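To answer your question: yes and no. The pattern is accepted, but on ordinary input the surrogate part simply won't do a thing. I would suggest removing the \u8000-\udfff part. Note also that the rule you quote gives the range as D800 through DFFF, so if you do keep it, the lower bound should be \ud800 rather than \u8000.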
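If the goal is to reject strings that break the quoted rule, one option is a single character class listing the forbidden code points. This is only a sketch, assuming Python 3 and assuming the rule is applied directly to the characters of a str; FORBIDDEN and is_allowed are hypothetical names. A plain class also avoids the negative lookaheads, which are zero-width and can therefore succeed without consuming any character.

```python
import re

# Hypothetical name. Code points the quoted rule forbids:
# everything below 00A0 except 0024 ($), 0040 (@) and 0060 (`),
# plus the surrogate range D800..DFFF (kept here only for completeness).
FORBIDDEN = re.compile(
    r"[\u0000-\u0023\u0025-\u003f\u0041-\u005f\u0061-\u009f\ud800-\udfff]"
)

def is_allowed(text):
    """Return True if no character in text is forbidden by the rule."""
    return FORBIDDEN.search(text) is None

print(is_allowed("$@`\u00e9"))  # only permitted characters
print(is_allowed("a"))          # 0061 is below 00A0 and not an exception
```

Dropping `\ud800-\udfff` from the class should make no difference for input decoded from valid UTF-8.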
