Checking for illegal surrogates in Python 3 strings

Question

Specifically in Python 3.3 and above, is it sufficient to check for orphan surrogates by using the simple match:

re.search(r'[\uD800-\uDFFF]', s)

This assumes that every legal surrogate pair is already represented as a single astral code point and so will not match, leaving only the illegal surrogates. Or are there caveats and edge cases one needs to be aware of?

I'm not fluent enough in Unicode to answer this specifically, but maybe the best way would be to actually decode the string and check for errors? That is probably the safest way to make sure there is no edge case. - mefyl, Sep 14 '15 at 13:01
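For a Python 3 str, the check suggested in this comment would presumably mean encoding rather than decoding: a strict UTF-8 encode raises UnicodeEncodeError when the string contains a surrogate code point. A minimal sketch under that assumption follows. The helper name has_surrogate_by_encoding is hypothetical and does not come from the thread.

```python
def has_surrogate_by_encoding(s: str) -> bool:
    """Hypothetical helper: report whether s contains a surrogate code point.

    Relies on the strict UTF-8 codec refusing to encode U+D800..U+DFFF.
    """
    try:
        s.encode("utf-8")  # default error handler is 'strict'
    except UnicodeEncodeError:
        return True
    return False


# An astral character is a single code point in Python 3.3+, so it encodes fine.
print(has_surrogate_by_encoding("smile \U0001F600"))
# A lone high surrogate cannot be encoded.
print(has_surrogate_by_encoding("broken \ud83d"))
```

This does the same job as the regex but builds a throwaway bytes object, so the regex is likely the cheaper option when only detection is needed.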

score 3 · Answer 1 · answered Sep 14 '15 at 21:16

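Yes, that's correct. The values 0xD800–0xDFFF don't represent valid characters in wide Unicode strings, and in Python 3.3+ (following PEP 393) all Unicode strings are effectively wide. An astral character is therefore stored as a single code point above 0xFFFF, never as a surrogate pair, so any surrogate the regex finds is an orphan.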
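A short sketch of the behaviour described above, assuming Python 3.3 or later. The names SURROGATE_RE, astral and orphan are only illustrative.

```python
import re

# Same character class as in the question, precompiled for reuse.
SURROGATE_RE = re.compile(r'[\uD800-\uDFFF]')

astral = "\U0001F600"      # one code point above U+FFFF
orphan = "abc\ud800def"    # contains a lone high surrogate

print(len(astral))                          # 1
print(SURROGATE_RE.search(astral))          # None
print(SURROGATE_RE.search(orphan).start())  # 3
```

Note that two surrogates written side by side in a str, such as "\ud83d\ude00", are not merged into one astral character. They remain two separate surrogate code points, and the regex matches them as well.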

bobince
