Regex match invalid Unicode characters

Question

I have strings like this:
ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ
and I want to filter out all these invalid characters beginning with a backslash, which I am trying to do with regex in Python.

It does work like this:

re.sub(r",\u0f6e,", r",deleted,", s)

But not like this:

re.sub(r",\.{5},", r",deleted,", s)

It should work according to http://pythex.org, so I guess it's because they are invalid characters? How can I match them?

Edit: @metatoaster said my question is ambiguous: The problem seems to arise because the input string s is not a raw string.

>>> s = ' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'

Just out of curiosity, why are the Unicode characters that don't display as single character invalid? These are the same as the Unicode characters that are represented by a single character, just represented in a different way. The resulting character can be seen when you print out the string and your font supports those Unicode characters. — 3limin4t0r, Nov 09 '18 at 21:48
I recommend taking a look at the [Python Unicode HOWTO](https://docs.python.org/3.7/howto/unicode.html). — 3limin4t0r, Nov 09 '18 at 21:56

Mark Tolonen · Accepted Answer · 2018-11-10T05:16:17.283

3

It seems you have a string with undefined Unicode codepoints. \u0f6e is a single code point represented as an escape code. Example:

>>> s = 'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> s
'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> print(s)
ꐊ,ꀵ,཮,ⴗ,ꦚ,⵵,ꢯ,⾌,꥽,⩱,ㇴ,⵮,鼺, Ꞁ

Note how printing the string shows each of these characters as an undefined box. They are displayed as escape codes in the repr for debugging purposes. These code points have a few things in common. According to the Unicode database, they all fall in general category C ("Other"): the \uXXXX ones are unassigned (Cn) and \x00 is a control character (Cc). They also don't have names. A quick way to filter is:

>>> import unicodedata as ud
>>> ''.join(['deleted' if ud.category(c)[0] == 'C' else c for c in s])
'ꐊ,ꀵ,deleted,ⴗ,ꦚ,deleted,ꢯ,⾌,deleted,⩱,ㇴ,deleted,鼺,deletedꞀ'
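
As a rough sketch of what the category check is doing, the snippet below lists each offending code point with its category and then strips them out. The helper name `strip_other` and the shortened sample string are illustrative assumptions, not part of the original answer, and the categories reported depend on the Unicode database bundled with your Python version.

```python
import unicodedata as ud

# Shortened sample: two unassigned code points and one NUL control character.
s = 'ꐊ,ꀵ,\u0f6e,ⴗ,\u2d75,\x00Ꞁ'

def strip_other(text, replacement=''):
    """Replace every code point whose general category starts with 'C'."""
    return ''.join(
        replacement if ud.category(c).startswith('C') else c
        for c in text
    )

# Show which characters are affected and why.
for c in s:
    if ud.category(c).startswith('C'):
        print(f'U+{ord(c):04X}', ud.category(c), ud.name(c, '<unnamed>'))

cleaned = strip_other(s)
print(repr(cleaned))
```

Category C also covers format characters (Cf, such as the zero-width joiner), private-use code points (Co) and surrogates (Cs), so a stricter check against only 'Cn' and 'Cc' may be preferable if the text can legitimately contain those.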

edited Nov 10 '18 at 05:16

answered Nov 09 '18 at 21:18


the objective of the OP seems to be "filter out" the invalid characters. It seems that using regexps is unnecessary at all, as a simple loop over the string characters using these techniques would suffice. – jsbueno Nov 10 '18 at 00:57

@jsbueno Yes, I just regurgitated the OP's attempt with some filtering. With some timings to back it up, `''.join(['deleted' if ud.category(c)[0] == 'C' else c for c in s])` is the fastest way. – Mark Tolonen Nov 10 '18 at 05:14

score 0 · Answer 2 · answered Nov 08 '18 at 23:29

I don't see how your first re.sub statement would have worked, if your string was truly defined as is.

>>> s = r' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,\\u0f6e,ⴗ,ꦚ,\\u2d75,ꢯ,⾌,\\ua97d,⩱,ㇴ,\\u2d6e,鼺,\\x00Ꞁ'

Note how the first \u0f6e remains unreplaced. In a regex, the \ character is special too (in a pattern, \u0f6e means the single code point U+0F6E), so to match a literal backslash it must itself be escaped as \\. Now try:

>>> re.sub(r",\\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\\u2d75,ꢯ,⾌,\\ua97d,⩱,ㇴ,\\u2d6e,鼺,\\x00Ꞁ'
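
To make the raw versus non-raw distinction concrete, here is a small sketch. The variable names and the short test strings are illustrative assumptions. It shows that a normal string literal holds one code point where a raw literal holds six characters, and that `re.escape` can build the backslash-escaped pattern for you.

```python
import re

decoded = '\u0f6e'    # normal literal: Python turns the escape into one code point
literal = r'\u0f6e'   # raw literal: backslash, 'u', and four hex digits

print(len(decoded), len(literal))

# re.escape produces a pattern that matches the raw text literally,
# so the backslash does not need to be doubled by hand.
pattern = re.escape(literal)
from_raw = re.sub(pattern, 'deleted', r'a,\u0f6e,b')

# Against a non-raw string, the pattern \u0f6e matches the code point itself.
from_decoded = re.sub(r'\u0f6e', 'deleted', 'a,\u0f6e,b')

print(from_raw, from_decoded)
```

This is also why the first `re.sub` in the question worked on a non-raw string: both the pattern and the string refer to the same single code point.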

In order to match the actual expression and not more than necessary, do note that the \u sequence is followed by exactly 4 hex digits (0-9 and a-f). Instead of trying to match any 5 characters, be more specific, like:

>>> re.sub(r",\\u[0-9a-f]{4},", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,deleted,ꢯ,⾌,deleted,⩱,ㇴ,deleted,鼺,\\x00Ꞁ'
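
The \x00 at the end is left over because it uses the two-digit \x form and is not followed by a comma. One possible extension, assuming the text really contains literal backslashes and that only the \uXXXX and \xXX forms occur, is to drop the commas from the pattern and accept either form. The name `ESCAPE` and the shortened sample are illustrative.

```python
import re

# Raw string: the escapes below are literal backslash sequences, not code points.
s = r'ꐊ,\u0f6e,ⴗ,\u2d75,鼺,\x00Ꞁ'

# A literal backslash, then either 'u' + 4 hex digits or 'x' + 2 hex digits.
ESCAPE = re.compile(r'\\(?:u[0-9a-fA-F]{4}|x[0-9a-fA-F]{2})')

result = ESCAPE.sub('deleted', s)
print(result)
```

Without the surrounding commas the pattern matches an escape anywhere in the string, which may or may not be what you want if the fields themselves can contain backslashes.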

Note that this entire answer assumes the information you have given us is correct, and that the escape sequences in your string actually begin with a literal backslash character. It would be useful to update your question to include code fragments like the ones here, so it is less ambiguous what is being done (we can then copy-paste your code, run it to see what went wrong, and correct it more easily).
