How to strip unicode "punctuation" from Python string

Question

Here's the problem: I have a Unicode string as input to a Python SQLite query. The query failed ('like'). It turns out the string, 'FRANCE', doesn't have six characters, it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

That's not punctuation. Its existence indicates a gross failure in some upstream process. — John Machin, Mar 24 '11 at 05:11
Seriously, how on earth did you end up with a BOM as the seventh character? — Josh Lee, Mar 24 '11 at 05:32

score 11 · Accepted Answer · 2011-03-24T04:52:36.010

You can use the general categories from the Unicode character database, which Python exposes through the unicodedata module:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter them out character by character (for example, with a list comprehension).

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

So you can whitelist characters based on their categories.
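
As a minimal sketch of that category-based filtering, assuming Python 3 (where every str is Unicode): strip_categories is a hypothetical helper name, and the default prefixes are an assumption chosen to drop punctuation (P*) as well as control and format characters (C*), which covers U+FEFF. Note that C* also includes tabs and newlines, so adjust the prefixes if you need to keep those.

```python
import unicodedata

def strip_categories(text, prefixes=("P", "C")):
    """Return text without characters whose Unicode general category
    starts with one of the given prefixes.

    strip_categories is a hypothetical helper, not part of unicodedata.
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )

cleaned = strip_categories("FRANCE\ufeff")
print(len(cleaned), cleaned)
```

Spaces are in category 'Zs', so they survive this filter.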

edited Mar 24 '11 at 04:52

answered Mar 24 '11 at 04:45

Thanks for all these suggestions. They've been most useful! – Dave Fultz Mar 26 '11 at 22:20
Oh, yes, and in answer to the questions in the other remarks the source strings are from webpages (where as you know, anything can happen). – Dave Fultz Mar 26 '11 at 22:22
+1 for an excellent answer. This strikes me as the proper way of doing it. – El Zorko May 12 '11 at 21:41

score 1 · Answer 2 · answered Mar 24 '11 at 04:42

That's also the byte-order mark (BOM). Just clean up your strings first to eliminate those, using something like:


>>> f = u'France\ufeff'
>>> f
u'France\ufeff'
>>> print f
France
>>> f.replace(u'\ufeff', '')
u'France'
>>> f.strip(u'\ufeff')
u'France'
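
The two calls above only give the same result because the character sits at the end of the string. A small sketch of the difference, written in Python 3 syntax (an assumption; the session above is Python 2) with a made-up input string:

```python
# str.strip only trims the ends; str.replace removes every occurrence.
s = "\ufeffFRA\ufeffNCE\ufeff"  # made-up input with U+FEFF at both ends and inside

stripped = s.strip("\ufeff")        # leading and trailing U+FEFF removed, inner one kept
replaced = s.replace("\ufeff", "")  # all three occurrences removed

print(len(stripped), len(replaced))
```

If the stray character can appear anywhere in the input, replace is the safer of the two.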

jcomeau_ictx

This won't remove arbitrary "crap characters". – Mar 24 '11 at 04:53
that is true. I was only addressing the one issue described. – jcomeau_ictx Mar 24 '11 at 04:54
It's only a BOM at the start of a string. – John Machin Mar 24 '11 at 05:07
that's true also, but then I don't know how the data was obtained in the first place; perhaps a concatenation of files made with Notepad on a Windows box. – jcomeau_ictx Mar 24 '11 at 05:08

score 1 · Answer 3 · answered Mar 24 '11 at 04:56

In general, input validation should be done by using a whitelist of allowable characters if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only contain upper-case English letters and spaces. You could strip out everything else, including your nasty Unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
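
If you would rather reject bad input than silently repair it, the same whitelist can be used for validation. A hedged sketch in Python 3, where is_valid_country is a hypothetical name and the allowed set (upper-case ASCII letters and spaces) is the assumption from the example above:

```python
import re

# Whitelist: one or more upper-case ASCII letters or spaces, nothing else.
ALLOWED = re.compile(r"[A-Z ]+")

def is_valid_country(value):
    """Return True only if every character of value is on the whitelist."""
    return ALLOWED.fullmatch(value) is not None

for candidate in ("FRANCE", "FRANCE\ufeff"):
    print(repr(candidate), is_valid_country(candidate))
```

Rejecting has the advantage that the upstream problem gets noticed instead of being papered over.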

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all of the tens of thousands of possible unexpected Unicode characters that could be thrown at you, and more are added to the specs as languages evolve over the years.

Oh, and this whitelist method isn't limited to using regular expressions. If you're okay with something like "any Unicode character that's not punctuation", then you could iterate through the string, checking each character against unicodedata.category(...), as pynator suggested. — Nathan Stocks, Mar 24 '11 at 05:01
