7

Here's the problem: I have a Unicode string as input to a Python SQLite query. The query (a 'like') failed. It turns out the string 'FRANCE' doesn't have six characters; it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

Yu Hao
Dave Fultz

3 Answers

11

You can use the Unicode general categories, which Python exposes through the unicodedata module:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter unwanted characters out one by one (for example, with a list comprehension).

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

So you may perform some whitelisting based on the categories for characters.
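
As a rough sketch of that category-based whitelisting: the function name `keep_allowed` and the choice of allowed categories (letters, numbers and plain spaces) are assumptions made for illustration, not something from the question. Adjust the tuple to whatever your data legitimately contains.

```python
import unicodedata

# Hypothetical whitelist: any letter (L*), any number (N*), and the
# space separator (Zs). Everything else, including 'Cf' format
# characters such as U+FEFF, is dropped.
ALLOWED_CATEGORY_PREFIXES = (u'L', u'N', u'Zs')


def keep_allowed(text, allowed=ALLOWED_CATEGORY_PREFIXES):
    """Return text with every character outside the allowed categories removed."""
    return u''.join(
        ch for ch in text
        if unicodedata.category(ch).startswith(allowed)
    )


cleaned = keep_allowed(u'FRANCE\ufeff')
```

Note that this particular whitelist also removes punctuation, so it would strip the apostrophe from a name that contains one.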

1

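That's also the byte-order mark (BOM). Just clean up your strings first to eliminate it, using something like the following. Note that replace removes it anywhere in the string, while strip only removes it from the ends:
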
>>> f = u'France\ufeff'
>>> f
u'France\ufeff'
>>> print f
France
>>> f.replace(u'\ufeff', '')
u'France'
>>> f.strip(u'\ufeff')
u'France'
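
To tie this back to the query in the question, here is a minimal sketch using an in-memory SQLite database. The `countries` table, its contents and the `clean` helper are all made up for the example; only the U+FEFF stripping comes from the snippet above.

```python
import sqlite3

BOM = u'\ufeff'


def clean(text):
    """Remove every U+FEFF from text, wherever it occurs."""
    return text.replace(BOM, u'')


# Hypothetical table, just to have something to query.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE countries (name TEXT)')
conn.execute('INSERT INTO countries (name) VALUES (?)', (u'France',))

dirty = u'FRANCE\ufeff'  # seven characters, the last one invisible
query = 'SELECT name FROM countries WHERE name LIKE ?'

rows_dirty = conn.execute(query, (dirty,)).fetchall()         # no match
rows_clean = conn.execute(query, (clean(dirty),)).fetchall()  # matches 'France'
conn.close()
```

The cleaned value matches because SQLite's LIKE is case-insensitive for ASCII letters by default; the uncleaned one fails only because of the trailing invisible character.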
jcomeau_ictx
1

In general, input validation should be done by using a whitelist of allowable characters if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only have upper-case English letters and spaces. You could strip out everything else, including your nasty Unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
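
The other option mentioned above, rejecting the input altogether instead of silently stripping it, might look like the sketch below. It assumes the same upper-case-letters-and-spaces whitelist; `validate_country` is a hypothetical name.

```python
import re

# Same whitelist as above, but anchored so the WHOLE string must match.
ALLOWED_COUNTRY = re.compile(r'[A-Z ]+\Z')


def validate_country(value):
    """Return value unchanged if it is whitelisted, else raise ValueError."""
    if ALLOWED_COUNTRY.match(value) is None:
        raise ValueError('unexpected characters in %r' % (value,))
    return value


try:
    validate_country(u'FRANCE\ufeff')
    rejected = False
except ValueError:
    rejected = True
```

Rejecting is noisier than stripping, but it makes the bad input visible at the point where it enters the program.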

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all the tens of thousands of unexpected Unicode characters that could be thrown at you, and more are added to the standard over the years.

Nathan Stocks
    Oh, and this whitelist method isn't limited to using regular expressions. If you're okay with something like "any unicode character that's not punctuation" then you could iterate through the string checking the characters against unicodedata.category(...) that pynator suggested. – Nathan Stocks Mar 24 '11 at 05:01