
I'm trying to use the ftfy Python package to fix Unicode errors in a CSV file, but it fails on lines that contain \xa0.

I don't understand why this is happening or how it should be properly fixed.

Here is an example that causes the problem:

>>> import ftfy
>>> txt = 'LinkÃ¶pings Universitet,\xa0LiU'
>>> ftfy.explain_unicode(txt)
U+004C  L       [Lu] LATIN CAPITAL LETTER L
U+0069  i       [Ll] LATIN SMALL LETTER I
U+006E  n       [Ll] LATIN SMALL LETTER N
U+006B  k       [Ll] LATIN SMALL LETTER K
U+00C3  Ã       [Lu] LATIN CAPITAL LETTER A WITH TILDE
U+00B6  ¶       [Po] PILCROW SIGN
U+0070  p       [Ll] LATIN SMALL LETTER P
U+0069  i       [Ll] LATIN SMALL LETTER I
U+006E  n       [Ll] LATIN SMALL LETTER N
U+0067  g       [Ll] LATIN SMALL LETTER G
U+0073  s       [Ll] LATIN SMALL LETTER S
U+0020          [Zs] SPACE
U+0055  U       [Lu] LATIN CAPITAL LETTER U
U+006E  n       [Ll] LATIN SMALL LETTER N
U+0069  i       [Ll] LATIN SMALL LETTER I
U+0076  v       [Ll] LATIN SMALL LETTER V
U+0065  e       [Ll] LATIN SMALL LETTER E
U+0072  r       [Ll] LATIN SMALL LETTER R
U+0073  s       [Ll] LATIN SMALL LETTER S
U+0069  i       [Ll] LATIN SMALL LETTER I
U+0074  t       [Ll] LATIN SMALL LETTER T
U+0065  e       [Ll] LATIN SMALL LETTER E
U+0074  t       [Ll] LATIN SMALL LETTER T
U+002C  ,       [Po] COMMA
U+00A0  \xa0    [Zs] NO-BREAK SPACE
U+004C  L       [Lu] LATIN CAPITAL LETTER L
U+0069  i       [Ll] LATIN SMALL LETTER I
U+0055  U       [Lu] LATIN CAPITAL LETTER U
>>> print(ftfy.fix_text(txt))
LinkÃ¶pings Universitet, LiU

Testing on a substring that doesn't contain \xa0 works correctly:

>>> print(ftfy.fix_text(txt[:24]))
Linköpings Universitet,

Replacing the \xa0 with a space also works:

>>> print(ftfy.fix_text(txt.replace('\xa0',' ')))
Linköpings Universitet, LiU

I'm not sure if this is the correct way to solve this, or whether it is safe to use without messing up other things.
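
One possible alternative to replacing the \xa0, sketched here with the standard library only, is to split the text on the no-break space, repair each piece separately, and join the pieces back with \xa0 so the character is preserved. The names `fix_segment` and `fix_keeping_nbsp` are hypothetical helpers made up for this illustration, and the sketch assumes every \xa0 in the input is a genuine no-break space rather than part of a garbled sequence. The `fixer` argument could presumably be swapped for `ftfy.fix_text`, though that is not tested here.

```python
def fix_segment(segment):
    """Undo 'UTF-8 read as cp1252' mojibake; return the segment unchanged if that fails."""
    try:
        return segment.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return segment


def fix_keeping_nbsp(text, fixer=fix_segment):
    """Repair each piece between no-break spaces, then restore the no-break spaces."""
    return '\xa0'.join(fixer(part) for part in text.split('\xa0'))


txt = 'LinkÃ¶pings Universitet,\xa0LiU'
fixed = fix_keeping_nbsp(txt)
print(repr(fixed))  # 'Linköpings Universitet,\xa0LiU'
```

A caveat of this approach: if a no-break space was itself garbled into `Â\xa0`, the split leaves a stray `Â` at the end of the preceding piece, so it is only a reasonable choice when the \xa0 characters are known to be intact.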

  • I would file an issue for [ftfy](https://github.com/LuminosoInsight/python-ftfy/issues/) - if they can't fix it, they may explain it. – MrBean Bremen May 30 '20 at 20:26
  • It's because the `ö` character was encoded as _UTF-8_ and then misread as _cp1252_, a [mojibake](https://en.wikipedia.org/wiki/Mojibake) you can reproduce with `'ö'.encode('utf-8').decode('cp1252')` (result `'Ã¶'`). However, the _No-Break Space_ character is already in _cp1252_ and not in _UTF-8_; had it gone through the same path, `'\xA0'.encode('utf-8').decode('cp1252')` would have given `'Â\xa0'`. So use consistently either _cp1252_ or (better) [UTF-8 Everywhere](https://utf8everywhere.org/). – JosefZ Feb 09 '21 at 19:21
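
The mixed-encoding explanation in the comment above can be checked with the standard library alone. The sketch below uses a hypothetical helper, `strict_undo`, that reverses the "UTF-8 bytes decoded as cp1252" mistake with no heuristics. It is not a description of how ftfy works internally; it only shows why a strict round trip cannot succeed on the string from the question.

```python
def strict_undo(text):
    """Reverse 'UTF-8 bytes decoded as cp1252'; return None if the text cannot have arisen that way."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


txt = 'LinkÃ¶pings Universitet,\xa0LiU'

# The lone byte 0xA0 is not valid UTF-8, so the whole string is rejected.
print(strict_undo(txt))                        # None
# Without the no-break space, the round trip succeeds.
print(strict_undo(txt[:24]))                   # Linköpings Universitet,
print(strict_undo(txt.replace('\xa0', ' ')))   # Linköpings Universitet, LiU

# If the entire string had gone through the same mis-decoding,
# the no-break space would appear as 'Â\xa0' and the reversal would work.
fully_garbled = 'Linköpings Universitet,\xa0LiU'.encode('utf-8').decode('cp1252')
print(repr(fully_garbled))                     # 'LinkÃ¶pings Universitet,Â\xa0LiU'
print(strict_undo(fully_garbled) == 'Linköpings Universitet,\xa0LiU')  # True
```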
