13

I have a bunch of byte strings (str, not unicode, in Python 2.7) containing Unicode data (encoded as UTF-8).

I am trying to join them (with "".join(utf8_strings) or u"".join(utf8_strings)), which throws

UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 0: ordinal not in range(128)

Is there any way to use the .join() method on non-ASCII strings? Sure, I could concatenate them in a for loop, but that wouldn't be efficient.

thkang

2 Answers

17

Joining byte strings using ''.join() works just fine; the error you see would only appear if you mixed unicode and str objects:

>>> utf8 = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8')]
>>> ''.join(utf8)
'\xc4\xa3\xc8\xb4'
>>> u''.join(utf8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> ''.join(utf8 + [u'unicode object'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

The exceptions above are raised when using the Unicode value u'' as the joiner, and adding a Unicode string to the list of strings to join, respectively.
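
If the list does end up holding both types, one way out is to decode every byte string to Unicode before joining. This is a minimal sketch, assuming the byte strings are UTF-8 as in the question; `join_as_text` is a hypothetical helper name, not part of any library:

```python
# -*- coding: utf-8 -*-

def join_as_text(pieces, encoding='utf-8'):
    """Join a mix of byte strings and Unicode strings into one Unicode string.

    Hypothetical helper; assumes every byte string is valid `encoding` data.
    """
    # On Python 2.7, `bytes` is an alias for `str`, so this check matches
    # the byte strings and leaves `unicode` objects untouched.
    decoded = [p.decode(encoding) if isinstance(p, bytes) else p
               for p in pieces]
    return u''.join(decoded)


mixed = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8'), u'unicode object']
joined = join_as_text(mixed)  # a Unicode string, no implicit ASCII decode
```

Because every element is Unicode by the time `u''.join()` runs, Python never has to fall back on the implicit ASCII decode that raises the error.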

Martijn Pieters
  • 1
    how would one un-mix `unicode` and `str` objects then? – fiona Sep 13 '17 at 14:42
  • 1
    @fiona decode your byte strings to Unicode, then join. It's best to decode as early as possible, and encode only when you are done with the text and must pass it on to something that'll only accept bytes. – Martijn Pieters Sep 13 '17 at 14:53
2

"".join(...) will work if every item you pass it is a str (whatever the encoding may be).

The issue you are seeing is probably not related to the join, but the data you supply to it. Post more code so we can see what's really wrong.
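
One quick way to check the data is to see which elements are byte strings and which are Unicode before joining. A rough sketch; `find_byte_strings` is a made-up name for illustration, and the sample list is invented:

```python
# -*- coding: utf-8 -*-

def find_byte_strings(pieces):
    """Return the indexes of elements that are byte strings, not Unicode."""
    # `bytes` is the same type as `str` on Python 2.7.
    return [i for i, p in enumerate(pieces) if isinstance(p, bytes)]


# Invented sample data: one element was encoded somewhere along the way.
utf8_strings = [u'\uc548', u'\ub155'.encode('utf8'), u'abc']
suspects = find_byte_strings(utf8_strings)  # indexes of the odd ones out
```

If the result is neither empty nor the full range of indexes, the list is mixed, and that mix is what triggers the implicit ASCII decode.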

afflux
  • 1
    Thanks for your help. The `utf8_strings` are data loaded by `xlrd`. `xlrd`, a magnificent Python module, thankfully returns all (non-numerical) data as `unicode`. I fiddled with them, and it seems I turned some of them into `str`. – thkang Feb 07 '13 at 19:08