I have a text file filled with Unicode characters written as "\ud83d\udca5", but Python doesn't seem to like them.
But if I replace one with u'\U0001f4a5', which seems to be its Python escape form (according to Charbase), it works.

Is there a way to convert them all into the u"\Uxxxxxxxx" escape format that Python can understand?

Thanks.

DasFranck
  • That's because that's UTF-16, not UTF-8. – Joey Oct 07 '16 at 10:01
  • @Joey: That's not the (entire) point. There is a fundamental difference between a Unicode object and an encoded bytes sequence (encoded by UTF-16, UTF-8 or whatever else). – Tim Pietzcker Oct 07 '16 at 10:06
  • Yeah, I guess so, but I have UTF-16 chars in a UTF-8 file. That's the problem. – DasFranck Oct 07 '16 at 10:06
  • @TimPietzcker: My comment referred mostly to "I have this stuff in a UTF-8 file and it doesn't work properly" – Joey Oct 07 '16 at 10:10
  • Do you mean you have a file with literal backslashes and letter ‘u’s in it? If so, you need to work out what format it is and use a suitable parser for that, e.g. it might be JSON. – bobince Oct 09 '16 at 09:41

1 Answer

You're mixing up Unicode and encoded strings. u'\U0001f4a5' is a Unicode object, Python's internal datatype for handling strings. (In Python 3, the u is optional, since all strings are now Unicode objects.)

Files, on the other hand, use encodings. UTF-8 is the most common one, but it's just one means of storing a Unicode object in a byte-oriented file or stream. When opening such a file, you need to specify the encoding so Python can translate the bytes into meaningful Unicode objects.
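To make the distinction concrete, here is a minimal sketch, assuming Python 3. It encodes the single character from the question with two codecs; the variable names are only illustrative.

```python
# One Unicode string holding a single code point, U+1F4A5.
boom = "\U0001f4a5"

# Two different byte encodings of that same string.
as_utf8 = boom.encode("utf-8")
as_utf16_be = boom.encode("utf-16-be")

print(len(boom))      # number of code points in the string
print(as_utf8)        # the UTF-8 bytes
print(as_utf16_be)    # the UTF-16 bytes: the surrogate pair D83D DCA5

# Decoding with the matching codec recovers the same string.
assert as_utf8.decode("utf-8") == boom
assert as_utf16_be.decode("utf-16-be") == boom
```

The UTF-16 bytes are where the D83D/DCA5 pair in the question comes from: it is the same character, just written as two 16-bit units.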

In your case, it seems you need to open the file using the UTF-16 codec instead of UTF-8.

with open("myfile.txt", encoding="utf-16") as f:
    s = f.read()

will give you the proper contents if the codec is in fact UTF-16. If it doesn't look right, try "utf-16-le" or "utf-16-be".
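One way to compare candidates side by side is to read the raw bytes once and decode them with each codec. This is only a rough sketch, assuming Python 3; `preview_decodings` is a hypothetical helper, and the sample bytes stand in for the contents of the real file.

```python
def preview_decodings(raw, codecs=("utf-8", "utf-16", "utf-16-le", "utf-16-be"), limit=60):
    """Try each codec on raw bytes; map codec name to a text preview or an error note."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)[:limit]
        except UnicodeDecodeError as exc:
            results[name] = "error: " + exc.reason
    return results

# Stand-in for: raw = open("myfile.txt", "rb").read()
sample = "hi \U0001f4a5".encode("utf-16-le")
for codec, preview in preview_decodings(sample).items():
    print(codec, "->", repr(preview))
```

Note that a wrong codec does not always raise an error; it may decode to nonsense instead, so the previews still need to be checked by eye.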

Tim Pietzcker
  • Well, I tried, but when I open the file with ```utf-16```, I get: ```UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9526-9527: illegal UTF-16 surrogate```. Same with ```utf-16-be```. I can open it with ```utf-8```, but then I still have the \uxxxx\uxxxx problem. – DasFranck Oct 07 '16 at 10:43
  • Then it's using a different encoding altogether. Unfortunately, there is no way to reliably determine that encoding - you need to check at the source of the file. Can you post a relevant sample of the file? – Tim Pietzcker Oct 07 '16 at 11:17
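If the file turns out to contain literal backslash-u sequences as plain text, as bobince's comment on the question suggests, then no codec will help and the escapes have to be parsed instead. The following is a sketch under that assumption, for Python 3; `decode_surrogate_escapes` is a hypothetical helper that only handles surrogate pairs and leaves every other escape alone.

```python
import json
import re

# A high-surrogate escape (\ud800-\udbff) followed by a low-surrogate escape (\udc00-\udfff).
_PAIR = re.compile(r"\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})", re.IGNORECASE)

def _join_pair(match):
    """Combine the two 16-bit surrogate values into one code point."""
    high = int(match.group(1), 16)
    low = int(match.group(2), 16)
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

def decode_surrogate_escapes(text):
    """Replace literal surrogate-pair escapes in text with the real character."""
    return _PAIR.sub(_join_pair, text)

line = r"boom \ud83d\udca5"   # raw string: literal backslashes, as read from the file
assert decode_surrogate_escapes(line) == "boom \U0001f4a5"

# If the file is actually JSON, the json module already combines such pairs.
assert json.loads(r'"\ud83d\udca5"') == "\U0001f4a5"
```

If the file is JSON, parsing it with the `json` module is the safer route, since it also handles the other escape forms that the regex above ignores.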