How to convert "\uxxxx\uxxxx" to u'\Uxxxxxxxx'?

Question

I have a text file that is full of Unicode characters written as "\ud83d\udca5", but Python doesn't seem to like them.
If I replace one of them with u'\U0001f4a5', which seems to be its Python escape form (according to Charbase), it works.

Is there a way to convert them all into the u"\Uxxxxxxxx" escape format that Python can understand?

Thanks.

@Joey: That's not the (entire) point. There is a fundamental difference between a Unicode object and an encoded bytes sequence (encoded by UTF-16, UTF-8 or whatever else). — Tim Pietzcker, Oct 07 '16 at 10:06
Yeah, I guess so but I have UTF-16 chars in an UTF-8 file. That's the problem. — DasFranck, Oct 07 '16 at 10:06
@TimPietzcker: My comment referred mostly to "I have this stuff in an UTF-8 file and it doesn't work properly" — Joey, Oct 07 '16 at 10:10
Do you mean you have a file with literal backslashes and letter ‘u’s in? If so you need to work out what format it is and use a suitable parser for that. eg it might be JSON. — bobince, Oct 09 '16 at 09:41

score 0 · Answer 1 · answered Oct 07 '16 at 10:35

You're mixing up Unicode and encoded strings. u'\U0001f4a5' is a Unicode object, Python's internal datatype for handling strings. (In Python 3, the u is optional since now all strings are Unicode objects).

Files, on the other hand, use encodings. UTF-8 is the most common one, but it's just one means of storing a Unicode object in a byte-oriented file or stream. When opening such a file, you need to specify the encoding so Python can translate the bytes into meaningful Unicode objects.
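To make the distinction concrete, here is a minimal Python 3 sketch. It takes the single character from the question and shows that it becomes different byte sequences depending on the codec; the variable names are only for illustration.

```python
# One Unicode character (a single code point, U+1F4A5).
text = "\U0001f4a5"

# The same character stored with two different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

print(len(text))     # number of code points in the string
print(utf8_bytes)    # bytes as they would appear in a UTF-8 file
print(utf16_bytes)   # bytes as they would appear in a UTF-16 (big-endian) file

# Decoding with the matching codec gives back the same string.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-be") == text
```

In the UTF-16 bytes you can recognise the D83D/DCA5 surrogate pair from the question, which is probably where the "\ud83d\udca5" notation comes from.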

In your case, it seems you need to open the file using the UTF-16 codec instead of UTF-8.

with open("myfile.txt", encoding="utf-16") as f:
    s = f.read()

will give you the proper contents if the codec is in fact UTF-16. If it doesn't look right, try "utf-16-le" or "utf-16-be".

Tim Pietzcker

Well, I tried, but when I open the file with `utf-16`, I get: `UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9526-9527: illegal UTF-16 surrogate`. Same with `utf-16-be`. I can open it with `utf-8`, but then I still have the \uxxxx\uxxxx problem. – DasFranck Oct 07 '16 at 10:43
Then it's using a different encoding altogether. Unfortunately, there is no way to reliably determine that encoding - you need to check at the source of the file. Can you post a relevant sample of the file? – Tim Pietzcker Oct 07 '16 at 11:17
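If it turns out that the file really is UTF-8 and contains literal backslash-u text (as bobince's comment on the question suggests), the escapes could be combined after reading the file. The following is only a sketch under that assumption, for Python 3; `combine_surrogate_escapes` and `SURROGATE_PAIR` are hypothetical names, not part of any library. If the file is actually JSON, the standard `json` module should be used instead, since it already joins surrogate pairs.

```python
import json
import re

# Matches a high-surrogate escape followed by a low-surrogate escape,
# both written as literal text (e.g. the 12 characters \ud83d\udca5).
SURROGATE_PAIR = re.compile(
    r"\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})", re.IGNORECASE
)

def combine_surrogate_escapes(text):
    """Replace each literal surrogate-pair escape with the real character."""
    def _join(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        # Standard UTF-16 surrogate arithmetic.
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return chr(code_point)
    return SURROGATE_PAIR.sub(_join, text)

# Raw string: the backslashes are literal, as they would be in the file.
raw = r"boom: \ud83d\udca5"
converted = combine_surrogate_escapes(raw)
print(converted == "boom: \U0001f4a5")

# If the data is a JSON string, json.loads does the same job.
from_json = json.loads('"' + r"\ud83d\udca5" + '"')
print(from_json == "\U0001f4a5")
```

This only handles surrogate pairs; other \uxxxx escapes in the text are left untouched, so check the real file format before relying on it.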
