You're mixing up Unicode and encoded strings. u'\U0001f4a5' is a Unicode object, Python's internal datatype for handling strings. (In Python 3, the u prefix is optional, since all strings are now Unicode objects.)
Files, on the other hand, use encodings. UTF-8 is the most common one, but it's just one means of storing a Unicode object in a byte-oriented file or stream. When opening such a file, you need to specify the encoding so Python can translate the bytes into meaningful Unicode objects.
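To make the distinction concrete, here is a small sketch (the variable names are only illustrative) showing the same one-character string turned into different bytes by two codecs, and what happens when the bytes are read back with the wrong one:

```python
# A single Unicode character, U+1F4A5.
text = "\U0001f4a5"

# The same string becomes different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding with the matching codec recovers the original string.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text

# Decoding with the wrong codec may not raise an error at all;
# here it silently produces two unrelated characters instead.
wrong = utf8_bytes.decode("utf-16-le")
```

The last line is the kind of mismatch that makes file contents "look wrong" even though nothing failed.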
In your case, it seems you need to open the file using the UTF-16 codec instead of UTF-8.
    with open("myfile.txt", encoding="utf-16") as f:
        s = f.read()
This will give you the proper contents if the codec is in fact UTF-16. If it doesn't look right, try "utf-16-le" or "utf-16-be", which fix the byte order explicitly for files that have no byte-order mark.
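If you would rather not guess by hand, one option is to look at the first few bytes of the file for a byte-order mark (BOM). The sketch below is only an assumption about how you might do that: guess_encoding is a hypothetical helper, not something from the standard library, and a file without a BOM still falls back to a default you have to choose yourself.

```python
import codecs


def guess_encoding(path, default="utf-8"):
    """Hypothetical helper: pick a codec name from the file's byte-order mark."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM starts with the UTF-16 one.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # this codec reads the BOM and strips it
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # like UTF-8, but drops the leading BOM
    return default           # no BOM found: this is only a guess
```

You would then pass the returned name as the encoding argument to open(). Files saved as UTF-16 by Windows tools usually start with a BOM, so this catches the common case, but it cannot identify BOM-less files.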