
I have a bytes-like object like this:

aa = b'abc\u6df7\u5408def.mp3'

I want to save it to a file in binary mode. My code is below, but it does not work as expected:

if __name__=="__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))

    with open('database.bin', "wb") as datafile:
        datafile.write(aa)

The data in the file looks like this:

enter image description here

But I want the file to contain the Unicode characters as binary data, like this:

enter image description here

How can I convert the bytes before saving them to the file?

  • As a first step, we could convert aa to bb = b'abc\\xf7\\x6d\\x08\\x54def.mp3' and then call datafile.write(bb). But how do I do that? – Castle Odinland Dec 04 '18 at 07:09
  • Your input is not what you hope it to be. It seems like you want `aa = 'abc\u6df7\u5408def.mp3'.encode('utf-8')` to initialize the byte string. The byte string `b'\\u'` is simply `b'\\'` (backslash) followed by `b'u'` (byte string with lowercase letter `u`). – tripleee Dec 04 '18 at 07:43
  • I changed aa to aa = b'abc\u6df7\u5408def.mp3', but the problem remains. – Castle Odinland Dec 04 '18 at 07:53
  • No, `\u` in a byte string isn't well-defined. If you want the UTF-8 encoding of a Unicode string, you have to say so. – tripleee Dec 04 '18 at 07:56
  • Actually, you seem to want `utf-16be` but the rest still applies trivially. – tripleee Dec 04 '18 at 07:57
  • I have a question: is there any difference between b'abc\u6df7\u5408def.mp3' and b'abc\xf7\x6d\x08\x54def.mp3'? And how do I convert from one to the other? – Castle Odinland Dec 04 '18 at 08:40
  • One of them is not well-defined. To convert from a string to a byte string, `encode` using the encoding you want. To go the other way, `decode`; then, you obviously have to know (or guess correctly) the encoding. – tripleee Dec 04 '18 at 08:50

1 Answer


\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are abstract identifiers for the characters in a string, and do not by themselves map to a serialization format (consisting of bytes, or, in principle, any other sort of concrete symbolic representation).
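To make the point concrete, here is a minimal sketch of what the literal from the question actually contains. The names `raw`, `text`, and `recovered` are only illustrative, and a raw bytes literal (`rb'...'`) is used on the assumption that it holds the same bytes as the question's literal while avoiding the invalid-escape warning that newer Python versions emit.

```python
# In a bytes literal, \u is not an escape: the backslash and the 'u' are
# stored as two ordinary bytes, followed by the four hex digits as ASCII.
raw = rb'abc\u6df7\u5408def.mp3'

print(len(raw))     # 22: each \uNNNN occupies six bytes
print(raw[3:9])     # b'\\u6df7'

# In a str literal, the same escapes denote two code points.
text = 'abc\u6df7\u5408def.mp3'
print(len(text))    # 12

# decode('unicode-escape'), as used in the question, interprets the
# backslash sequences and returns a str, not bytes.
recovered = raw.decode('unicode-escape')
print(recovered == text)    # True
```

So the question's `print` call shows the expected characters, but `datafile.write(aa)` still writes the 22 literal bytes, backslashes included.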

There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.

aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')

With that out of the way, I believe the rest of your code should work as expected.
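As a rough sketch of the whole flow, the snippet below encodes the string, writes it in binary mode, and reads it back. It writes to a temporary directory rather than the current one, which is an assumption made only to keep the example self-contained. It also shows the little-endian variant for comparison, since the byte sequence `\xf7\x6d\x08\x54` mentioned in the comments has the low-order byte first.

```python
import tempfile
from pathlib import Path

text = 'abc\u6df7\u5408def.mp3'

data_be = text.encode('utf-16-be')   # high-order byte first
data_le = text.encode('utf-16-le')   # low-order byte first

# 'abc' takes the first six bytes; the two CJK characters follow.
print(data_be[6:10].hex())   # 6df75408
print(data_le[6:10].hex())   # f76d0854

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / 'database.bin'
    with open(path, 'wb') as datafile:
        datafile.write(data_be)
    # Reading back only works if the reader uses the same encoding.
    round_trip = path.read_bytes().decode('utf-16-be')

print(round_trip == text)    # True
```

Note that in either UTF-16 variant the ASCII characters also take two bytes each, so the file is 24 bytes long rather than a mix of one-byte and two-byte characters.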

Unicode on disk is always encoded, but you have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
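A small sketch of how a BOM works in practice, using the constants from Python's standard `codecs` module. The variable names are illustrative only.

```python
import codecs

text = 'abc\u6df7\u5408def.mp3'

# Prepend the big-endian BOM by hand; 'utf-16-be' itself never writes one.
with_bom = codecs.BOM_UTF16_BE + text.encode('utf-16-be')
print(with_bom[:2].hex())    # feff

# The generic 'utf-16' decoder reads the BOM to pick the byte order,
# then drops it from the result.
decoded = with_bom.decode('utf-16')
print(decoded == text)       # True

# Encoding with plain 'utf-16' writes a BOM in the platform's native order.
native = text.encode('utf-16')
print(native[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))   # True
```

If the program that reads the file expects a fixed byte order and no BOM, the explicit `utf-16-be` or `utf-16-le` codecs are the safer choice.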

However, many places are standardizing on UTF-8, which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
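For comparison, a brief sketch of the same string in UTF-8. Whether UTF-8 suits the file format in the question is an assumption; this only illustrates the byte layout.

```python
text = 'abc\u6df7\u5408def.mp3'

data_utf8 = text.encode('utf-8')

# ASCII characters stay one byte each; each of the two CJK characters
# becomes three bytes, and there is no byte order to choose.
print(len(data_utf8))          # 16
print(data_utf8[3:9].hex())    # e6b7b7e59088

# Python's 'utf-8-sig' codec adds the optional UTF-8 signature (BOM).
data_sig = text.encode('utf-8-sig')
print(data_sig[:3].hex())      # efbbbf

print(data_utf8.decode('utf-8') == text)   # True
```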

tripleee