The problem: I have older code that uses a Python 2 'str' and compresses that string with gzip. I want to get the same gzip output from the same string in Python 3, but I can't manage to make it work.

Python 2 code

from StringIO import StringIO
from gzip import GzipFile

# input_buffer is a str (a byte string in Python 2)
string_buffer = StringIO()
gzip_file = GzipFile(fileobj=string_buffer, mode='w', compresslevel=6)
gzip_file.write(input_buffer)
gzip_file.close()
out_buffer = string_buffer.getvalue()

I then tried to migrate the same code to Python 3, expecting the exact same result.

Python 3 code

from io import BytesIO
from gzip import GzipFile

# input_buffer is the exact same string that I have in Py2
string_buffer = BytesIO()
gzip_file = GzipFile(fileobj=string_buffer, mode='w', compresslevel=6)
# my attempt: encode the str as UTF-8 before compressing
gzip_file.write(bytes(input_buffer, 'utf-8'))
gzip_file.close()
out_buffer = string_buffer.getvalue()

What I've noticed is that once I convert the 'str' to bytes, extra characters are added. These characters are then compressed and show up in the final result, even after I decode the output. Also, decoding without the 'ignore' flag fails because some characters are bigger than expected.

Any solution for my problem?

To summarize: I have a str, and I want gzip compression in Py2 and Py3 to produce the exact same output. In practice it doesn't work, at least with what I've tried.

Thanks

One problem I see is that even though the strings have the same values, they are represented differently, and I want the result to look exactly as it does in Python 2.

Python3
input_buffer='+\n\x01I\x12Default_Source©$c1f33163-ff63-13e6-bd74-d90d67f22ac4Ñ\x06\x80\x9dº\x9fÌVÐ\x07\x02Ë\x08\n\x01)$'
out_buffer =b'\x1f\x8b\x08\x00\x00x\xb0X\x02\xff\xd3\xe6b\xf4\x14rIMK,\xcd)\x89\x0f\xce/-JN=\xb4R%\xd90\xcd\xd8\xd8\xd0\xccX7-\rH\x18\x1a\xa7\x9a\xe9&\xa5\x98\x9b\xe8\xa6X\x1a\xa4\x98\x99\xa7\x19\x19%&\x9b\x1c\x9e\xc8v\xa8\xe1\xd0\xdcC\xbb\x0e\xcd?\xdc\x13vx\x02;\xd3\xe1n\x0e.FM\x15\x00\x03&\xcf\x15S\x00\x00\x00'

Python2
input_buffer='+\n\x01I\x12Default_Source\xa9$c1f33163-ff63-13e6-bd74-d90d67f22ac4\xd1\x06\x80\x9d\xba\x9f\xccV\xd0\x07\x02\xcb\x08\n\x01)$'
out_buffer ='\x1f\x8b\x08\x00\xae|\xb0X\x02\xff\xd3\xe6b\xf4\x14rIMK,\xcd)\x89\x0f\xce/-JN]\xa9\x92l\x98fllhf\xac\x9b\x96\x06$\x0c\x8dS\xcdt\x93R\xccMtS,\rR\xcc\xcc\xd3\x8c\x8c\x12\x93M.\xb25\xcc\xdd5\xffL\xd8\x05v\xa6\xd3\x1c\\\x8c\x9a*\x00\xe9l\xf0\xeaJ\x00\x00\x00'
TuringTux
Mark
  • Can you give an example of an `input_buffer` that is producing the problem? – onlynone Feb 24 '17 at 18:08
  • gzip files are binary files. It doesn't make sense to decode the bytes in `string_buffer` as utf-8. – Daniel Feb 24 '17 at 18:10
  • @onlynone I posted an example. Even though they look different, that's the way the same string is represented in Py3 vs Py2 but at least for the out_buffer I am only interested to see the result as in Py2 – Mark Feb 24 '17 at 18:31
  • @onlynone I put the wrong example, I'll put the correct one soon – Mark Feb 24 '17 at 18:32
  • @onlynone I've attached an example. The input strings look different but they are the same. The reason is the way 'str' differs in Py2 vs Py3, but if you go character by character and call ord(char) on each, they will be the same. – Mark Feb 24 '17 at 18:40
  • I tested your code, with the input as "Hello World". Your code works fine, but your debugging is flawed. As Daniel noted, it makes no sense to try and `.decode('utf-8')` the bytes produced by gzip compression. You have the bytes, keep them that way. In python2 you didn't have to understand bytes versus characters, and sometimes could get away with it. In python3 you can't get away with it. – dsh Feb 24 '17 at 18:40
  • @dsh please try my example. The problem appears when a character is represented different in Py3 vs Py2 even though it has the same ord – Mark Feb 24 '17 at 18:41
  • I see the data you just posted. Your input buffers do not look similar ... it appears that you may have additional encoding issues with your environment and your source code. Specifically, your py2 input_buffer is a sequence of bytes, but your py3 input_buffer is characters that need an accompanying encoding. Perhaps your source file doesn't specify to python what encoding you wrote it in? – dsh Feb 24 '17 at 18:43
  • @dsh The strings are the same: © is \xa9, but the problem is that Py2 can't show that and Py3 does because it supports unicode – Mark Feb 24 '17 at 18:46
  • In Python 2, `input_buffer` is bytes, and your characters are latin1-encoded. In Python 3 you have a string, with unicode, which you encode as utf-8. To get the same result, you have to encode to latin1 in Python 3: `gzip_file.write(bytes(input_buffer, 'latin1'))` – Daniel Feb 24 '17 at 18:47
  • @daniel You are the man! It solved my issue. Thanks a lot! Not sure how to mark your answer as the correct one, but it is. Thanks again! – Mark Feb 24 '17 at 18:53
  • Daniel pegged it first. Your input to gzip is different. \xa9 is not a valid UTF-8 byte sequence. `UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte` – dsh Feb 24 '17 at 18:54
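
The point made in the last few comments can be checked directly in Python 3. The following is a minimal sketch, using only the single character © (code point 0xA9) from the question's input; the variable names are illustrative and not part of the original code. It shows that latin1 keeps one byte per character, while utf-8 emits an extra leading byte, which is where the "extra characters" come from.

```python
# '©' is code point U+00A9; Python 2's byte '\xa9' is its latin1 encoding.
text = "\xa9"

latin1_bytes = text.encode("latin1")  # one byte per code point below 256
utf8_bytes = text.encode("utf-8")     # two bytes for this code point

print(latin1_bytes)  # b'\xa9'
print(utf8_bytes)    # b'\xc2\xa9'

# The lone byte 0xA9 is not valid UTF-8, so decoding it as UTF-8 fails.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)  # True
```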

1 Answer

In Python 2, input_buffer is a byte string, and its characters are latin1-encoded. In Python 3 you have a Unicode string, which you encode as utf-8. To get the same result, you have to encode to latin1 in Python 3:

from io import BytesIO
from gzip import GzipFile

input_buffer = '+\n\x01I\x12Default_Source©$c1f33163-ff63-13e6-bd74-d90d67f22ac4Ñ\x06\x80\x9dº\x9fÌVÐ\x07\x02Ë\x08\n\x01)$'
string_buffer = BytesIO()
with GzipFile(fileobj=string_buffer, mode='w', compresslevel=6) as gzip_file:
    gzip_file.write(bytes(input_buffer, 'latin1'))
out_buffer = string_buffer.getvalue()
Daniel
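
One further detail when comparing outputs byte for byte: the gzip header stores a modification timestamp in bytes 4 to 7, and those bytes differ between the two `out_buffer` values posted in the question. `GzipFile` accepts an `mtime` argument that pins this field. The sketch below is for Python 3 and assumes a fixed `mtime=0`; the helper name `gzip_latin1` and the shortened sample string are illustrative, not from the original post.

```python
import gzip
from io import BytesIO


def gzip_latin1(text, mtime=0):
    """Compress text as latin1 bytes; mtime pins the header timestamp."""
    buf = BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=6, mtime=mtime) as f:
        f.write(text.encode("latin1"))
    return buf.getvalue()


# Shortened sample containing some of the non-ASCII characters from the question.
sample = "Default_Source\xa9$\xd1\x06\x80"

first = gzip_latin1(sample)
second = gzip_latin1(sample)

# With a fixed mtime, repeated runs produce identical bytes.
print(first == second)

# Decompressing gives back the original latin1 bytes.
roundtrip = gzip.decompress(first)
print(roundtrip == sample.encode("latin1"))
```

Passing the same `mtime` on the Python 2 side should make the headers comparable as well, although the compressed payload can still vary if the two interpreters are linked against different zlib versions.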