I am trying to compress a large string in a C# client program (.NET 4) and send it to a server (Django, Python 2.7) using a PUT request. Ideally I want to use only the standard library at both ends, so I am trying to use gzip.

My C# code is:

public static string Compress(string s) {
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream()) {
        using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}

The Python code is:

s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)

with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()

It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird. If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.

I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?

eggbert

2 Answers

I think the main problem is that Encoding.Unicode in C# is UTF-16, so every ASCII character is sent as two bytes, the second of which is a zero byte. That zero byte is the "weird" character you see after each letter. You should be able to fix it by decoding the result on the Python side:

decompressed_data = gz.read().decode('utf-16')

After that, decompressed_data is a unicode object and you can treat it as text in any further work.
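To make the cause visible, here is a minimal sketch that simulates the C# side in Python and then reverses it. It assumes that Encoding.Unicode.GetBytes yields UTF-16 little-endian bytes without a byte-order mark; the helper names `simulate_csharp_compress` and `decompress_request` are invented for this illustration and are not part of the original code.

```python
import base64
import gzip
import io


def simulate_csharp_compress(text):
    """Mimic the C# Compress(): UTF-16-LE bytes -> gzip -> base64 (hypothetical helper)."""
    raw = text.encode('utf-16-le')  # assumption: matches Encoding.Unicode.GetBytes
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
        gz.write(raw)
    return base64.standard_b64encode(buf.getvalue())


def decompress_request(payload, encoding='utf-16-le'):
    """Reverse the steps: base64 -> gunzip -> decode bytes to text."""
    compressed = base64.standard_b64decode(payload)
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
        return gz.read().decode(encoding)


payload = simulate_csharp_compress(u'{"changed"}')
text = decompress_request(payload)
```

Naming 'utf-16-le' explicitly, rather than 'utf-16', may be the safer choice here: without a byte-order mark, Python's 'utf-16' decoder falls back to the machine's native byte order, which happens to be little-endian on most desktops and servers but is not guaranteed.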

UPDATE: This worked for me:

C Sharp

    static void Main(string[] args)
    {
        FileStream f = new FileStream("test", FileMode.CreateNew);
        using (StreamWriter w = new StreamWriter(f))
        {
            w.Write(Compress("hello"));
        }
    }
    public static string Compress(string s)
    {
        var bytes = Encoding.Unicode.GetBytes(s);
        using (var msi = new MemoryStream(bytes))
        using (var mso = new MemoryStream())
        {
            using (var gs = new GZipStream(mso, CompressionMode.Compress))
            {
                msi.CopyTo(gs);
            }
            return Convert.ToBase64String(mso.ToArray());
        }
    }

Python

import base64
import cStringIO
import gzip

f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)

with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
    print decompressed_data.decode('utf-16')

Without decode('utf-16') it printed this in the console:

>>>h e l l o

With it, the output was correct:

>>>hello

Good luck, hope this helps!

Paulo Bu
  • Thanks, that worked. Do I actually need to base64 encode it for a PUT request? – eggbert Jul 11 '13 at 13:24
  • I don't think so. As far as I know, base64 is only needed when you have to transfer binary data through a text-only channel such as XML or XMPP. HTTP can carry binary bodies, so you probably don't need to base64-encode it. – Paulo Bu Jul 11 '13 at 13:27
  • @eggbert When sending data through HTTP (using `POST`, etc.) you need to *encode* your request (if you're sending text, and not binary data). Since the format is `key1=value1&key2=value2&...`, you need to encode some characters like `=`, `&`, etc. For example, if you want to send `a=1&b=2` as the value of some key, it should be translated to `a%3D1%26b%3D2`. In C# you can use `HttpUtility.UrlEncode`; in Python, `urllib.urlencode`. – Oscar Mederos Jul 12 '13 at 06:20
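
On the base64 question raised in the comments above, a rough sketch of what the extra encoding step costs. The payload is built locally purely for the demonstration, and the 100-fold repetition is an arbitrary choice to get a body of non-trivial size.

```python
import base64
import gzip
import io

# Build a gzip payload in memory (sample data invented for this sketch).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(u'{"changed"}'.encode('utf-8') * 100)
compressed = buf.getvalue()

# base64 turns every 3 input bytes into 4 output characters (with padding).
encoded = base64.standard_b64encode(compressed)
expected_len = 4 * ((len(compressed) + 2) // 3)
```

The encoded form is about a third larger than the raw gzip bytes, so if the request body can be sent as binary, skipping base64 keeps more of the benefit of compressing in the first place.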

It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}

That's because you're using Encoding.Unicode to convert the string to bytes to start with.

If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.

If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)
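
As a quick illustration of the size difference, assuming mostly ASCII text such as the JSON in the question (the sample string is taken from the question; nothing else here is from the original code):

```python
text = u'{"changed"}'

utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')  # no BOM, assumed to match Encoding.Unicode

utf8_len = len(utf8_bytes)    # one byte per ASCII character
utf16_len = len(utf16_bytes)  # two bytes per ASCII character
```

For ASCII-only input the UTF-16 form is exactly twice as long. The gap narrows or reverses for text dominated by characters outside the ASCII range, which is why this holds only "in many cases". On the C# side the matching change would be Encoding.UTF8 in place of Encoding.Unicode, with 'utf-8' passed to decode() in Python.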

I'm also somewhat suspicious of this line:

buff = cStringIO.StringIO(s)

s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.
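
One possible alternative, sketched under the assumption that Python 2.6 or later is available, is io.BytesIO, which is explicitly a binary in-memory buffer. The helper name `gunzip_bytes` is made up for this example.

```python
import gzip
import io


def gunzip_bytes(compressed):
    """Decompress a gzip payload held in memory (hypothetical helper name)."""
    # io.BytesIO wraps the bytes in a binary file-like object for GzipFile.
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
        return gz.read()


# Round-trip check with a payload built locally for the demonstration.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'hello')
result = gunzip_bytes(buf.getvalue())
```

This behaves the same as the cStringIO version for this use, but it states the intent (bytes, not text) and also runs unchanged on Python 3, where cStringIO no longer exists.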

Jon Skeet
  • The only reason for doing buff = cStringIO.StringIO(s) is to turn it into a file object because gzip.GzipFile doesn't take a string – eggbert Jul 11 '13 at 13:02
  • @eggbert: But you don't really *have* a string - you have binary data. This is where I find Python frustrating, as it treats strings and binary data as if they were equivalent in too many places. It's probably fine, but it makes me cringe. – Jon Skeet Jul 11 '13 at 13:08
  • I have to say I concur with that. It's sometimes frustrating, but when you use it often you get used to it :) Python 3 tries to fix this in some fashion, but it is still very easy to get confused between binary data and a normal string :) (Although they are almost the same in the end, a bunch of bytes) – Paulo Bu Jul 11 '13 at 13:10