2

Any idea why the string compressed with Java's GZIPOutputStream is different from the one compressed with .NET's GZipStream?

Java Code:

package com.company;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Base64;

public class Main {

    public static void main(String[] args) {
        String myValue = "<Grid type=\"mailing_activity_demo\"><ReturnFields><DataElement>mailing_id</DataElement></ReturnFields></Grid>";

        int length = myValue.length();

        byte[] compressionResult = null;

        try {
            compressionResult = MyUtils.compress(myValue);
        } catch (IOException e) {
            e.printStackTrace();
        }

        byte[] headerBytes = ByteBuffer.allocate(4).putInt(length).array();

        byte[] fullBytes = new byte[headerBytes.length + compressionResult.length];

        System.arraycopy(headerBytes, 0, fullBytes, 0, headerBytes.length);

        System.arraycopy(compressionResult, 0, fullBytes, headerBytes.length, compressionResult.length);

        String result = Base64.getEncoder().encodeToString(fullBytes);
        System.out.println((result));
    }
}




package com.company;

import javax.sound.sampled.AudioFormat;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class MyUtils
{

    private static Object BitConverter;

    public static byte[] compress(String data) throws IOException
    {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(data);
        System.out.println(buffer.array().length);
        System.out.println(data.length());
        ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());

        GZIPOutputStream gzip = new GZIPOutputStream(bos);

        gzip.write(data.getBytes());

        gzip.close();

        byte[] compressed = bos.toByteArray();

        bos.close();

        return compressed;

    }

}

The string that I get from above is:

AAAAbB+LCAAAAAAAAP+zcS/KTFEoqSxItVXKTczMycxLj09MLsksyyypjE9Jzc1XsrMJSi0pLcpzy0zNSSm2s3FJLEl0zUnNTc0rsYPpyEyx0UcWt9FH1aMPssUOAKHavIJsAAAA

From the .NET C# code:

    public static string CompressData(string data)
    {
        using (MemoryStream memoryStream = new MemoryStream())
        {
            byte[] plainBytes = Encoding.UTF8.GetBytes(data);

            using (GZipStream zipStream = new GZipStream(memoryStream, CompressionMode.Compress, leaveOpen: true))
            {
                zipStream.Write(plainBytes, 0, plainBytes.Length);
            }

            memoryStream.Position = 0;

            byte[] compressedBytes = new byte[memoryStream.Length + CompressedMessageHeaderLength];

            Buffer.BlockCopy(
                BitConverter.GetBytes(plainBytes.Length),
                0,
                compressedBytes,
                0,
                CompressedMessageHeaderLength
            );

            // Copy the compressed bytes in after the header, which holds the length of the uncompressed UTF-8 message.
            memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

            string compressedXml = Convert.ToBase64String(compressedBytes);

            return compressedXml;
        }
    }

Compressed string:

bAAAAB+LCAAAAAAABACzcS/KTFEoqSxItVXKTczMycxLj09MLsksyyypjE9Jzc1XsrMJSi0pLcpzy0zNSSm2s3FJLEl0zUnNTc0rsYPpyEyx0UcWt9FH1aMPssUOAKHavIJsAAAA

Any idea what I am doing wrong in the Java code?

canton7
John
  • 3
    Why do you think you are doing anything "wrong"? Is it a valid gzip stream with about the same compression rate and size? It may well be a minor difference in some parameter. I wouldn't conclude you are doing something "wrong" just because the two results do not match 100%. In fact, I would have expected them to differ. – Fildor May 06 '21 at 15:48
  • The very first base64 character that's different is part of the header which you write to the stream -- nothing to do with gzip. You're the only one who can determine why `plainBytes.Length` is different in those two cases. – canton7 May 06 '21 at 15:49
  • See : https://www.journaldev.com/966/java-gzip-example-compress-decompress-file?force_isolation=true – jdweng May 06 '21 at 15:52
  • 2
    Actually, that looks like an endianness issue? The length is being written into the header with different endiannesses. .NET's BitConverter.GetBytes will use your machine's endianness, which is little-endian for x86, but Java's ByteBuffer defaults to big-endian. Either configure the ByteBuffer to use little endian, or tell .NET to use big endian – canton7 May 06 '21 at 15:55
  • I rolled your question back to a version which includes the compressed strings, and the different code which .NET and Java are using to write the header. These were both important pieces of information which were necessary in order to solve this problem. Editing the question so that it is no longer solvable, and so that the accepted answer doesn't make sense, doesn't help anyone. – canton7 May 07 '21 at 08:08

1 Answer

2

To add to @MarcGravell's answer about differences in GZip encoding, it's worth noting that it looks like you've got an endianness issue with your header bytes, which will be messing up a decoder.

Your header is 4 bytes, which encodes to 5 1/3 base64 characters. The .NET version outputs bAAAAB (the first 4 bytes of which are 6c 00 00 00), whereas the Java version outputs AAAAbB (the first 4 bytes of which are 00 00 00 6c). The fact that the b is moving by around 5 characters among a sea of A's is your first clue (A represents 000000 in base64), but decoding it makes the issue obvious.
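If you want to check this decoding yourself, here is a small Java sketch. The class and method names are made up for illustration. It decodes only the first 8 base64 characters of each string (6 bytes, enough to cover the 4-byte header) and prints the header in hex, then reads it back as an int with each byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

// Hypothetical helper class, only for inspecting the 4-byte length header.
final class HeaderDecodeDemo {

    // Decode the first 8 base64 characters (6 bytes) and keep the 4 header bytes.
    static byte[] headerBytes(String base64) {
        byte[] decoded = Base64.getDecoder().decode(base64.substring(0, 8));
        byte[] header = new byte[4];
        System.arraycopy(decoded, 0, header, 0, 4);
        return header;
    }

    // Format bytes as space-separated lowercase hex, e.g. "6c 00 00 00".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Interpret the 4 header bytes as an int using the given byte order.
    static int readInt(byte[] header, ByteOrder order) {
        return ByteBuffer.wrap(header).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] fromDotNet = headerBytes("bAAAAB+LCAAA"); // start of the .NET string
        byte[] fromJava = headerBytes("AAAAbB+LCAAA");   // start of the Java string

        System.out.println(toHex(fromDotNet)); // 6c 00 00 00
        System.out.println(toHex(fromJava));   // 00 00 00 6c

        System.out.println(readInt(fromDotNet, ByteOrder.LITTLE_ENDIAN)); // 108
        System.out.println(readInt(fromJava, ByteOrder.BIG_ENDIAN));      // 108
        // Reading the Java header with the wrong order gives a nonsense length:
        System.out.println(readInt(fromJava, ByteOrder.LITTLE_ENDIAN));   // 1811939328
    }
}
```

Both headers carry the same value, 108 (0x6c), which matches the length of the XML in the question. Only the byte order differs.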

.NET's BitConverter uses your machine architecture's endianness, which on x86 is little-endian (check BitConverter.IsLittleEndian). Java's ByteBuffer defaults to big-endian, but is configurable. This explains why one is writing little-endian, and the other big-endian.
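On the Java side you can see the default for yourself. This is a minimal sketch with a hypothetical class name: it prints the order a freshly allocated ByteBuffer uses, the platform's native order (the closest Java analogue to what BitConverter follows), and the bytes produced for 108 under each order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical demo class showing how byte order changes the encoding of an int.
final class ByteOrderDemo {

    // Encode a 32-bit int into 4 bytes using an explicit byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        // A new ByteBuffer is always big-endian, regardless of the machine.
        System.out.println(ByteBuffer.allocate(4).order()); // BIG_ENDIAN

        // The machine's own order; this depends on the hardware you run on.
        System.out.println(ByteOrder.nativeOrder());

        System.out.println(Arrays.toString(encode(108, ByteOrder.BIG_ENDIAN)));    // [0, 0, 0, 108]
        System.out.println(Arrays.toString(encode(108, ByteOrder.LITTLE_ENDIAN))); // [108, 0, 0, 0]
    }
}
```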

You'll want to decide on an endianness, and then align both sides. You can change the ByteBuffer to use little-endian by calling .order(ByteOrder.LITTLE_ENDIAN) (the constant lives on java.nio.ByteOrder, not on ByteBuffer). In .NET, you can use BinaryPrimitives.WriteInt32BigEndian / BinaryPrimitives.WriteInt32LittleEndian to write with an explicit endianness if you're using .NET Core 2.1+, or use IPAddress.HostToNetworkOrder to switch endianness if necessary (depending on BitConverter.IsLittleEndian) if you're stuck on something earlier.
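As a rough sketch of the Java side, assuming you settle on little-endian to match what the .NET code produces on x86: the class and method names below are hypothetical, not taken from your code. It also takes the length from the UTF-8 bytes rather than from String.length(), to mirror plainBytes.Length in the .NET version; the two only differ for non-ASCII input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: 4-byte little-endian length header followed by the gzip bytes.
final class LittleEndianHeaderSketch {

    // Gzip a byte array entirely in memory.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Header (uncompressed UTF-8 length, little-endian) + gzip payload, base64-encoded.
    static String compressWithHeader(String data) {
        byte[] plain = data.getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(plain);

        ByteBuffer out = ByteBuffer.allocate(4 + compressed.length)
                .order(ByteOrder.LITTLE_ENDIAN); // ByteBuffer defaults to big-endian
        out.putInt(plain.length);
        out.put(compressed);

        return Base64.getEncoder().encodeToString(out.array());
    }

    public static void main(String[] args) {
        String xml = "<Grid type=\"mailing_activity_demo\"><ReturnFields>"
                + "<DataElement>mailing_id</DataElement></ReturnFields></Grid>";
        // The result should now begin with bAAAAB+L, like the .NET string.
        System.out.println(compressWithHeader(xml));
    }
}
```

The bytes after the header may still differ slightly from .NET's output, since the two gzip implementations are free to fill in header fields and compress differently. Either side should still be able to decompress the other's payload.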

canton7
  • @MarcGravell You've deleted your answer! I'd consider undeleting it -- it's a good explanation for the rest of the differences – canton7 May 06 '21 at 16:09
  • Changed it to the following, but still getting the same string. Is that how I should be doing? ByteBuffer buffer = StandardCharsets.UTF_8.encode(data); buffer.order(ByteOrder.LITTLE_ENDIAN); – John May 06 '21 at 16:15
  • 1
    @John the bit you care about is `byte[] headerBytes = ByteBuffer.allocate(4).putInt(length).array();` -- that needs to be little-endian. That's the bit which writes the header bytes, and which is using the wrong endianness. E.g. `byte[] headerBytes = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(length).array();`. There doesn't seem to be any problem with the gzip'd bytes (and I wouldn't expect there to be, as the `GZIPOutputStream` will be dealing with individual bytes, where endianness is irrelevant) – canton7 May 06 '21 at 16:17
  • @John Good to hear! Please consider upvoting / accepting this answer, if it solved your problem – canton7 May 06 '21 at 16:56