Java CRC error when using a dictionary with GZIP

Question

This is honestly frustrating because I think I know the cause, but I cannot pinpoint where it happens in my code. For this assignment, we're supposed to read an input stream, split it into 128-byte blocks, and compress each block using the last 32 bytes of the previous block as a dictionary.

import java.io.*;
import java.util.zip.*;

public class TestCase
{
    protected static final int BLOCK_SIZE = 128;
    protected static final int DICT_SIZE = 32;

    public static void main(String[] args)
    {
        BufferedInputStream inBytes = new BufferedInputStream(System.in);
        byte[] buff = new byte[BLOCK_SIZE];
        byte[] dict = new byte[DICT_SIZE];
        int bytesRead = 0;

        try
        {
            DGZIPOutputStream compressor = new DGZIPOutputStream(System.out);
            bytesRead = inBytes.read(buff);

            if (bytesRead >= DICT_SIZE)
            {
                System.arraycopy(buff, 0, dict, 0, DICT_SIZE);
            }

            while(bytesRead != -1) 
            {
                compressor.write(buff, 0, bytesRead);              
                if (bytesRead == BLOCK_SIZE)
                {
                    System.arraycopy(buff, BLOCK_SIZE-DICT_SIZE, dict, 0, DICT_SIZE);
                    compressor.setDictionary(dict);
                }

                bytesRead = inBytes.read(buff);
            }
            compressor.flush();         
            compressor.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
            System.exit(-1);
        }
    }

    public static class DGZIPOutputStream extends GZIPOutputStream
    {
        public DGZIPOutputStream(OutputStream out) throws IOException
        {
            super(out);
        }

        public void setDictionary(byte[] b)
        {
            def.setDictionary(b);
        }

        public void updateCRC(byte[] input)
        {
            crc.update(input);
            System.out.println("Called!");
        }                       
    }
}

I'm literally off by one single byte. I know that write() updates the CRC for the byte array after it is called, so I THINK updateCRC is somehow being called twice, but I cannot for the life of me figure out where. Or maybe I'm off completely. It's just this one byte, and yet when I take out the dictionary, it works just fine, so... I'm really not sure.

EDIT: So when I compile and test it:

$ cat file.txt

hello world, how are you? 123efd4
KEYBOARDSMASHR#@Q)KF@_{KFSKFDS
000000000000000000000000000000000000000000000000000
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
pwfprejgewojgw
12345678901234567890
!@#$%^&*(!@#$%^&*(A

$ cat file.txt | java TestCase | gzip -d | cmp file.txt ; echo $?

gzip: stdin: invalid compressed data--crc error
file.txt - differ: byte 1, line 1
1

(ignore my choice of file, I was sleepy last night)

EDIT: Solved

If you add a `println` statement to `updateCRC()`, or use a debugger, then you'll soon know whether it's called twice or not. Can you explain how you know you are "off by one single byte", since we can't see what your input and output is? — DNA, Oct 27 '12 at 23:52
I think there are quite a few mistakes in your stream handling, such as calling `flush()` and `close()` directly after each other, and only switching on an exact comparison of the bytes read (`bytesRead == BLOCK_SIZE`). I also don't understand the first copy into the `dict` buffer. You might want to give a specific buffer size to your BufferedInputStream too (although if `available()` on `System.in` returned 0, a read would still return fewer bytes than the buffer size). — Maarten Bodewes, Oct 28 '12 at 00:09
I'd lay fairly good odds that you got the assignment wrong. It is probably to split it into 128K byte blocks, using the last 32K bytes from the previous block as a dictionary for the next block. Doing 128 and 32 bytes would not be useful and would result in extremely poor compression. — Mark Adler, Oct 28 '12 at 00:29
That is the assignment; I'm just using 128 and 32 bytes for now to get it up and running, since I don't have any large files on hand. The first dict buffer is there because the first block uses its own first 32 bytes as the dictionary. I can take it out, but it still doesn't work. As for the `bytesRead == BLOCK_SIZE` check, that is to avoid trying to create a dictionary from the last block, which may not have 32 bytes at all — user1777900, Oct 28 '12 at 00:33
Ah, never mind. I got it. Since owlstead commented on it, I noticed that there was an alignment issue. It was solved by just changing my `super(out)` to `super(out, flush)` so I could use the GZIPOutputStream constructor that has the `syncFlush` option. It looks like setting the mode to `Deflater.SYNC_FLUSH` solved the issue — user1777900, Oct 28 '12 at 01:22
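
A minimal sketch of the constructor change described in the last comment, assuming Java 7 or later, where `GZIPOutputStream(OutputStream, boolean syncFlush)` is available. The class name `SyncFlushSketch`, the helper methods, and the per-block `flush()` call are illustrative assumptions, not the asker's final code, and the sketch deliberately leaves out `setDictionary` so that it shows only the sync-flush part. It checks the round trip in-process rather than through `gzip -d`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SyncFlushSketch {
    static final int BLOCK_SIZE = 128;

    // Same idea as the subclass in the question, but built with the
    // (out, syncFlush) constructor instead of super(out).
    static class DGZIPOutputStream extends GZIPOutputStream {
        DGZIPOutputStream(OutputStream out) throws IOException {
            super(out, true); // flush() now uses Deflater.SYNC_FLUSH
        }
    }

    // Hypothetical helper: write the input in BLOCK_SIZE chunks and
    // flush after each one so every block is fully emitted.
    static byte[] compressInBlocks(byte[] input) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DGZIPOutputStream gz = new DGZIPOutputStream(sink)) {
            for (int off = 0; off < input.length; off += BLOCK_SIZE) {
                int len = Math.min(BLOCK_SIZE, input.length - off);
                gz.write(input, off, len);
                gz.flush();
            }
        } // close() finishes the stream and writes the CRC/length trailer
        return sink.toByteArray();
    }

    // Hypothetical helper: decompress with the standard GZIPInputStream,
    // which verifies the CRC in the trailer and throws on a mismatch.
    static byte[] decompress(byte[] gzipped) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("hello world, how are you? 123efd4\n");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compressInBlocks(original));
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

With `syncFlush` set to true, each `flush()` asks the deflater for a `Deflater.SYNC_FLUSH`, so the compressed bytes for a block are emitted and padded to a byte boundary before the next block is written. Whether a dictionary set between blocks then round-trips is not covered by this sketch.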
