
I am trying to read a big AWS S3 compressed object (gz). I don't want to read the whole object; I want to read it in parts so that I can process the uncompressed data in parallel. I am reading it with a GetObjectRequest and the "Range" header, where I set a byte range. However, when I give a byte range in the middle of the object, e.g. (100, 200), it fails with "Not in GZIP format". The reason for the failure is that the ranged AWS request returns a raw stream, and when I wrap it in a GZIPInputStream the constructor fails: GZIPInputStream expects the first two bytes of the stream to be the gzip signature (GZIP_MAGIC = 0x8b1f), which is not present in a mid-file range.

   GetObjectRequest rangeObjectRequest = new GetObjectRequest(<<Bucket>>, <<Key>>)
           .withRange(100, 200); // fetch only bytes 100-200 of the object
   S3Object object = s3Client.getObject(rangeObjectRequest);
   S3ObjectInputStream rawData = object.getObjectContent();
   InputStream data = new GZIPInputStream(rawData); // throws "Not in GZIP format"
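Incidentally, you can confirm what is happening by peeking at the first two bytes of the ranged stream before wrapping it. This is just a diagnostic sketch (looksLikeGzip is a made-up helper name): GZIPInputStream.GZIP_MAGIC is the two-byte signature 0x8b1f, stored on disk little-endian as 0x1f 0x8b, and a range starting at offset 100 won't contain it.

   import java.io.IOException;
   import java.io.InputStream;
   import java.util.zip.GZIPInputStream;

   // Diagnostic: does this stream start with the gzip signature?
   static boolean looksLikeGzip(InputStream in) throws IOException {
       int b1 = in.read();  // should be 0x1f
       int b2 = in.read();  // should be 0x8b
       return ((b2 << 8) | b1) == GZIPInputStream.GZIP_MAGIC; // 0x8b1f
   }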

Can anyone guide me toward the right approach?

Maverick

1 Answer


GZIP is a compression format in which each byte in the file depends on all of the bytes that precede it, which means that you can't pick an arbitrary byte range out of the file and make sense of it.

If you need to read byte ranges, you'll need to store it uncompressed.
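For instance, with the object stored as plain bytes, the same kind of ranged GET just works. A minimal sketch, reusing the question's placeholder bucket/key:

   GetObjectRequest rangeRequest = new GetObjectRequest(<<Bucket>>, <<Key>>)
           .withRange(100, 200); // bytes 100-200 of the uncompressed object
   try (S3Object object = s3Client.getObject(rangeRequest);
        InputStream part = object.getObjectContent()) {
       byte[] chunk = part.readAllBytes(); // Java 9+; plain bytes, no gzip header needed
       // hand `chunk` off to a worker for processing
   }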

You could also create your own file storage format that stores chunks of the file as separately-compressed blocks (a sketch of this idea follows below). One way to do this is with the ZIP format, where each entry in the archive represents a block of a specific size. But you'd need to implement your own ZIP directory reader to make that work with ranged reads.
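As a rough illustration of the separately-compressed-blocks idea (not the ZIP variant, just independent gzip members plus a simple offset index; buildBlockIndex is a hypothetical helper):

   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.zip.GZIPOutputStream;

   // Compress each fixed-size block independently and record its
   // (offset, length) in the concatenated output. A ranged GET for one
   // block's bytes then yields a complete, self-contained gzip member
   // that GZIPInputStream can decompress on its own.
   static List<long[]> buildBlockIndex(byte[] data, int blockSize,
                                       ByteArrayOutputStream out) throws IOException {
       List<long[]> index = new ArrayList<>(); // {offset, length} per block
       for (int pos = 0; pos < data.length; pos += blockSize) {
           long start = out.size();
           int len = Math.min(blockSize, data.length - pos);
           try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
               gz.write(data, pos, len);
           } // close() writes this member's gzip trailer; closing a
             // ByteArrayOutputStream is a no-op, so we can keep appending
           index.add(new long[]{start, out.size() - start});
       }
       return index;
   }

You'd upload out.toByteArray() as the S3 object and keep the index alongside it (e.g. as a small metadata object); a worker can then call withRange(offset, offset + length - 1) for its block and wrap just those bytes in a GZIPInputStream, since each block is a complete gzip stream.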

Parsifal
  • Does it imply I cannot read data in a byte range for a gzip S3 object? My objective in reading the data in chunks was to process the uncompressed data in parallel. "If you need to read byte ranges, you'll need to store it uncompressed." I'm not sure I actually understood this statement. I actually want to decompress the data and then process it. – Maverick May 28 '20 at 13:32
  • @Maverick - That's exactly what it implies. If you want to read byte ranges from a file, that file needs to be stored uncompressed. So you'll need to download it, un-GZIP it, and upload the uncompressed version. – Parsifal May 28 '20 at 16:52
  • If you're talking terabytes of data in a single file, and you want to keep it compressed to save on storage charges, you could pre-build the splits and compress them individually. – Parsifal May 28 '20 at 16:53
  • Yes, it seems to be this way. I tried to hack around the header bytes, but the blocks are difficult to work with. https://jvns.ca/blog/2013/10/23/day-15-how-gzip-works/ – Maverick May 29 '20 at 10:31
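To make Parsifal's "pre-build the splits" comment concrete: if each split is uploaded as its own gzip object, workers can fetch and decompress them independently. A rough sketch under that assumption (the key names and pool size here are made up):

   import java.io.InputStream;
   import java.util.List;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.zip.GZIPInputStream;

   ExecutorService pool = Executors.newFixedThreadPool(4);
   for (String key : List.of("splits/part-00000.gz", "splits/part-00001.gz")) {
       pool.submit(() -> {
           // Each split is a complete gzip file, so GZIPInputStream is happy.
           try (S3Object obj = s3Client.getObject(<<Bucket>>, key);
                InputStream in = new GZIPInputStream(obj.getObjectContent())) {
               // process the uncompressed bytes of this split...
           }
           return null; // Callable, so the lambda may throw IOException
       });
   }
   pool.shutdown();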