4

I have a binary string that holds two gzip streams concatenated. (I am reading a binary log file in which two gzip files were concatenated together.)

In other words, I have the equivalent of:

require 'zlib'
require 'stringio'

File.open('t1.gz', 'wb') do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write 'part one'
  gz.close
end

File.open('t2.gz', 'wb') do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write 'part 2'
  gz.close
end


contents1 = File.open('t1.gz', "rb") {|io| io.read }
contents2 = File.open('t2.gz', "rb") {|io| io.read }

c = contents1 + contents2

gz = Zlib::GzipReader.new(StringIO.new(c))

gz.each do |l|
  puts l
end

When I try to unzip the combined string, I only get the first string. How do I get both strings?

Tihom
  • First off, it would help to have the actual code you are using, rather than some approximation of it. Secondly, how are you unzipping the gzipped data? – Frederick Cheung Jan 10 '12 at 16:00
  • @FrederickCheung He's unzipping through GzipReader. And this code is probably his actual code, just without unnecessary and confusing business logic. – WattsInABox Apr 16 '14 at 14:12

4 Answers

3
while c
  io = StringIO.new(c)
  gz = Zlib::GzipReader.new(io)
  gz.each do |l|
    puts l
  end
  c = gz.unused   # take unprocessed portion of the string as the next archive
end

See ruby-doc.
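Putting this together with the question's setup, here is a self-contained sketch of the same loop, building the concatenated data in memory instead of on disk (the payloads `part one` / `part 2` are taken from the question):

```ruby
require 'zlib'
require 'stringio'

# Compress a string into a single gzip member, in memory.
def gzip(str)
  buf = StringIO.new
  gz = Zlib::GzipWriter.new(buf)
  gz.write str
  gz.close
  buf.string
end

# Two gzip members concatenated, like the question's log file.
c = gzip('part one') + gzip('part 2')

parts = []
while c
  gz = Zlib::GzipReader.new(StringIO.new(c))
  parts << gz.read
  c = gz.unused   # nil once the final member has been consumed
  gz.finish
end

parts  # => ["part one", "part 2"]
```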

undur_gongor
1

The gzip format uses a footer which contains checksums for the previously compressed data. Once the footer is reached, there can't be any more data for the same gzipped data stream.

It seems the Ruby gzip reader just finishes reading after the first footer it encounters, which is technically correct, although many other implementations raise an error if there is still more data. I don't know Ruby's exact behavior here.

The point is, you can't just concatenate the raw byte streams and expect things to work. You have to actually adapt the streams and rewrite the headers and footers. See this question for details.

Or you could uncompress the streams, concatenate them and re-compress it, but that obviously creates some overhead...
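The round-trip approach can be sketched with `Zlib.gunzip`/`Zlib.gzip` (available since Ruby 2.4). This assumes you have each original member as a separate blob; `merge_gzips` is a hypothetical helper name, not an API:

```ruby
require 'zlib'

# Decode each member, join the plain text, and compress once,
# producing a single-member gzip stream.
def merge_gzips(blob1, blob2)
  Zlib.gzip(Zlib.gunzip(blob1) + Zlib.gunzip(blob2))
end

merged = merge_gzips(Zlib.gzip('part one'), Zlib.gzip('part 2'))
Zlib.gunzip(merged)  # => "part onepart 2"
```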

Holger Just
  • I didn't write the log file. I am just trying to read it. I would like to uncompress both gz that have been concatenated. I would like to avoid recreating a third gz which the question you linked to is about. – Tihom Jan 10 '12 at 16:24
  • 1
    @Tihom: According to http://en.wikipedia.org/wiki/Gzip, concatenating several GZIP files is perfectly valid: "Although its file format also allows for multiple such streams to be concatenated (zipped files are simply decompressed concatenated as if they were originally one file), ..." Of course, this is something different than compressing to files in one GZIP archive. – undur_gongor Jan 10 '12 at 16:58
  • 1
    This answer is not correct. The gzip specification in RFC 1952 explicitly states that gzip streams _can_ be "just" concatenated to make a valid gzip stream, and that a compliant decompressor must decompress all of them. – Mark Adler Mar 20 '18 at 20:32
  • Still (at least at the time of writing the answer), Ruby ignored any trailing data after the first stream. – Holger Just Mar 21 '18 at 09:01
0

The accepted answer didn't work for me. Here's my modified version. Notice the different usage of gz.unused.

Also, you should call finish on the GzipReader instance to avoid memory leaks.

# gzcat-test.rb
require 'zlib'
require 'stringio'
require 'digest/sha1'

# gzip -c /usr/share/dict/web2 /usr/share/dict/web2a > web-cat.gz
io = File.open('web-cat.gz', 'rb')
# or, if you don't care about memory usage:
# io = StringIO.new File.read 'web-cat.gz'

# these will be hashes: {orig_name: 'filename', data_arr: unpacked_lines}
entries=[]
loop do
  entries << {data_arr: []}
  # create a reader starting at io's current position
  gz = Zlib::GzipReader.new(io)
  entries.last[:orig_name] = gz.orig_name
  gz.each {|l| entries.last[:data_arr] << l }
  unused = gz.unused  # save this before calling #finish
  gz.finish

  if unused
    # Unused is not the entire remainder, but only part of it.
    # We need to back up since we've moved past the start of the next entry.
    io.pos -= unused.size
  else
    break
  end
end

io.close

# verify the data
entries.each do |entry_hash|
  p entry_hash[:orig_name]
  puts Digest::SHA1.hexdigest(entry_hash[:data_arr].join)
end

Run:

> ./gzcat-test.rb
"web2"
a62edf8685920f7d5a95113020631cdebd18a185
"web2a"
b0870457df2b8cae06a88657a198d9b52f8e2b0a

Our unpacked contents match the originals:

> shasum /usr/share/dict/web*
a62edf8685920f7d5a95113020631cdebd18a185  /usr/share/dict/web2
b0870457df2b8cae06a88657a198d9b52f8e2b0a  /usr/share/dict/web2a
Kelvin
0

This is the correct way to ensure the whole file is read. Even when unused returns nil, that doesn't mean the end of the original gzipped file has been reached.

File.open(path_to_file, 'rb') do |file|
  loop do
    gz = Zlib::GzipReader.new file
    puts gz.read

    unused = gz.unused
    gz.finish

    adjust = unused.nil? ? 0 : unused.length
    file.pos -= adjust
    break if file.pos == file.size
  end
end
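A quick way to exercise this loop is to write two gzip members into a temporary file first (the payloads `alpha` / `beta` below are just for illustration):

```ruby
require 'zlib'
require 'tempfile'

# Create a file containing two concatenated gzip members.
tmp = Tempfile.new(['cat-test', '.gz'])
tmp.binmode
tmp.write(Zlib.gzip('alpha'))
tmp.write(Zlib.gzip('beta'))
tmp.close

parts = []
File.open(tmp.path, 'rb') do |file|
  loop do
    gz = Zlib::GzipReader.new(file)
    parts << gz.read

    unused = gz.unused
    gz.finish

    # Rewind past any read-ahead so the next member starts cleanly.
    file.pos -= unused.nil? ? 0 : unused.length
    break if file.pos == file.size
  end
end
tmp.unlink

parts  # => ["alpha", "beta"]
```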
monde