md5sum of rubyzip output string and written file differs

Question

I'm using rubyzip-1.2.0 with ruby 2.2.1 to generate a zip file containing a single file (in this case, a python script). The content file does not change, and the md5sum of the generated zip string remains the same, but once I write the zip string to a file and read it back, the length increases and the md5sum is different every time. This happens whether I use File.open(zip_file, 'wb') {} or IO.binwrite(zip_file, zip_string).

Just to add to the excitement, on OS X, the zip string and written file sizes are different (and of course, the md5sums differ), but on Ubuntu 14.04, the size remains consistent and the md5sums differ.

If I generate the file multiple times without pause, the checksums are (generally) the same; if I put in the sleep, they differ, which makes me wonder if rubyzip is writing some timestamp of some sort to the file?

I'm probably just missing some nuance of ruby binary file handling.

require 'zip'
require 'digest'

def update_zip_file(source_file)
  zip_file = source_file.sub(/py$/, 'zip')
  new_zip = create_lambda_zip_file(source_file)
  puts "Zip string length: #{new_zip.length}"
  md5_string = Digest::MD5.new
  md5_string.update IO.binread(zip_file)
  puts "Zip string MD5: #{md5_string.hexdigest}"
  File.open(zip_file, 'wb') do |f|
    puts "Updating #{zip_file}"
    f.write new_zip
  end
  puts "New file size: #{File.size(zip_file)}"
  md5_file_new = Digest::MD5.new
  md5_file_new.update IO.binread(zip_file)
  puts "New file MD5: #{md5_file_new.hexdigest}"
end

def create_lambda_zip_file(source_file)
  zip_file = source_file.sub(/py$/, 'zip')
  zip = Zip::OutputStream.write_buffer do |zio|
    zio.put_next_entry(File.basename(source_file))
    zio << File.read(source_file)
  end
  zip.string
end

(1..3).each do
  update_zip_file('test.py')
  sleep 2
end

Output on OS X:

Zip string length: 973
Zip string MD5: 2578d03cecf9539b046fb6993a87c6fd
Updating test.zip
New file size: 1019
New file MD5: 03e0aa2d345cac9731d1482d2674fc1e
Zip string length: 973
Zip string MD5: 03e0aa2d345cac9731d1482d2674fc1e
Updating test.zip
New file size: 1019
New file MD5: bb6fca23d13f1e2dfa01f93ba1e2cd16
Zip string length: 973
Zip string MD5: bb6fca23d13f1e2dfa01f93ba1e2cd16
Updating test.zip
New file size: 1019
New file MD5: 3d27653fa1662375de9aa4b6d2a49358

Output on Ubuntu 14.04:

Zip string length: 1020
Zip string MD5: 4a6f5c33b420360fed44c83f079202ce
Updating test.zip
New file size: 1020
New file MD5: 0cd8a123fe7f73be0175b02f38615572
Zip string length: 1020
Zip string MD5: 0cd8a123fe7f73be0175b02f38615572
Updating test.zip
New file size: 1020
New file MD5: 0a010e0ae0d75e5cde0c4c4ae098d436
Zip string length: 1020
Zip string MD5: 0a010e0ae0d75e5cde0c4c4ae098d436
Updating test.zip
New file size: 1020
New file MD5: e91ca00a43ccf505039a9d70604e184c

Any explanation or workaround? I want to make sure the zip file contents differ before rewriting the file.

Edited to fix file md5sum and update output.

EDIT: In fact, rubyzip does put the current timestamp in each entry (why?). If I monkey patch it so I can manipulate the entry attributes, the zip string's md5sum remains constant.

module Zip
  class OutputStream
    attr_accessor :entry_set
  end

  class Entry
    attr_accessor :time
  end
end

...

def create_lambda_zip_file(source_file)
  zip_file = source_file.sub(/py$/, 'zip')
  zip = Zip::OutputStream.write_buffer do |zio|
    zio.put_next_entry(File.basename(source_file))
    zio << File.read(source_file)
    zio.entry_set.each {|e| puts e.time = Zip::DOSTime.at(File.mtime(source_file).to_i)}
  end
  zip.string
end

Does the encoding of the file change? For example from ISO8859 to UTF-8. — spickermann, Aug 05 '16 at 06:37

Frederick Cheung · Answer 1 · 2016-08-05T08:40:28.097


8caba7d65b81501f3b65eca199c28ace is the MD5 sum of the string "test.zip": you've hashed the file name rather than the file's contents.

The difference in length is probably due to String#length returning the number of characters (codepoints) in the string, whereas File.size counts bytes. The String#bytesize method should return the same value as File.size.
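A minimal sketch of the length/bytesize distinction, using a short made-up string rather than real zip data. The helper name `char_and_byte_counts` is hypothetical and only there for illustration:

```ruby
# Hypothetical helper: returns [character count, byte count] for a string.
def char_and_byte_counts(str)
  [str.length, str.bytesize]
end

# "é" is one character but two bytes in UTF-8.
text = "caf\u00E9"
chars, bytes = char_and_byte_counts(text)
puts "length:   #{chars}"   # 4
puts "bytesize: #{bytes}"   # 5

# The same bytes tagged as binary: every byte now counts as one character.
binary = text.dup.force_encoding(Encoding::BINARY)
puts "binary length: #{binary.length}"   # 5
```

File.size corresponds to the second number, which is why a string tagged as UTF-8 can report a smaller length than the file it was written to.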

On my machine (OS X, ruby 2.3.1) the string returned from zip claimed to have the encoding UTF-8, which explains why the length wasn't the same as the number of bytes. The string isn't actually valid UTF-8 though; I'd consider this a bug. Either different versions or possibly locale-related environment variables have resulted in your Linux machine not pretending the zip data is UTF-8.

Using force_encoding to change the encoding to ASCII-8BIT may help.
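A rough sketch of how that could look, together with a byte-level check before rewriting the file. It uses stand-in bytes rather than a real archive, and `zip_changed?` is a hypothetical helper name, not part of rubyzip. Note that retagging the encoding only affects what String#length reports; the digest is computed over the bytes either way, so it will not hide a changing timestamp inside the archive.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the file is missing or its bytes differ from data.
def zip_changed?(path, data)
  return true unless File.exist?(path)
  Digest::MD5.hexdigest(data) != Digest::MD5.file(path).hexdigest
end

# Stand-in for the zip string (not a real archive), tagged as UTF-8.
data = "PK\x03\x04\xC3\xA9".dup.force_encoding(Encoding::UTF_8)

# Retag the same bytes as binary; no bytes are modified.
binary = data.dup.force_encoding(Encoding::BINARY)
puts binary.encoding                      # ASCII-8BIT
puts binary.length == binary.bytesize     # true

Tempfile.create('demo') do |f|
  f.binmode
  f.write(binary)
  f.flush
  puts File.size(f.path) == binary.length # true
  puts zip_changed?(f.path, binary)       # false: same bytes on disk
  puts zip_changed?(f.path, binary + 'x') # true: contents differ
end
```

With the string tagged as binary, length and File.size agree, and the helper compares like with like.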

edited Aug 05 '16 at 08:40

answered Aug 05 '16 at 07:39

Frederick Cheung


Oops, the md5sum of the filename rather than contents happened when I was condensing the code for posting. Will edit the original post. It also masked the original issue I saw, which was not only a change in length, but a moving target on the file content checksum. What's the proper encoding for the zip file? – Karen B Aug 05 '16 at 07:50
ASCII-8BIT (or its synonym binary) – Frederick Cheung Aug 05 '16 at 08:39
