Is there a safe way to run a diff on two zip compressed files?

Question

Seems this would not be a deterministic thing, or is there a way to do this reliably?

What are you wanting a diff of? The file listing (FileA exists in one but not the other). The files' contents (FileB in the first zip has these modifications compared to the FileB in the second zip). Or all of the above? eduffy's answer may work (in Linux) if you don't care about the contents. — JMD, Feb 25 '09 at 19:33
If you just care if the zipped files are the same then why not compare hashes? — EBGreen, Feb 25 '09 at 19:34
This is humorous. Someone asks a programming question and gets a lot of non programming answers. :) — EBGreen, Feb 25 '09 at 19:55
@Apple - You should probably post the technologies that you want to do this with. Specifically the platform and programming language that you plan to use. — EBGreen, Feb 25 '09 at 20:02
Is it a compressed file or archive/directory/folder? (There are different types of zip: gzip does single files, and works with tar to compress archives; pkzip does both in one program; etc ) — ctrl-alt-delor, Dec 03 '16 at 11:02

score 38 · Answer 1 · answered Feb 25 '09 at 19:29

If you're using gzip, you can do something like this:

# diff <(zcat file1.gz) <(zcat file2.gz)

eduffy

Well I need to do this programmatically, and I'm not running in a unix environment (unfortunately). – ApplePieIsGood Feb 25 '09 at 19:39
how is the solution in this answer not "programmatically" solving your problem? – Feb 25 '09 at 20:15
This is great to know about (I never knew you could pipe in two program streams to another program without making temporary files.) I was confused and running into bugs, though, until I realized you **cannot have a space between the < and the paren.** – Joshua Goldberg Aug 22 '13 at 14:59
Note that it also works with zipped files: ``diff <(zcat file1.zip) <(zcat file2.zip)`` – galath Jan 02 '17 at 15:03
Also note that the `<(someCommand)` syntax is not in [POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html). In [GNU Bash](https://www.gnu.org/software/bash/) this along with the `>(someCommand)` syntax is called [Process Substitution](https://www.gnu.org/software/bash/manual/html_node/Process-Substitution.html#Process-Substitution) and is not available on all platforms. – jotik Mar 23 '17 at 11:31
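
For readers without Bash process substitution (the asker mentions not being on Unix), roughly the same comparison can be sketched with Python's standard library. This is a minimal sketch, not part of the original answer: the helper name `gzip_diff` is hypothetical, and it assumes both files decompress to UTF-8 text small enough to hold in memory.

```python
import difflib
import gzip


def gzip_diff(path1, path2, encoding="utf-8"):
    """Return unified-diff lines for the decompressed text of two .gz files.

    Hypothetical helper: assumes both files decompress to text in `encoding`.
    Both files are read fully into memory.
    """
    with gzip.open(path1, "rt", encoding=encoding, errors="replace") as f1:
        lines1 = f1.readlines()
    with gzip.open(path2, "rt", encoding=encoding, errors="replace") as f2:
        lines2 = f2.readlines()
    # An empty list means the decompressed contents are identical.
    return list(difflib.unified_diff(lines1, lines2, fromfile=path1, tofile=path2))
```

Printing `"".join(gzip_diff("file1.gz", "file2.gz"))` would give output similar in spirit to `diff -u`.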

score 7 · Answer 2 · answered Feb 25 '09 at 19:28

Reliable: unzip both, diff.

I have no idea if that answer's good enough for your use, but it works.

Devin Jeanpierre

I'm looking to avoid opening and expanding and diffing, it could be more expensive. – ApplePieIsGood Feb 25 '09 at 19:38
Unfortunately, it's the only reliable way to do it. – Powerlord Feb 26 '09 at 16:15
@Powerlord: out of curiosity is eduffy's answer unreliable? Or just later than your comment? – orangepips Jan 28 '13 at 17:51
@orangepips It's still unzipping then diffing, with the added restriction that it's specific to gzip. Besides which, chaos's answer is a better solution for gzip-specific. – Powerlord Jan 29 '13 at 14:55

score 6 · Answer 3 · answered Jan 18 '17 at 15:53

zipcmp compares the zip archives zip1 and zip2 and checks if they contain the same files, comparing their names, uncompressed sizes, and CRCs. File order and compressed size differences are ignored.

sudo apt-get install zipcmp

Wender

Could you please explain the output of `zipcmp`? I got an entry line like `- 2380 d0c49aea c5-custom-product-5.2.0/wso2/runtime2/bin/bootstrap/logging.properties`. I know `-` indicates the relevant zip file, but what do `2380` and `d0c49aea` indicate? Thanks – Kasun Siyambalapitiya Aug 31 '17 at 00:52
"2380" is the uncompressed size of the entry; "d0c49aea" is the CRC-32 of the entry (not an MD5); "c5-custom-product-5.2.0/wso2/runtime2/bin/bootstrap/logging.properties" is the entry name. Look at the CRC: entries can be the same size but have different content. – Wender Jan 31 '18 at 15:52

score 6 · Answer 4 · answered Feb 26 '09 at 16:07

In general, you cannot avoid decompressing and then comparing. Different compressors will result in different DEFLATEd byte streams, which when INFLATEd result in the same original text. You cannot simply compare the DEFLATEd data, one to another. That will FAIL in some cases.
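
The point about differing DEFLATE streams can be illustrated with a small sketch using Python's `zlib` module. Two compression levels stand in here for "different compressors"; that substitution is an assumption made for illustration only.

```python
import zlib

data = b"the same original text\n" * 200

# Two DEFLATE encodings of identical input: level 0 stores the data
# uncompressed, level 9 compresses as hard as zlib can.
stored = zlib.compress(data, 0)
packed = zlib.compress(data, 9)

compressed_equal = stored == packed                                  # False
inflated_equal = zlib.decompress(stored) == zlib.decompress(packed)  # True
```

Comparing the compressed bytes reports a difference even though both streams inflate to the same text, which is why a byte-level comparison of compressed data can give false alarms.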

But in a ZIP scenario, a CRC32 is calculated and stored for each entry. So if you want to check files, you can simply compare the stored CRC32 associated with each DEFLATEd stream, with the caveats on the uniqueness properties of the CRC32 hash. It may fit your needs to compare the FileName and the CRC.

You would need a ZIP library that reads zip files and exposes those things as properties on the "ZipEntry" object. DotNetZip will do that for .NET apps.
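
The same idea can be sketched outside .NET with Python's standard `zipfile` module, which exposes the stored size and CRC-32 of each entry. The function names below are hypothetical, and this is a sketch under the caveat already stated above: matching CRC-32 values make equal content very likely, not certain.

```python
import zipfile


def zip_entry_index(path):
    """Map entry name -> (uncompressed size, CRC-32) from the central directory."""
    with zipfile.ZipFile(path) as zf:
        return {info.filename: (info.file_size, info.CRC)
                for info in zf.infolist() if not info.is_dir()}


def compare_zip_metadata(path1, path2):
    """Return (only_in_first, only_in_second, changed) as sorted name lists.

    No entry is decompressed. Matching size and CRC-32 makes equal content
    very likely, but CRC-32 collisions are possible.
    """
    a, b = zip_entry_index(path1), zip_entry_index(path2)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    changed = sorted(n for n in a.keys() & b.keys() if a[n] != b[n])
    return only_a, only_b, changed
```

Because only the archive's directory is read, this stays cheap even for large archives, and it ignores differences in compression method or entry order.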

score 4 · Answer 5 · answered Dec 19 '13 at 12:07

Actually gzip and bzip2 both come with dedicated tools for doing that.

With gzip:

$ zdiff file1.gz file2.gz

With bzip2:

$ bzdiff file1.bz2 file2.bz2

But keep in mind that for very large files, you might run into memory issues (I originally came here to find out about how to solve them, so I don't have the answer yet).
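
On the memory concern: if all that is needed is a same-or-different answer rather than a full diff, the two streams can be compared in fixed-size chunks. The sketch below uses Python's `gzip` module; the function name and the chunk size are assumptions, and it does not solve the harder problem of producing a line diff for very large files.

```python
import gzip


def gzip_contents_equal(path1, path2, chunk_size=1 << 20):
    """Return True if two .gz files decompress to identical bytes.

    Reads both streams in fixed-size chunks so memory use stays bounded.
    """
    with gzip.open(path1, "rb") as f1, gzip.open(path2, "rb") as f2:
        while True:
            c1 = f1.read(chunk_size)
            c2 = f2.read(chunk_size)
            if c1 != c2:
                return False
            if not c1:  # both streams ended at the same point
                return True
```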

score 3 · Answer 6 · answered Feb 25 '09 at 19:30

Beyond Compare has no problem with this.

Lieven Keersmaekers

I wonder if they expand it behind the scenes and diff? That's the thing, hard to say with an app what it does. – ApplePieIsGood Feb 25 '09 at 19:39
I'm pretty sure they expand behind the scenes. They have to to be able to show a side-by-side diff of two files from the zip archives. – Lieven Keersmaekers Feb 25 '09 at 19:42
It is proprietary, so who knows what it does? – ctrl-alt-delor Dec 03 '16 at 11:04
@Richard: you should reserve downvotes for answers that are wrong. The question was how to diff two zip compressed files. Beyond Compare might not be the answer you like but it's not wrong. – Lieven Keersmaekers Dec 03 '16 at 17:12
BC works with the zip file directly; it doesn't need to extract everything. Zips store file CRCs as part of the file header, so for "CRC" or "rules-based" comparisons we can compare many files without decompressing anything. For "binary" compares, when checking for "similar" files in a rules-based compare, or when opened in the file viewer, individual files will be decompressed. Small files are handled entirely in memory, large files may be stored in a temp directory. – Zoë Peterson Jan 18 '17 at 17:48

score 2 · Answer 7 · answered Dec 13 '10 at 13:17

This isn't particularly elegant, but you can use the FileMerge application that comes with Mac OS X developer tools to compare the contents of zip files using a custom filter.

Create a script ~/bin/zip_filemerge_filter.bash with contents:

#!/bin/bash
##
#  List the size, CRC-32 checksum, and file path of each file in a zip archive,
#  sorted in order by file path.
##
unzip -v -l "${1}" | cut -c 1-9,59-,49-57 | sort -k3
exit $?

Make the script executable (chmod +x ~/bin/zip_filemerge_filter.bash).

Open FileMerge, open the Preferences, and go to the "Filters" tab. Add an item to the list with: Extension:"zip", Filter:"~/bin/zip_filemerge_filter.bash $(FILE)", Display: Filtered, Apply*: No. (I've also added the filter for .jar and .war files.)

Then use FileMerge (or the command line "opendiff" wrapper) to compare two .zip files.

This won't let you diff the contents of files within the zip archives, but it will let you quickly see which files appear in only one archive and which files exist in both but have different content (i.e. different size and/or checksum).

ilvez · Answer 8 · 2015-12-23T12:23:33.990

I found relief with this simple Perl script: diffzips.pl

It recursively diffs every zip file inside the original zip, which is especially useful for different Java package formats: jar, war, and ear.

zipcmp uses a simpler approach and doesn't recurse into archived zips.
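
The recursive idea can be sketched in Python as well. This is not a port of diffzips.pl; it is an illustrative sketch with a hypothetical function name, an assumed list of nested-archive suffixes, and the assumption that entries fit in memory.

```python
import io
import zipfile

NESTED_SUFFIXES = (".zip", ".jar", ".war", ".ear")  # assumed list of archive types


def diff_zips(zip1, zip2, prefix=""):
    """Yield (status, name) for entries that differ, recursing into nested archives.

    zip1/zip2 may be paths or binary file objects. Entries are read fully
    into memory, so this sketch suits modestly sized archives.
    """
    with zipfile.ZipFile(zip1) as z1, zipfile.ZipFile(zip2) as z2:
        names1 = {i.filename for i in z1.infolist() if not i.is_dir()}
        names2 = {i.filename for i in z2.infolist() if not i.is_dir()}
        for name in sorted(names1 - names2):
            yield ("only in first", prefix + name)
        for name in sorted(names2 - names1):
            yield ("only in second", prefix + name)
        for name in sorted(names1 & names2):
            data1, data2 = z1.read(name), z2.read(name)
            if data1 == data2:
                continue
            if name.lower().endswith(NESTED_SUFFIXES):
                # Nested archive: compare its entries rather than its raw bytes.
                yield from diff_zips(io.BytesIO(data1), io.BytesIO(data2),
                                     prefix + name + "!/")
            else:
                yield ("differs", prefix + name)
```

A side effect of recursing is that two nested archives whose bytes differ only in metadata (timestamps, compression level) produce no output, since their entries still match.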

edited Dec 23 '15 at 12:23

answered Dec 23 '15 at 12:16

ilvez

score 1 · Answer 9 · edited Jan 10 '18 at 14:48

A Python solution for zip files (Python 3):

import difflib
import zipfile

def diff(filename1, filename2):
    """Print differences between two zip archives; return 0 if equal, else 1."""
    differs = False

    # ZipFile opens the path in binary mode itself, so no separate open() is needed.
    with zipfile.ZipFile(filename1) as z1, zipfile.ZipFile(filename2) as z2:
        if len(z1.infolist()) != len(z2.infolist()):
            print("number of archive elements differ: {} in {} vs {} in {}".format(
                len(z1.infolist()), z1.filename, len(z2.infolist()), z2.filename))
            return 1
        names2 = set(z2.namelist())
        for zipentry in z1.infolist():
            if zipentry.filename not in names2:
                print("no file named {} found in {}".format(zipentry.filename,
                                                            z2.filename))
                differs = True
            else:
                # difflib works on str, so decode each entry (assumes text content).
                lines1 = z1.read(zipentry.filename).decode("utf-8", "replace").splitlines(True)
                lines2 = z2.read(zipentry.filename).decode("utf-8", "replace").splitlines(True)
                delta = ''.join(x[2:] for x in difflib.ndiff(lines1, lines2)
                                if x.startswith('- ') or x.startswith('+ '))
                if delta:
                    differs = True
                    print("content for {} differs:\n{}".format(
                        zipentry.filename, delta))
    if not differs:
        print("all files are the same")
        return 0
    return 1

Use as

diff(filename1, filename2)

It compares files line-by-line in memory and shows changes.

score 0 · Answer 10 · answered Jul 12 '17 at 19:00

I generally use an approach like @mrabbit's but run 2 unzip commands and diff the output as required. For example I need to compare 2 Java WAR files.

$ sdiff --width 160 \
   <(unzip -l -v my_num1.war | cut -c 1-9,59-,49-57 | sort -k3) \
   <(unzip -l -v my_num2.war | cut -c 1-9,59-,49-57 | sort -k3)

Resulting in output like so:

--------          -------                                                       --------          -------
Archive:                                                                        Archive:
-------- -------- ----                                                          -------- -------- ----
48619281          130 files                                                   | 51043693          130 files
    1116 060ccc56 index.jsp                                                         1116 060ccc56 index.jsp
       0 00000000 META-INF/                                                            0 00000000 META-INF/
     155 b50f41aa META-INF/MANIFEST.MF                                        |      155 701f1623 META-INF/MANIFEST.MF
 Length   CRC-32  Name                                                           Length   CRC-32  Name
    1179 b42096f1 version.jsp                                                       1179 b42096f1 version.jsp
       0 00000000 WEB-INF/                                                             0 00000000 WEB-INF/
       0 00000000 WEB-INF/classes/                                                     0 00000000 WEB-INF/classes/
       0 00000000 WEB-INF/classes/com/                                                 0 00000000 WEB-INF/classes/com/
...
...

score 0 · Answer 11 · answered Jun 04 '18 at 02:46

I gave up trying to use existing tools and wrote a little bash script that works for me:

#!/bin/bash
# Author: Onno Benschop, onno@itmaze.com.au
# Note: This requires enough space for both archives to be extracted in the tempdir

if [ $# -ne 2 ] ; then
  echo "Usage: $(basename "$0") zip1 zip2" >&2
  exit 1
fi

# Make temporary directories
archive_1=$(mktemp -d)
archive_2=$(mktemp -d)

# Unzip the archives
unzip -qqd"${archive_1}" "$1"
unzip -qqd"${archive_2}" "$2"

# Compare them
diff -r "${archive_1}" "${archive_2}"

# Remove the temporary directories
rm -rf "${archive_1}" "${archive_2}"
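
Since the asker mentions not being on Unix, the same extract-then-compare approach can be sketched with Python's standard library. The function names are hypothetical. Unlike `diff -r`, this sketch only reports which files differ, not the line-level changes, and it needs enough temporary disk space for both archives.

```python
import filecmp
import os
import tempfile
import zipfile


def _list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def extract_and_compare(zip1, zip2):
    """Extract both archives to temp dirs; return (only_first, only_second, differ)."""
    with tempfile.TemporaryDirectory() as dir1, tempfile.TemporaryDirectory() as dir2:
        with zipfile.ZipFile(zip1) as z:
            z.extractall(dir1)
        with zipfile.ZipFile(zip2) as z:
            z.extractall(dir2)
        files1, files2 = _list_files(dir1), _list_files(dir2)
        differ = sorted(
            name for name in files1 & files2
            # shallow=False forces a byte-by-byte comparison of the contents
            if not filecmp.cmp(os.path.join(dir1, name),
                               os.path.join(dir2, name), shallow=False))
        # The temporary directories are removed when the with-block exits.
        return sorted(files1 - files2), sorted(files2 - files1), differ
```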

score 0 · Answer 12 · answered Feb 25 '09 at 19:48

WinMerge (windows only) has lots of features and one of them is:

Archive file support using 7-Zip

RuudKok

Hashbrown · Answer 13 · 2020-01-23T01:36:46.020

A lot of the solutions here are either only checking the CRC to see if differences exist, are complicated scripts, require uncompressing to disk, use external programs, or need specific compression formats other than the one you were asking about (zcat does NOT work with zip).

Here's one that's simple, easy to read, and should work wherever you have bash. It shows the differences between the file contents (if, like me, that's what you needed when you happened across this question):

diff \
    <(zipinfo -1 "$zip1" '*' \
    | grep '[^/]$' \
    | sort \
    | while IFS= read -r file; do unzip -c "$zip1" "$file"; done \
    ) \
    <(zipinfo -1 "$zip2" '*' \
    | grep '[^/]$' \
    | sort \
    | while IFS= read -r file; do unzip -c "$zip2" "$file"; done \
    )

This decompresses in memory, not to disk, releasing data from the pipe as it diffs (it won't decompress everything and then compare, so it shouldn't use much memory).
Want to change diffing options for ignoring whitespace or using side-by-side? Change diff to diff -w or gvimdiff (this one will keep all files in memory) et cetera.
Say you only want to diff the .js files? Change * to *.js.
Only want to see the filenames that are missing from one or the other? Remove the while line and it won't bother decompressing.

Easy.

It will even safely handle (skip and record it to stderr) filenames with "illegal" characters like newlines and backslashes.
Doesn't get "safe"r than this.

slm's answer is pretty good for returning files that are different (without showing differences) and doesn't even decompress at all which is nice. If for some reason you want that but a step above CRC, in this answer you could add | sha512sum before the ; done for example and get 'the worst of both worlds' :P
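
For the "step above CRC" variant, a per-entry digest can also be computed without writing anything to disk. The sketch below uses Python's `zipfile` and `hashlib`; the function names are hypothetical, and SHA-512 is chosen only to mirror the `sha512sum` suggestion above.

```python
import hashlib
import zipfile


def entry_digests(path, algorithm="sha512", chunk_size=1 << 16):
    """Map each file entry name to the hex digest of its decompressed bytes."""
    digests = {}
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            h = hashlib.new(algorithm)
            # Stream the entry in chunks so large files are never held whole.
            with zf.open(info) as entry:
                for chunk in iter(lambda: entry.read(chunk_size), b""):
                    h.update(chunk)
            digests[info.filename] = h.hexdigest()
    return digests


def changed_entries(path1, path2):
    """Return sorted names that are missing from one archive or differ in content."""
    d1, d2 = entry_digests(path1), entry_digests(path2)
    return sorted(name for name in d1.keys() | d2.keys()
                  if d1.get(name) != d2.get(name))
```

Unlike the CRC comparison, this does decompress every entry, so it trades speed for a much stronger equality check.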

Similarly it's relatively easy to compare an archive and a real directory:

diff \
    <(zipinfo -1 "$zip" '*' \
    | grep '[^/]$' \
    | sort \
    | while IFS= read -r file; do unzip -c "$zip" "$file"; done \
    ) \
    <(find "$directory" -type f -name '*' \
    | sort \
    | while IFS= read -r file
      do
          printf 'Archive:  %s\n  inflating: %s\n' "$directory" "${file#"$directory"/}"
          cat "$file"
          echo
      done \
    )

Or, ignoring files only in the directory, basically a handy dry-run of unzip -o -d "$directory":

diff \
    <(zipinfo -1 "$zip" '*' \
    | grep '[^/]$' \
    | sort \
    | while IFS= read -r file; do unzip -c "$zip" "$file"; done \
    ) \
    <(zipinfo -1 "$zip" '*' \
    | grep '[^/]$' \
    | sort \
    | while IFS= read -r file
      do
          printf 'Archive:  %s\n  inflating: %s\n' "$directory" "$file"
          cat "$directory/$file"
          echo
      done \
    )

Windows? Sorry. Whilst the scripts are simple and would be a cinch to port to the [syntactically] fantastic PowerShell, it wouldn't work. The native cmdlet only extracts to disk, and MS still haven't fixed the broken binary data piping in PS, so you can't "safely" use an external zip.exe in this manner either.

Apparently others have done similar things using the .NET API directly, but it'd become less of an elegant port and more of a reimplementation in .NET :|

A note about the "illegal filenames" mentioned before:

If you want it to work with these it actually isn't too difficult; you'll just need to swap $file with $(echo "$file" | sed 's/\\/\\\\/g;s/\^J/\n/g;s/\^M/\r/g').

Add other ctrl chars as you happen across them.

The reason is, for some reason, even though zipinfo displays a filename with \n in it as ^J, it will not accept these safe names for unzip, only the original! And even though it CAN extract to those illegal filenames with unzip -^ there's no way to get these original filenames through zipinfo at all. So you need to build the original, illegal filename from the safe, unusable one to reference them for the diff :(

If you do this, note that there is no way to distinguish between ^J literally and \n displaying as ^J, and that zip doesn't support / or ^@ within filenames at all.

As a bonus, you can write all these diffs straight to an archive and keep them in a folder hierarchy matching the original files, instead of trying to read it all at once in one big splat.

(zipinfo -1 "$zip1"; zipinfo -1 "$zip2") \
    | grep '[^/]$' \
    | sort \
    | uniq \
    | while IFS= read -r file; do
        (diff <(unzip -p "$zip1" "$file") <(unzip -p "$zip2" "$file") | zip 'diff.zip' - \
        && zipinfo -s 'diff.zip' - | awk '{ print $4; }' | grep '[^0]' \
        && printf "@ -\n@=$file\n" | zipnote -w 'diff.zip' \
        || zip -d 'diff.zip' -
        ) >/dev/null
      done

Not as pretty a script, but now you can open it up in your gui archiver of choice or do unzip -p diff.zip some/dir/some.file to see the differences with that file specifically, or be greeted with "not found" if there are no differences, which is much prettier in practice.
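
For comparison, a similar "diffs in an archive" result can be sketched in Python. This is not a translation of the script above: the function names are hypothetical, it assumes the entries are UTF-8 text that fits in memory, and it treats an entry missing from one archive as empty.

```python
import difflib
import zipfile


def _lines(zf, name):
    """Decode one entry as UTF-8 text and split it into lines (no line endings)."""
    return zf.read(name).decode("utf-8", "replace").splitlines()


def write_diff_archive(zip1, zip2, out_path):
    """Write one unified diff per changed entry into a new zip at out_path.

    Returns the sorted names of the entries written; unchanged entries are skipped.
    """
    written = []
    with zipfile.ZipFile(zip1) as z1, zipfile.ZipFile(zip2) as z2, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
        names1 = {i.filename for i in z1.infolist() if not i.is_dir()}
        names2 = {i.filename for i in z2.infolist() if not i.is_dir()}
        for name in sorted(names1 | names2):
            # A missing entry is treated as empty, so it appears as all-removed
            # or all-added lines in the diff.
            old = _lines(z1, name) if name in names1 else []
            new = _lines(z2, name) if name in names2 else []
            delta = list(difflib.unified_diff(old, new, "a/" + name, "b/" + name,
                                              lineterm=""))
            if delta:
                # Store the diff under the original entry name to keep the hierarchy.
                out.writestr(name, "\n".join(delta) + "\n")
                written.append(name)
    return written
```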
