43

I want to tell whether two tarball files contain identical files, in terms of file name and file content, not including meta-data like date, user, group.

However, there are some restrictions. First, I have no control over whether the meta-data is included when the tar file is made; in fact, the tar file always contains meta-data, so directly diffing the two tar files doesn't work. Second, some tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)

Any idea how I can compare the two tar files? It would be better if it could be accomplished within shell scripts. Alternatively, is there any way to get each sub-file's checksum without actually untarring a tarball?

Thanks,

myjpa
  • cksum prints CRC checksum and byte count for a tarball – mechanical_meat Jun 23 '09 at 04:02
  • I agree with Adam's comment above, but I would add (and maybe it's just me) that I would get the disk space needed to untar them. – NoahD Jun 23 '09 at 04:12
  • 1
    I think cksum won't work, since meta-data is taken into account when calculating the CRC. And equal byte counts do not necessarily indicate that the file contents are identical. – myjpa Jun 23 '09 at 05:16

12 Answers

23

Try also pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with code zero if nothing changed):

pkgdiff PKG-0.tgz PKG-1.tgz


linuxbuild
12

Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum and store it in a file within the archive itself. Then, when you want to compare two archives, you just extract these checksum files and compare them.
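A minimal Python sketch of that idea, assuming you do control archive creation. The manifest name `MANIFEST.md5` and the helper names `create_with_manifest` and `read_manifest` are made up for illustration; only the standard `tarfile` and `hashlib` modules are used.

```python
import hashlib
import io
import os
import tarfile

MANIFEST_NAME = "MANIFEST.md5"  # hypothetical name for the checksum file


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never held in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def create_with_manifest(tar_path, source_dir):
    """Archive source_dir and append a sorted 'digest  path' manifest."""
    entries = []
    with tarfile.open(tar_path, "w") as tar:
        for root, _dirs, files in os.walk(source_dir):
            for fname in files:
                full = os.path.join(root, fname)
                rel = os.path.relpath(full, source_dir)
                entries.append((rel, file_md5(full)))
                tar.add(full, arcname=rel)
        # Sort by path so the manifest does not depend on directory walk order.
        text = "".join(f"{digest}  {rel}\n" for rel, digest in sorted(entries))
        data = text.encode("utf-8")
        info = tarfile.TarInfo(MANIFEST_NAME)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


def read_manifest(tar_path):
    """Read only the manifest member; the other members are skipped over."""
    with tarfile.open(tar_path) as tar:
        return tar.extractfile(MANIFEST_NAME).read().decode("utf-8")
```

Two archives built this way can then be compared with `read_manifest(a) == read_manifest(b)`, without extracting anything else.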


If you can afford to extract just one tar file, you can use the --diff option of tar to look for differences with the contents of other tar file.


One more crude trick, if you are fine with just a comparison of the filenames and their sizes.
Remember, this does not guarantee that the file contents are the same!

Execute tar tvf to list the contents of each file and store the outputs in two different files. Then slice out everything besides the filename and size columns. Preferably sort the two files too. Then just do a file diff between the two lists.

Just remember that this last scheme does not really compute a checksum.

Sample tar and output (all files are zero size in this example).

$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/

Command to generate sorted name/size list

$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/

You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.
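For comparison, the same name/size heuristic can be sketched in Python with the standard `tarfile` module, which avoids splitting columns with awk (and so copes with filenames containing spaces). The function names `name_size_listing` and `listing_diff` are illustrative, not part of any existing tool.

```python
import tarfile


def name_size_listing(tar_path):
    """Return a sorted list of (member name, size) pairs; only headers are read."""
    # Mode "r:*" lets tarfile detect gzip/bzip2/xz compression automatically.
    with tarfile.open(tar_path, "r:*") as tar:
        return sorted((member.name, member.size) for member in tar)


def listing_diff(path_a, path_b):
    """Return the (name, size) pairs found only in A and only in B."""
    a = set(name_size_listing(path_a))
    b = set(name_size_listing(path_b))
    return sorted(a - b), sorted(b - a)
```

Two empty lists mean that names and sizes match. As with the shell version, this says nothing about whether the contents are equal.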

nik
  • Thanks a lot, but I have no control of the creation of the tarballs:( – myjpa Jun 23 '09 at 06:50
  • That's unfortunate. But you have a Python solution, and it saves you the disk space that extraction would need. My other two solutions would be useful as heuristic methods which can be tried when you want speed. – nik Jun 23 '09 at 07:29
  • In fact, if you suspect with high likelihood that the two archives differ, then for fast results you could use the last solution suggested in my answer, because it will always catch files added/removed, and if a file changes, its size typically changes too. – nik Jun 23 '09 at 07:36
  • Yes, I agree. This is a quick approach to tell if file number/size changes. – myjpa Jun 24 '09 at 03:25
  • 2
    You can also pipe the outputs of two such commands directly to a diff tool e.g.: meld <(tar tvfj ... | awk ...) <(tar tvfj ... | awk ...) – Raman Feb 06 '13 at 14:29
7

tarsum is almost what you need. Take its output, run it through sort to get the ordering identical on each, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
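To give an idea of what doing the whole job in Python could look like, here is a rough sketch. It is not the tarsum code itself; the function names `member_digests` and `compare_tarballs` are made up. It reads each archive as a forward-only stream, so nothing is written to disk and only one chunk of one file is in memory at a time.

```python
import hashlib
import tarfile


def member_digests(tar_path, algorithm="sha256", chunk_size=1 << 20):
    """Map each regular file in the archive to the hex digest of its content."""
    digests = {}
    # "r|*" = read as a stream, with the compression detected automatically.
    with tarfile.open(tar_path, "r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, devices, ...
            h = hashlib.new(algorithm)
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
            digests[member.name] = h.hexdigest()
    return digests


def compare_tarballs(path_a, path_b):
    """Return (names only in A, names only in B, names whose content differs)."""
    a = member_digests(path_a)
    b = member_digests(path_b)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    changed = sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
    return only_a, only_b, changed
```

Three empty lists mean the archives hold the same file names with the same contents; dates, owners and member order are ignored.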

Greg Smith
  • Yes, I think it is helpful, the code is so straightforward. Only I have to use python. – myjpa Jun 23 '09 at 06:48
  • 2
    Doing a comparison between two tarballs requires creating a pair of lists of (file,md5) entries and computing the difference between the two lists. That's just really painful to write in straight shell, while trivial to do in Python or Perl. That's why you're unlikely to find a straight shell answer here: it's exactly the kind of problem that motivated creating those languages. If you don't want to go completely crazy writing this thing, you'd really be far better off starting with tarsum (or the tardiff Perl code) and customizing it for your specific needs than using straight shell. – Greg Smith Jun 24 '09 at 03:10
  • Just FYI, the latest tarsum in the link seems broken for me, at least on mac. (Has a compatible_mode option that is somewhat broken and I had to remove.) – Marcus Aug 18 '14 at 20:10
  • With @Marcus comment of broken package still unaddressed 5 years later it makes it hard to upvote this answer. Notice upvotes for other answers are 11. Additionally there are no screenshots here to see what you are getting yourself into. – WinEunuuchs2Unix Jun 22 '19 at 18:18
  • The small python script `tarsum` was posted on github https://github.com/mikemccabe/code/blob/master/tarsum and a version adjusted for python3 is on this gist: https://gist.github.com/sjmurdoch/5e089249bc465706f1ca32f195787ad8 . The latter worked perfectly for me on Xubuntu 19.10 – Stéphane Gourichon Jan 05 '20 at 19:04
7

Here is my variant; it checks the Unix permissions too:

It works only if the filenames are shorter than 200 characters.

diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2) <(tar -tvf 2.tar|awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2)
user1126070
6

EDIT: See the comment by @StéphaneGourichon

I realise that this is a late reply, but I came across the thread whilst attempting to achieve the same thing. The solution that I've implemented outputs the tar to stdout, and pipes it to whichever hash you choose:

tar -xOzf archive.tar.gz | sort | sha1sum

Note that the order of the arguments is important; particularly O which signals to use stdout.

Arran Schlosberg
  • 1
    This method depends on the order of files in the archive. For example, two consecutive Ubuntu daily build tarballs may have the same file content while the order of the files is not the same. – youfu May 24 '17 at 05:19
  • 4
    `tar -xOzf` dumps the entire contents of the archive, and then `sort` puts all the lines in order, which doesn't fix the "order of files in the archive" problem, but adds a new problem by mixing up the lines. The archives could differ by a line swap and it wouldn't be caught. Instead, get the list of files, omit directories, sort the list, and tell `tar` to extract in exactly that order: `tar -xOzf archive.tar.gz \`tar -tzf archive.tar.gz | sed '/\/$/d' | sort\` | sha1sum` – Roger Dueck Apr 23 '18 at 16:06
  • why wouldn't `sha1sum archive.tar.gz` work just like that? – Alexander Mills Aug 04 '18 at 22:17
  • All these approaches depend on extracting the compressed-tarball. Which is proportional to the size of the tarball. – nik Aug 05 '18 at 03:28
  • @AlexanderMills, checksum on the `tar.gz` is different from the checksum on its contents. This would not work right if meta-data of files changed across the tar-balls (but they were in fact identical otherwise). – nik Aug 05 '18 at 03:33
  • 2
    Using "sort" in a pipe practically requires holding all the decompressed archive content in *memory*. If the archives are so big that the OP cannot afford to write them to *disk*, this is bound to fail. Anyway, this is, as indicated by other comments, more work for the machine. I used tarsum from the accepted answer by @GregSmith and am very happy with it. – Stéphane Gourichon Jan 05 '20 at 19:08
  • @RogerDueck your solution is buggy: unlike the answer's version, it gave the same result when one tar.xz contained a single file that was identical to a file in a second tar.xz, even though the second archive had 2 extra non-empty files. – j riv May 06 '22 at 11:03
3

Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."

gertvdijk
Evan
  • 3
    Looking at the implementation, it untars the contents of the file into a temp directory, so it doesn't quite solve his problem :/ – Charles Ma Jun 23 '09 at 04:01
  • Additionally, `tardiff` was reporting errors for being unable to `rm` what it extracted into `/tmp/tardiff-*` which makes it even worse if you're working in a tight environment. – Alastair May 09 '13 at 14:40
  • AIUI tardiff by default only checks if the list of filenames differ, not if the files themselves do. – plugwash Apr 24 '20 at 14:58
3

There is also diffoscope, which is more generic and can compare things recursively (including various formats).

pip install diffoscope
Kuchara
1

I propose gtarsum, that I have written in Go, which means it will be an autonomous executable (no Python or other execution environment needed).

go get github.com/VonC/gtarsum

It will read a tar file, and:

  • sort the list of files alphabetically,
  • compute a SHA256 for each file content,
  • concatenate those hashes into one giant string
  • compute the SHA256 of that string

The result is a "global hash" for a tar file, based on the list of files and their content.

It can compare multiple tar files, and return 0 if they are identical, 1 if they are not.
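For readers who do have Python at hand, the four steps listed above can be approximated as follows. This is only a rough rendition of the description, not gtarsum's actual code, and the function name `global_hash` is made up.

```python
import hashlib
import tarfile


def global_hash(tar_path, chunk_size=1 << 20):
    """SHA256 over the per-file SHA256 digests, taken in sorted name order."""
    per_file = {}
    # Stream the archive ("r|*") so nothing is extracted to disk.
    with tarfile.open(tar_path, "r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            h = hashlib.sha256()
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
            per_file[member.name] = h.hexdigest()
    # Sorting by name makes the result independent of member order.
    combined = "".join(per_file[name] for name in sorted(per_file))
    return hashlib.sha256(combined.encode("ascii")).hexdigest()
```

Two archives are then considered identical when `global_hash(a) == global_hash(b)`. Note that this sketch feeds only the content digests into the final hash, following the list literally; mixing each file name into `combined` as well would also catch renames that keep the sort order.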

VonC
  • Can the concatenation step be omitted, to see what is different and not only that it is different? BTW: I tried pkgdiff in this context and stumbled over comparing archives containing bare git repositories. Being a git expert, maybe you know if there is such a tool like git_diff_bares :) – grenix Nov 29 '21 at 18:26
  • @grenix I did not implement a detailed diff, because that would involve dealing with a possibly large list to display. Comparing two bare repositories would simply involve, at least, comparing the SHA1 of their branches: not the same SHA, not the same repo. – VonC Nov 29 '21 at 22:29
1

Just throwing this out there since none of the above solutions worked for what I needed.

This function gets the md5 hash of the md5 hashes of the contents of all files whose paths match a given path. If the hashes are the same, the file hierarchy and file lists are the same.

I know it's not as performant as others, but it provides the certainty I needed.

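PATH_TO_CHECK="some/path"
# For every tarball under build/: hash each member's content with md5sum,
# keep only the hash lines that follow a matching member name,
# then hash that list of hashes and print it next to the tarball name.
for template in $(find build/ -name '*.tar'); do
    tar -xvf "$template" --to-command=md5sum |
        grep -A 1 -- "$PATH_TO_CHECK" |
        grep -v -- "$PATH_TO_CHECK" |
        awk '{print $1}' |
        md5sum |
        awk -v name="$template" '{print name, $1}'
done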

*note: An invalid path simply returns nothing.

SgtPooki
0

If you neither want to extract the archives nor need to see the differences, try diff's -q option:

diff -q 1.tar 2.tar

The quiet result will be "Files 1.tar and 2.tar differ", or nothing if there are no differences.

Alastair
  • 2
    This will tell whether two tar files are exactly identical or not but the OP is looking for a way to compare two tarfiles discarding owner/group/timestamp data, effectively the OP wants to know whether the files inside the tarballs are the same. – dreamlax May 27 '13 at 06:15
0

There is a tool called archdiff. It is basically a Perl script that can look into the archives.

Takes two archives, or an archive and a directory and shows a summary of the
differences between them.
cmcginty
0

I had a similar question and solved it with Python; here is the code. P.S.: although this code compares the contents of two zipballs, the approach is similar for tarballs. I hope it helps.

import hashlib
import os
import shutil
import zipfile


def decompressZip(zipName, dirName):
    """Extract every member of the zip into dirName and return the member names."""
    with zipfile.ZipFile(zipName, "r") as zipFile:
        fileNames = zipFile.namelist()
        for file in fileNames:
            zipFile.extract(file, dirName)
    return fileNames


def md5sum(filename):
    """Return the upper-case hex MD5 digest of a file's content."""
    md5obj = hashlib.md5()
    with open(filename, "rb") as f:
        md5obj.update(f.read())
    return md5obj.hexdigest().upper()


if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")

    oldDict = dict()
    newDict = dict()

    # Map each extracted regular file to its MD5 (directories are skipped).
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldDict[oldFile] = md5sum(tmpOldFile)

    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newDict[newFile] = md5sum(tmpNewFile)

    additionList = list()
    modifyList = list()

    # A name missing from the old archive is an addition;
    # a name present in both with a different MD5 is a modification.
    for key in newDict:
        if key not in oldDict:
            additionList.append(key)
        elif newDict[key] != oldDict[key]:
            modifyList.append(key)

    print("new file list:%s" % additionList)
    print("modified file list:%s" % modifyList)

    # Clean up the temporary extraction directories.
    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")
Anton Samsonov