4

I have two *.tar files with similar contents. I want to verify which files are the same. A lot of the files are big, so comparing hashes would require extracting every file from each tar and computing its hash. Is there a way to hash the files in a tar without having to extract them? Is there another way to compare files across two *.tar files?

Stephen Rasku

2 Answers

6

If it's GNU tar, run this:

tar -xf file1.tar --to-command=file-stats-from-tar

where file-stats-from-tar is an executable script somewhere in $PATH containing:

#!/bin/bash
# tar pipes one archive member to stdin per invocation.

md5=$(md5sum)
md5=${md5%% *}   # keep the hash, drop the trailing " -"

printf '%s\t%s\n' "$md5" "$TAR_FILENAME"

Swap md5sum for another hash tool (sha256sum, for example) if you need to.

This does it all in a single pass.

How it works: the --to-command option tells tar to pipe the contents of each regular file to the standard input of the command you specify, running it once per file with a number of environment variables set (only TAR_FILENAME is used here).
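To get from per-file hashes to an answer for the original question (which files match across two archives), the hash lists of both archives can be sorted and intersected. The following is only a sketch under some assumptions: GNU tar, sha256sum in place of md5sum, an inline command instead of the helper script, and made-up archive and file names (file1.tar, file2.tar, same.txt, diff.txt) that exist purely for the demo.

```shell
#!/bin/bash
# Sketch: list members that have the same name AND the same contents in
# two archives. Assumes GNU tar (--to-command) and coreutils sha256sum.
set -eu

# hash_tar ARCHIVE: print "<sha256>  <member name>" per regular file, sorted.
hash_tar() {
    tar -xf "$1" \
        --to-command='printf "%s  %s\n" "$(sha256sum | cut -c1-64)" "$TAR_FILENAME"' |
        LC_ALL=C sort
}

# Demo data: two throwaway archives with hypothetical file names.
work=$(mktemp -d)
mkdir "$work/a" "$work/b"
echo same > "$work/a/same.txt"; echo same > "$work/b/same.txt"
echo one  > "$work/a/diff.txt"; echo two  > "$work/b/diff.txt"
tar -cf "$work/file1.tar" -C "$work/a" same.txt diff.txt
tar -cf "$work/file2.tar" -C "$work/b" same.txt diff.txt

hash_tar "$work/file1.tar" > "$work/list1"
hash_tar "$work/file2.tar" > "$work/list2"

# comm -12 keeps only the lines present in both sorted lists.
identical=$(LC_ALL=C comm -12 "$work/list1" "$work/list2")
echo "$identical"

rm -rf "$work"
```

Because each line holds both the hash and the member name, a file counts as identical only if both match. Using comm -3 instead would show the lines unique to either archive.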

  • I liked your answer more. I wasn't aware of the to-command option of tar before. Thanks! I tried to pack it down to a single line to avoid having to make helper scripts. I couldn't seem to get the variable expansions to all work easily, so I used a pipe through awk to remove the extra '-' column: `tar -xf test.tar --to-command='/bin/sh -c "echo $(md5sum | awk '\''{print $1}'\'') $TAR_FILENAME"'` – JustinB Jan 27 '20 at 15:52
  • Here's a cleaner/shorter one: ``tar -xf test.tar --to-command='sh -c "echo $(md5sum | colrm 35) $TAR_FILENAME"'`` – JustinB Jan 27 '20 at 16:21
  • Oh good one. (In my case, I use a much more complex version of that script to compare file mode, user/group, and timestamp also, so I just pared down what I had without thinking it could be a one-liner in this reduced form!) –  Jan 27 '20 at 17:22
1

There may be more efficient ways, but I was able to come up with this in a few moments:

tar tf test.tar | while IFS= read -r x ; do echo "$(tar xfO test.tar "${x}" | md5sum) ${x}" ; done

The first tar tf just lists the files in the archive, and that list is piped into the while read x bash loop. For each filename, the loop computes the hash by extracting that file to standard output with tar xfO and piping it into md5sum. You could replace md5sum with your preferred hash tool. The odd-looking echo "$(...) ${x}" just keeps the output similar to regular hash output, with the hashes on the left and the filenames on the right. Without it, you only get the hashes of all the files but no names, so you can't tell which hash belongs to which file. Even with it, there is an extra column of - in the output that isn't normally there. It could easily be removed with a colrm command in the pipeline.

This might not be the most efficient since it has to go through the tar file n+1 times if there are n files in it, but hopefully the tar contents are cached after the first read through.
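For illustration, here is a sketch of the same loop with the extra - column trimmed and directory entries skipped. It uses cut -c1-32 rather than colrm (an md5 hash is 32 hex characters), and the function name hash_members and the demo file dir/hello.txt are made up for this example.

```shell
#!/bin/bash
# Sketch: per-member md5 listing via one tar pass per file.
set -eu

# hash_members ARCHIVE: print "<md5>  <member name>" for each non-directory.
hash_members() {
    tar -tf "$1" | while IFS= read -r x; do
        case "$x" in */) continue ;; esac   # names ending in / are directories
        # -O writes the member to stdout; cut keeps just the 32-char hash.
        printf '%s  %s\n' "$(tar -xOf "$1" "$x" | md5sum | cut -c1-32)" "$x"
    done
}

# Demo data: a throwaway archive with a hypothetical file in it.
work=$(mktemp -d)
mkdir "$work/dir"
printf 'hello\n' > "$work/dir/hello.txt"
tar -cf "$work/test.tar" -C "$work" dir

listing=$(hash_members "$work/test.tar")
echo "$listing"

rm -rf "$work"
```

This still rereads the archive once per member, so it keeps the n+1 cost described above.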

JustinB
  • I continued looking for other methods and think there might be a way using a tool like AVFS. I'm not as excited about that solution since it isn't installed on most systems, but I'm still thinking about it... – JustinB Jan 26 '20 at 00:21