I have two *.tar files with similar contents. I want to verify which files are the same. A lot of the files are big, so comparing hashes would require extracting every file from each tar and computing its hash. Is there a way to hash the files in a tar without having to extract it? Is there another way to compare files across two *.tar files?

2 Answers
If it's GNU tar, run this:
tar -xf file1.tar --to-command=file-stats-from-tar
where file-stats-from-tar is an executable script somewhere in $PATH containing:
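#!/bin/bash
# tar pipes the member's contents to stdin; md5sum prints "<hash>  -"
md5=$(md5sum)
# Keep only the hash (strip everything from the first space onward)
md5=${md5%% *}
printf '%s\t%s\n' "$md5" "$TAR_FILENAME"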
Replace md5sum with a different hash tool if you need to.
This does it all in a single pass.
How it works: the --to-command option tells tar to pipe each file separately to the command you specify, with a bunch of environment variables set (we only use TAR_FILENAME here).
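To get from per-file hashes to the actual comparison the question asks about, something along these lines might work. It is only a sketch: it assumes GNU tar (for --to-command) and coreutils, swaps in sha256sum for md5sum, and the function names tar_hashes and tar_common are made up for illustration.

```shell
#!/bin/sh
# Sketch: find members with the same name AND the same content in two archives.
# Assumes GNU tar (--to-command) and sha256sum; function names are hypothetical.

# Print "hash<TAB>name" for every regular file in archive $1, sorted.
tar_hashes() {
    tar -xf "$1" --to-command='printf "%s\t%s\n" "$(sha256sum | cut -d" " -f1)" "$TAR_FILENAME"' |
        LC_ALL=C sort
}

# Print the lines common to both archives (identical hash and identical name).
tar_common() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    tar_hashes "$1" >"$tmp_a"
    tar_hashes "$2" >"$tmp_b"
    # comm -12 keeps only lines present in both sorted inputs
    LC_ALL=C comm -12 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Running tar_common file1.tar file2.tar should print one line per member whose path and hash match in both archives. Each archive is still read only once.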
-
I liked your answer more. I wasn't aware of the to-command option of tar before. Thanks! I tried to pack it down to a single line to avoid having to make helper scripts. I couldn't seem to get the variable expansions to all work easily, so I used a pipe through awk to remove the extra '-' column: `tar -xf test.tar --to-command='/bin/sh -c "echo $(md5sum | awk '\''{print $1}'\'') $TAR_FILENAME"'` – JustinB Jan 27 '20 at 15:52
-
Here's a cleaner/shorter one: ``tar -xf test.tar --to-command='sh -c "echo $(md5sum | colrm 35) $TAR_FILENAME"'`` – JustinB Jan 27 '20 at 16:21
-
Oh good one. (In my case, I use a much more complex version of that script to compare file mode, user/group, and timestamp also, so I just pared down what I had without thinking it could be a one-liner in this reduced form!) – Jan 27 '20 at 17:22
There may be more efficient ways, but I was able to come up with this in a few moments:
tar tf test.tar | while IFS= read -r x ; do echo "$(tar xfO test.tar "${x}" | md5sum) ${x}" ; done
The first tar tf just lists the files in the archive, and that list is piped into the while read bash loop. For each filename, the loop extracts that member to stdout with tar xfO and pipes it into md5sum to get the hash.
You could obviously replace md5sum with your preferred hash tool. The odd-looking echo "$(...) ${x}" is just there to keep the output similar to regular hash output, with the values on the left and the filenames on the right. Without it you just get the hashes of all the files but no names, so you can't tell which hash belongs to which file. Even with it there is an extra column of - in the output that isn't normally there (md5sum prints - as the filename when it reads from stdin). It could easily be removed with a colrm command in the pipeline.
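For what it's worth, here is a rough sketch of the same loop with the extra - column dropped. It uses cut rather than colrm, skips directory entries (extracting a directory name would dump every file under it into one hash), and the function name tar_hashes_loop is made up for illustration. It assumes md5sum is available.

```shell
#!/bin/sh
# Sketch: hash each regular member of archive $1 without extracting to disk.
# Reads the archive once per member, so it is slow on large archives.
# The function name is hypothetical.
tar_hashes_loop() {
    tar -tf "$1" | while IFS= read -r name; do
        # Skip directory entries; extracting one would output every file beneath it.
        case $name in */) continue ;; esac
        # cut keeps only the hash, dropping the "-" that md5sum prints for stdin.
        hash=$(tar -xOf "$1" "$name" | md5sum | cut -d' ' -f1)
        printf '%s  %s\n' "$hash" "$name"
    done
}
```

Calling tar_hashes_loop test.tar should give output in the usual md5sum layout, hash first and member name second.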
This might not be the most efficient since it has to go through the tar file n+1 times if there are n files in it, but hopefully the tar contents are cached after the first read through.

-
I continued looking for other methods and think there might be a way using a tool like AVFS. I'm not as excited about that solution since it isn't installed on most systems, but I'm still thinking about it... – JustinB Jan 26 '20 at 00:21