8

Packaging a folder on a SUSE Linux Enterprise Server 12 SP3 system using GNU tar 1.30 always gives different md5 checksums although the file contents do not change.

I run tar to package my folder that contains a simple text file:

tar cf package.tar folder

Although the content is exactly the same, the resulting tar has a different md5 (or sha1) checksum every time:

$> rm -rf package.tar && tar cf package.tar folder && md5sum package.tar
e6383218596fffe118758b46e0edad1d  package.tar
$> rm -rf package.tar && tar cf package.tar folder && md5sum package.tar
1c5aa972e5bfa2ec78e63a9b3116e027  package.tar

Because the Linux file system seems to deliver files to tar in an arbitrary order, I tried the --sort option, but that does not fix the checksum mismatch. tar's --mtime option does not help either, since the file timestamps are exactly the same between runs.

I appreciate any help on this.

Robert
  • ...last access time of the file? Maybe? – linuxfan says Reinstate Monica Oct 05 '18 at 15:21
  • Could the permissions be changed? Look at [this](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_06) to see what the TAR header contains. – DaBler Oct 05 '18 at 15:40
  • Is file size the same for both the files? – DaBler Oct 05 '18 at 15:42
  • Can you unpack those two different archives and compare the folder content? – DaBler Oct 06 '18 at 09:39
  • @DaBler filesize and folder content are exactly the same for both versions of the file. – Robert Oct 09 '18 at 09:27
  • @Robert: and what about the metadata (mtime, permissions, owner)? – DaBler Oct 10 '18 at 06:49
  • @Robert: If you could share the two TAR archives (or their parts), I can compare them for you. – DaBler Oct 10 '18 at 06:53
  • @DaBler: I appreciate your help on this one. You can find two sample tars here: https://github.com/robertfoobar/tar-checksum. I created them using the options suggested by Michael. Md5 Checksums should be c33631c5086593eade0733c1913f0c0e and 67ce66b99249f3401b4e3649f285d875 – Robert Oct 12 '18 at 15:12
  • @Robert: On my side: $ md5sum run1/assets.tar run2/assets.tar 67ce66b99249f3401b4e3649f285d875 run1/assets.tar 67ce66b99249f3401b4e3649f285d875 run2/assets.tar – DaBler Oct 12 '18 at 15:38
  • @Robert: Also `diff` confirms that the files are the same. – DaBler Oct 12 '18 at 15:39
  • @DaBler Seems I copied one file version twice. I just updated the repo. Now the files are different. $ md5sum run1/assets.tar run2/assets.tar 67ce66b99249f3401b4e3649f285d875 *run1/assets.tar 84d0717d1d72f0f72331d74f0d36514c *run2/assets.tar – Robert Oct 15 '18 at 11:37

3 Answers

8

The archives you provided contain pax extended headers. A quick glance at their structure reveals that they differ in these two fields:

  1. The process ID of the archiving process, here tar (it is part of the extended header's name in the ustar header block, so it also changes the checksum of that header block).
  2. The atime (access time) in the extended header.

One workaround for reproducible archive creation is to enforce the old Unix ustar format (rather than the pax/POSIX format):

tar --format=ustar -cf package.tar folder

The other choice is to manually set the extended name and delete the atime while preserving the pax format:

tar --format=pax --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime -cf package.tar folder

Now the md5sum should be the same for both archives.
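To check the ustar workaround on your own machine, you can build the same archive twice and compare the results byte for byte. The following is only a sketch: the scratch directory, `file.txt`, `run1.tar`, `run2.tar` and the helper `make_archive` are made-up names, and `cmp` is used instead of md5sum simply because it compares the bytes directly.

```shell
# Sketch: create the same ustar archive twice and compare the bytes.
# All names below are hypothetical and only serve the demonstration.
workdir=$(mktemp -d)
mkdir "$workdir/folder"
printf 'hello\n' > "$workdir/folder/file.txt"

make_archive() {
    # $1: path of the archive to write
    tar --format=ustar -C "$workdir" -cf "$1" folder
}

make_archive "$workdir/run1.tar"
cat "$workdir/folder/file.txt" > /dev/null   # read the file between the two runs
make_archive "$workdir/run2.tar"

# cmp -s is silent and only reports through its exit status
if cmp -s "$workdir/run1.tar" "$workdir/run2.tar"; then
    result=identical
else
    result=different
fi
echo "$result"
```

The same check should work for the pax variant if you swap the tar options inside `make_archive`.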

DaBler
5

The header of a tar file contains several fields that can differ each time you re-tar a set of files. For instance, the last access time, and possibly the modification time, may be different on each run.

According to this article it is possible with GNU tar to produce identical output for identical input by doing the following:

# requires GNU Tar 1.28+
$ tar --sort=name \
      --mtime="2018-10-05 00:00Z" \
      --owner=0 --group=0 --numeric-owner \
      -cf product.tar build
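
If the archive is written in pax format, the flags above may not be enough on their own, because the pax extended headers can still vary. A possible sketch that combines them with the `--pax-option` settings from the other answer is shown below. It assumes GNU tar 1.28 or newer; `repro_tar`, the scratch directory and the file names are hypothetical, and `delete=ctime` is an extra assumption added here because `touch` changes the ctime on disk.

```shell
# Sketch only: assumes GNU tar 1.28+ (needed for --sort).
# repro_tar and all file names are hypothetical.
workdir=$(mktemp -d)
mkdir "$workdir/build"
printf 'b\n' > "$workdir/build/b.txt"
printf 'a\n' > "$workdir/build/a.txt"

repro_tar() {
    # $1: output archive, $2: parent directory, $3: directory to archive
    tar --sort=name \
        --format=pax \
        --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
        --mtime="2018-10-05 00:00Z" \
        --owner=0 --group=0 --numeric-owner \
        -C "$2" -cf "$1" "$3"
}

if tar --version 2>/dev/null | grep -q 'GNU tar'; then
    repro_tar "$workdir/run1.tar" "$workdir" build
    touch "$workdir/build/a.txt"   # changes the timestamps on disk
    repro_tar "$workdir/run2.tar" "$workdir" build
    if cmp -s "$workdir/run1.tar" "$workdir/run2.tar"; then
        result=identical
    else
        result=different
    fi
else
    result=skipped   # not GNU tar, so these flags may not be available
fi
echo "$result"
```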
Michael Powers
  • Thanks for your input, Michael. Unfortunately it doesn't work for me. It seems like additional header information is still affecting the tar's checksum. I also tried to touch all files with `touch --no-dereference -t "201810050000"` before tarring. FYI, I realized that my folder contains symlinks which point to locations inside the folder, so IMHO that shouldn't make a difference. – Robert Oct 09 '18 at 12:04
  • In that case you'll have to do a [binary diff](https://superuser.com/questions/125376/how-do-i-compare-binary-files-in-linux) between two different tars to see what's different. There's very likely some file metadata getting scooped up and put into your tar that you don't want. Once you find out where there's a difference you can use the [tar spec](https://www.gnu.org/software/tar/manual/html_node/Standard.html) to figure out which field is different. – Michael Powers Oct 09 '18 at 12:14
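
One possible way to do such a binary diff is sketched below. tar archives are made of 512-byte blocks, so the sketch turns the offset of the first differing byte into a block index. The helper `first_diff_block` and the two test files are hypothetical; in practice you would pass the two archives instead.

```shell
# Sketch: find the 512-byte block that holds the first differing byte.
# first_diff_block and the .bin files are hypothetical names.
first_diff_block() {
    # cmp -l prints one line per differing byte:
    # "<1-based offset> <octal byte in file 1> <octal byte in file 2>"
    offset=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1 }')
    [ -n "$offset" ] || return 1          # no differing byte found
    echo $(( (offset - 1) / 512 ))        # 0-based block index
}

# Demonstration on two 1024-byte files that differ at byte offset 600
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/a.bin" bs=512 count=2 2>/dev/null
cp "$tmp/a.bin" "$tmp/b.bin"
printf 'X' | dd of="$tmp/b.bin" bs=1 seek=600 conv=notrunc 2>/dev/null

block=$(first_diff_block "$tmp/a.bin" "$tmp/b.bin")
echo "$block"
```

With real archives, something like `dd if=run1.tar bs=512 skip="$block" count=1 | od -c` dumps that block so it can be compared against the header layout in the tar spec.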
1

tar -p --sort=name --no-acls --no-selinux --no-xattrs worked for a similar situation on Slackware 14.2, using GNU tar 1.29.
The -p option stands for --preserve-permissions and is the default for the root user.
Also consider --atime-preserve, which keeps tar from changing the access times of the files it reads (depending on purpose).

ouflak
themanjay