I need to compare the files inside .tar.gz archives to ensure that no archive contains duplicate files. I am currently using ICSharpCode.SharpZipLib, which makes it easy to check for duplicates in Zip files, since a ZipEntry has a "CRC" property. I can get the CRC and file size of each entry, use LINQ to find any entries that match in both, and then throw an error or do whatever else is necessary.

However, TarEntry has no such property or method, aside from the standard GetHashCode method, which, to my understanding, computes its hash from the entry's metadata, so copies of the same file do not have the same hash. Is there a way I can (quickly) compute a hash of the contents of each file inside the .tar.gz archive? Or is there another way to compare the contents?

ilyketurdles
  • If you just want to check whether the `.tar.gz` file itself is a duplicate of some other `.tar.gz` file, then a `sha1` or `sha256` hash of the file should be enough. If you intend to open the archive and check each file individually, I'm not sure what to suggest, but a cryptographic hash would still be your best bet to determine equality of contents. – code_dredd Dec 08 '15 at 13:54
  • Yeah, I'm looking to compare each file inside the .tar.gz to make sure there are no duplicates within the file. Thanks for the suggestion though. I'll probably implement that later to check the .tar.gz files themselves. – ilyketurdles Dec 08 '15 at 15:01
  • Unfortunately, I don't see how you'd do what you want to do without first having to extract all the contents. It seems you'd need to uncompress and extract the archive and then process each individual file with a crypto hash, but you'd have to compare everything against everything else, an `O(n^2)` operation. I think what you're trying to do is a bad idea. Consider checking the archives directly. If you're worried about duplicates, then try to take care of it before they get created in the first place. – code_dredd Dec 08 '15 at 19:58

1 Answer

First, if two files have different lengths, then right off the bat you know that they can't be equal. So use that for either zip or tar as your first filter.
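
As a rough illustration of that first filter, here is a minimal sketch in Python (chosen only for brevity; the question uses C#, where the same idea would apply to whatever size field the tar entry exposes). The function name `candidates_by_size` is hypothetical. It reads only the tar headers and reports sizes shared by more than one regular file; everything else is already known to be unique.

```python
import tarfile
from collections import defaultdict


def candidates_by_size(archive_path):
    """Return {size: [names]} for sizes shared by two or more regular files."""
    by_size = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links, and other non-file entries.
            if member.isfile():
                by_size[member.size].append(member.name)
    # A size that occurs only once cannot belong to a duplicate.
    return {size: names for size, names in by_size.items() if len(names) > 1}
```

Only the members returned here need any further comparison.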

Second, a hash will tell you if two files are different, but it can't tell you that they're the same. If equality is rare and the hash values have already been computed, then a hash is a good way to rule out most contenders for equality. However, if two hash values are equal, you then need to compare the files directly to see if they are equal.
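
For a tar archive, where no hash is stored, the hash would have to be computed by reading each member. The following is a sketch only, again in Python rather than C#, with a hypothetical function name `hash_candidates` and SHA-256 chosen as an assumption (the answer does not name a hash). It hashes only members whose size is shared, and it returns groups of *candidates*: members with equal size and equal digest, which by the reasoning above would still need a direct comparison to be confirmed as equal.

```python
import hashlib
import tarfile
from collections import Counter, defaultdict


def hash_candidates(archive_path, chunk_size=65536):
    """Group regular-file members by (size, SHA-256 digest)."""
    groups = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        size_counts = Counter(m.size for m in members)
        for member in members:
            if size_counts[member.size] < 2:
                continue  # unique size, so it cannot be a duplicate
            digest = hashlib.sha256()
            stream = tar.extractfile(member)
            # Read in chunks so large members are never held in memory whole.
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
            groups[(member.size, digest.hexdigest())].append(member.name)
    return [names for names in groups.values() if len(names) > 1]
```

Different digests prove two members differ; equal digests only nominate them for a byte-by-byte check.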

If a hash has not already been computed, then it will usually be faster to skip computing a hash and simply compare the files that have equal length. The exception is when you often have sets of files with the same length and a long common prefix, so that they differ only some significant length into the file.
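
A minimal sketch of that hash-free approach, in Python for illustration, might look like the following. The names `members_equal` and `duplicate_pairs` are hypothetical, and the sketch assumes member names are unique within the archive. It reopens the archive for each comparison so that both members can be read sequentially, which is simple but not tuned for speed. The comparison stops at the first differing chunk, which is why it is cheap when files differ early.

```python
import tarfile
from collections import defaultdict
from itertools import combinations


def members_equal(archive_path, name_a, name_b, chunk_size=65536):
    """Compare two regular-file members chunk by chunk, stopping at the first difference."""
    # Two independent handles, so each member is read front to back.
    with tarfile.open(archive_path, "r:gz") as tar_a, \
            tarfile.open(archive_path, "r:gz") as tar_b:
        stream_a = tar_a.extractfile(name_a)
        stream_b = tar_b.extractfile(name_b)
        while True:
            chunk_a = stream_a.read(chunk_size)
            chunk_b = stream_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both streams ended at the same point
                return True


def duplicate_pairs(archive_path):
    """Return pairs of member names whose contents are byte-for-byte equal."""
    by_size = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                by_size[member.size].append(member.name)
    # Only members of equal length are ever compared.
    return [
        (a, b)
        for names in by_size.values()
        for a, b in combinations(names, 2)
        if members_equal(archive_path, a, b)
    ]
```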

Mark Adler