8

I am currently working on a tool that uploads a group of files, then uses MD5 checksums to compare them with the previously uploaded batch and tells you which files have changed.

For regular files this works fine, but some of the uploaded files are zip archives, whose checksums almost always differ even when the files inside them are the same.

Is there a different type of checksum I can use to check whether these archives have changed, without having to unzip each one and compare its contents file by file?

Here is my current function:

function check_if_changed($date, $folder, $filename)
{
  $base = './wp-content/uploads/Base/';
  // Collect every entry in the base directory (one sub-folder per upload date).
  $folders = array();
  $dh = opendir($base);
  while (($file = readdir($dh)) !== false) {
    $folders[] = $file;
  }
  closedir($dh);
  sort($folders);
  // Find the batch uploaded immediately before $date.
  $position = array_search($date, $folders);
  if ($position === false || $position === 0) {
    return true; // no earlier batch to compare against
  }
  $prev_folder = $folders[$position - 1];
  if ($prev_folder == '.' || $prev_folder == '..') {
    return true;
  }
  $newhash = md5_file($base.$date.'/'.$folder.'/'.$filename);
  $oldhash = md5_file($base.$prev_folder.'/'.$folder.'/'.$filename);
  // The file counts as changed when the two MD5 digests differ.
  return $oldhash != $newhash;
}
Kit Barnes

2 Answers

9

Inside a zip archive, each "file" is stored with metadata such as last modification time, filename, and size in bytes, and, most importantly, a CRC32 checksum.

Basically, you can operate on the zip archive in a binary fashion: find each file's metadata header and compare its checksum to the previously stored checksum. You don't need to decompress anything to access the metadata in a zip archive, so this would be extremely fast.

http://en.wikipedia.org/wiki/Zip_(file_format)

Edit: actually, ZipArchive offers this functionality. See: http://www.php.net/manual/en/ziparchive.statindex.php
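To illustrate the idea, here is a minimal sketch that reads the stored CRC32 of every entry through `ZipArchive::statIndex()` and compares two archives on that basis. The helper names `zip_crc_map()` and `zip_contents_changed()` are made up for this example and are not part of the asker's code; error handling is kept to a bare minimum.

```php
<?php
// Hypothetical helper: map each entry name in a zip archive to its stored CRC32.
function zip_crc_map(string $zipPath): array
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open zip archive: $zipPath");
    }
    $map = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        // statIndex() only reads the entry's metadata; nothing is decompressed.
        $stat = $zip->statIndex($i);
        $map[$stat['name']] = $stat['crc'];
    }
    $zip->close();
    ksort($map); // the order of entries inside the archive should not matter
    return $map;
}

// Hypothetical helper: true when the two archives differ in entry names or CRCs.
function zip_contents_changed(string $oldZip, string $newZip): bool
{
    return zip_crc_map($oldZip) !== zip_crc_map($newZip);
}
```

In the question's function, a call like `zip_contents_changed($oldPath, $newPath)` could stand in for the two `md5_file()` calls whenever the filename ends in `.zip`. Keep in mind that CRC32 is only 32 bits wide, so comparing `$stat['size']` as well would make an accidental match less likely.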

goat
  • OT: The Python zip library lets you grab the CRCs from zip files; you can also use the binascii module to calculate a CRC for any arbitrary data. – Stuart Axon Jun 02 '14 at 15:38
  • Looks like the checksum is `crc32b`, which can be generated by `$newCrc = hexdec(hash_file("crc32b", "myPath/" . $name));` on a 64bit machine. – Dimitry K Dec 08 '14 at 10:34
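Building on the `crc32b` comment above, here is a rough sketch of how a loose file on disk could be compared against an entry inside an archive without extracting anything. The function name `file_matches_zip_entry()` is hypothetical, and the `hexdec()` comparison assumes a 64-bit PHP build, as the comment notes.

```php
<?php
// Hypothetical helper: does the file on disk match the entry stored in the archive?
function file_matches_zip_entry(string $filePath, string $zipPath, string $entryName): bool
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open zip archive: $zipPath");
    }
    $stat = $zip->statName($entryName); // metadata lookup only
    $zip->close();
    if ($stat === false) {
        return false; // the archive has no entry with that name
    }
    // crc32b is the CRC-32 variant stored in zip archives.
    // Assumption: 64-bit PHP, so hexdec() returns an int that fits the full CRC.
    $fileCrc = hexdec(hash_file('crc32b', $filePath));
    return $fileCrc === $stat['crc'] && filesize($filePath) === $stat['size'];
}
```

This could be handy if the previous batch was stored unpacked while the new upload arrives as a zip, or the other way round.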
0

You could pull only the file-data parts out of the ZIP file and hash those, but then you would have to strip out the meta information, too!

So extracting the files is really the simplest solution.
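For what it's worth, the extraction does not have to touch the disk. The following is only a sketch of one way to do it: it reads each entry into memory with `ZipArchive::getFromIndex()` and folds the per-entry MD5 digests into a single digest for the whole archive. The name `zip_content_md5()` is invented for this example.

```php
<?php
// Hypothetical helper: one MD5 digest over the names and uncompressed
// contents of all entries, ignoring timestamps and other zip metadata.
function zip_content_md5(string $zipPath): string
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open zip archive: $zipPath");
    }
    $digests = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        $data = $zip->getFromIndex($i); // decompresses the entry into memory
        if ($data === false) {
            $zip->close();
            throw new RuntimeException("Cannot read entry $i of $zipPath");
        }
        $digests[$name] = md5($data);
    }
    $zip->close();
    ksort($digests); // make the result independent of entry order
    return md5(serialize($digests));
}
```

Two archives with the same files then give the same value, so `zip_content_md5($old) != zip_content_md5($new)` can replace the `md5_file()` comparison for zip uploads. Each entry is loaded fully into memory, so this is probably not a good fit for very large archives.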

ComFreek