4

I've been sent an HDD of new and updated files from an organisation we are working with. We already have most of the files sitting on our servers, and would like to update our local versions to match theirs.

Normally, this would be a job for something like rsync, but our problem is that the directory structure they provide is very poorly organised and we've had to rearrange their files in the past to work best with our systems.

So, my question is:

How can I find out which files in the set they have provided are new or different from the versions that we have, when the directory structures are different?

Once that question is answered, we can update the changed files, and work out where to put the new files on our system, probably somewhat manually.

ewwhite
David Dean
  • possible duplicate of [Linux: Diff Two Directories?](http://serverfault.com/questions/190503/linux-diff-two-directories) – quanta Sep 10 '12 at 02:40
  • http://serverfault.com/questions/59108/how-to-compare-differences-between-directories-linux – quanta Sep 10 '12 at 02:40
  • 2
    Not a duplicate, files can have different names or be in different subdirectories, `diff` won't help here – Hubert Kario Sep 12 '12 at 17:54

2 Answers

3

OK, here is my first attempt. It seems to work moderately well for what I need, but I am open to better suggestions:

First, get md5sums of all the files in both our filesystem and the new data:

find /location/of/data -type f -exec md5sum {} ';' > our.md5sums
find /media/newdisk -type f -exec md5sum {} ';' > their.md5sums
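
If `md5sum` isn't available on one side (a macOS workstation mounting the disk, say), a small Python walker can emit the same `hash  path` lines; the script name here is hypothetical:

#!/usr/bin/env python3
# md5walk.py -- hypothetical helper: emit md5sum-style "hash  path" lines
# for every file under the directory given as the first argument.
import hashlib
import os
import sys

for dirpath, _, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        path = os.path.join(dirpath, name)
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            # Hash in 1 MiB chunks so large files don't exhaust memory.
            for chunk in iter(lambda: f.read(1 << 20), b''):
                digest.update(chunk)
        print(digest.hexdigest() + '  ' + path)

For example: python3 md5walk.py /media/newdisk > their.md5sums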

Then I wrote a short Python script called md5diff.py:

#!/usr/bin/env python3
import sys

print("Comparing", sys.argv[1], "to", sys.argv[2])

# Build a dictionary mapping checksum -> path from source B
checksums = {}
for line in open(sys.argv[2]):
    checksum, _, path = line.partition(' ')
    checksums[checksum] = path.strip()

# Now go through source A and report where each file is in source B
for line in open(sys.argv[1]):
    checksum, _, path = line.partition(' ')
    if checksum in checksums:
        print(line.strip(), "(", sys.argv[2], ":", checksums[checksum], ")")
    else:
        print(line.strip(), "NOT IN", sys.argv[2])

So now I can use

./md5diff.py their.md5sums our.md5sums

Appending | grep "NOT IN" lists only the files on their media that we don't already have (or that differ from what we have). From there I can start to manually attack the known differences.
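
Running the same comparison the other way round lists our local files that have no counterpart on their disk, which can help catch files they have renamed:

./md5diff.py our.md5sums their.md5sums | grep "NOT IN"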

David Dean
  • Seems like a roundabout way. Couldn't you just do an `ls` and strip the paths to compare the file names only, rather than hashing the whole file? – Mark Henderson Sep 10 '12 at 02:01
  • file names are not necessarily unique, and we need to know when a file changes (so the hash would change too). – David Dean Sep 10 '12 at 02:08
1

You don't have to MD5 every file just to detect changes; modification times will catch most of them. With that said, you could probably (barring a huge data set) copy the new and updated files to local storage, use a tool like fslint to identify duplicates, and then use modification times (not just MD5 sums) to reconcile everything else.
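
As a rough sketch of that approach (names are illustrative, and it assumes modification times survived the copy onto the HDD), you could bucket files by size and mtime so that only colliding buckets need hashing:

#!/usr/bin/env python3
# statindex.py -- illustrative sketch: bucket files by (size, mtime) as a
# cheap pre-filter; only buckets holding more than one file need checksums.
import os
import sys
from collections import defaultdict

buckets = defaultdict(list)
for dirpath, _, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        path = os.path.join(dirpath, name)
        st = os.stat(path)
        buckets[(st.st_size, int(st.st_mtime))].append(path)

# Files sharing a (size, mtime) pair are candidates for being the same
# document; everything else differs without any hashing at all.
for key, paths in sorted(buckets.items()):
    if len(paths) > 1:
        print(key, paths)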

One important question: how do you know a file has been updated if its path isn't the same on the new storage? If file names aren't unique ("Sales Report August 2012.xls" could apply to many departments, for example), how do you know whether you are updating an existing file or overwriting one with unrelated content?

I would err on the side of caution and keep everything, file paths included. You can identify identical files and create symlinks to the originals for a poor man's deduplication system, but in reality your storage system should handle that for you. The worst-case scenario is trashing user data just to save space.
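
If you do go the symlink route, the core operation is small; this is a hypothetical helper, worth trialling on a scratch copy first:

import os

def link_duplicate(original, duplicate):
    # Replace `duplicate` with a symlink to `original`, once the two files
    # have been confirmed identical. Hypothetical helper -- test on copies.
    os.remove(duplicate)
    os.symlink(os.path.abspath(original), duplicate)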

Joel E Salas