I have an archive of about 10 years' worth of files in one large directory tree, with multiple copies of smaller trees at various locations inside the larger tree. The tree grew into this structure because of a lack of a consistent backup and filing strategy (all my own fault, basically).

I'm looking for a way to find identical copies of trees in the larger tree, such that I can delete the copies I don't need, moving me one step closer to cleaning this big mess up.

I thought I could write a script that would build a database of files in the tree, such that I could then write another script that finds trees that are identical, deleting the tree copy that is nested deepest in the tree.

However, I'm not sure how to best go about this, in terms of database design and what sort of algorithm to use to efficiently compare these trees to find identical copies.

To recap, this is what the tree looks like:

backups/folder1/
backups/somecomputer/vault/folder1
backups/othercomputer/folder1
...

There is no guarantee that the trees are "complete": the copies may be similar, with only one of them containing most of the files and subdirectories. So it's about finding the most "complete" tree.

If anyone has any other ideas on how to solve this problem or to efficiently clean up cluttered structures like this without going over every individual file I'd be very grateful!

Thanks B

b20000

1 Answer

Maybe use the suffix tree data structure to find the longest common substrings of the paths, possibly tolerating differences, so that the length of the match serves as a similarity measure.

Create a new tree that mirrors the existing hierarchy, with one node in the new tree per file or directory of the hierarchical structure.

As you build the tree (likely recursively, for example using FileFilter and descending into each entry that is a directory):

For each node in the new tree, create its path from the root down to that node. Make that path a key into a Map, where the key is the path and the value is the node reference in your new tree.
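A minimal sketch of that step, written in Python rather than with Java's FileFilter. The `Node` class and `build_index` function are hypothetical names made up for illustration, and the map key is the path stored as a tuple of components, which is one possible choice among several.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node per file or directory in the mirrored tree (hypothetical type)."""
    name: str
    is_dir: bool
    children: list = field(default_factory=list)


def build_index(root):
    """Mirror the hierarchy under `root`; map each path (tuple of components) to its node."""
    index = {}

    def visit(abs_path, key):
        # Symlinked directories are treated as leaves so a link cycle cannot recurse forever.
        is_dir = os.path.isdir(abs_path) and not os.path.islink(abs_path)
        node = Node(os.path.basename(os.path.normpath(abs_path)), is_dir)
        index[key] = node
        if is_dir:
            # Sorted so the traversal order is the same on every run.
            for entry in sorted(os.listdir(abs_path)):
                child = visit(os.path.join(abs_path, entry), key + (entry,))
                node.children.append(child)
        return node

    visit(root, ())  # the root itself gets the empty tuple as its key
    return index
```

Calling `build_index("backups")` would give keys such as `("somecomputer", "vault", "folder1")`. Very deep trees could hit Python's recursion limit, in which case an explicit stack would be the safer variant.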

Then you can employ a suffix tree algorithm against the keySet of this map to find entries that share common suffixes - which are precisely entries that can be de-duped.
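As a rough illustration of the idea, and not a real suffix tree, the sketch below simply groups paths by their last few components. The function name `group_by_suffix` and the `depth` parameter are assumptions made for this example; paths are again tuples of components.

```python
from collections import defaultdict


def group_by_suffix(paths, depth=1):
    """Group paths (tuples of components) that share their last `depth` components."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    groups = defaultdict(list)
    for path in paths:
        if len(path) >= depth:
            groups[path[-depth:]].append(path)
    # Keep only suffixes found at more than one location; shallowest path first.
    return {
        suffix: sorted(found, key=len)
        for suffix, found in groups.items()
        if len(found) > 1
    }
```

For the three example paths in the question, `group_by_suffix` with the default `depth=1` puts all of them under the suffix `("folder1",)`, with `backups/folder1` listed first because it is the shallowest. A shared suffix only marks candidates: the names match, but nothing here checks that the contents do.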

That takes care of identical subtrees. The suffix tree also permits identifying "misses", i.e. cases where one or more links in the path differ between two paths.
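One simple way to quantify such misses between two candidate copies, sketched here with plain set operations instead of a suffix tree, is to compare the relative paths found under each root. The names `relative_entries` and `compare_copies` are hypothetical, and the Jaccard ratio used as the similarity score is my own choice of measure.

```python
import os


def relative_entries(root):
    """Return the set of file and directory paths below `root`, relative to `root`."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.add(os.path.relpath(os.path.join(dirpath, name), root))
    return entries


def compare_copies(root_a, root_b):
    """Report which entries each copy lacks and how similar the two copies are."""
    a, b = relative_entries(root_a), relative_entries(root_b)
    union = a | b
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        # Jaccard similarity: shared entries divided by all entries seen in either copy.
        "similarity": len(a & b) / len(union) if union else 1.0,
    }
```

The copy with an empty "only in the other" list is the more complete one by name. This compares names only, so file contents (for example via a hash) would still need checking before anything is deleted.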

WestCoastProjects