I have an archive of about 10 years' worth of files, organized as one large directory tree that contains multiple copies of smaller subtrees at various locations. The tree grew into this structure because of a lack of a consistent backup and filing strategy (all my own fault, basically).
I'm looking for a way to find identical copies of subtrees within the larger tree, so that I can delete the copies I don't need and move one step closer to cleaning this big mess up.
I thought I could write a script that builds a database of the files in the tree, and then a second script that finds identical subtrees and deletes the copy that is nested deepest.
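Roughly, this is the kind of first script I have in mind (a minimal, untested sketch in Python, assuming SQLite and SHA-256; the table layout and all names are just placeholders I made up):

```python
import hashlib
import os
import sqlite3

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so big files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_db(root, db_path="files.db"):
    """Walk the whole tree and record each file's path, size, and content hash."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS files
                   (path TEXT PRIMARY KEY, size INTEGER, sha256 TEXT)""")
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                            (full, os.path.getsize(full), file_sha256(full)))
            except OSError:
                pass  # unreadable file or broken symlink; skip it
    con.commit()
    con.close()

build_db("backups")
```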
However, I'm not sure how best to go about this, in terms of database design and what sort of algorithm to use to efficiently compare the trees and find identical copies.
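One idea I've been toying with, but am unsure about, is a Merkle-style fingerprint: hash each file's contents, then define a directory's fingerprint as the hash of its sorted (name, child-fingerprint) pairs, so two directories with the same fingerprint are identical subtrees. A sketch of what I mean (again untested, SHA-256 assumed):

```python
import hashlib
import os

def file_sha256(path):
    """Content hash of a single file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_fingerprint(root, index):
    """Fingerprint a directory as the hash of its sorted
    (name, child-fingerprint) pairs; identical fingerprints
    mean identical subtrees. `index` maps fingerprint -> paths."""
    h = hashlib.sha256()
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            child = tree_fingerprint(entry.path, index)
        elif entry.is_file(follow_symlinks=False):
            child = file_sha256(entry.path)
        else:
            continue  # symlinks, devices, etc.
        h.update(entry.name.encode())
        h.update(child.encode())
    digest = h.hexdigest()
    index.setdefault(digest, []).append(root)
    return digest

index = {}
tree_fingerprint("backups", index)
duplicates = {d: paths for d, paths in index.items() if len(paths) > 1}
```

Among a group of duplicate paths, the one with the most path separators would be the deepest-nested copy, i.e. the one I'd delete.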
To recap, this is what the tree looks like:
backups/folder1/
backups/somecomputer/vault/folder1
backups/othercomputer/folder1
...
There is no guarantee that the trees are "complete": it could be that the trees are similar, but that only one copy contains most of the files and subdirectories. So it's also about finding the most "complete" tree.
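For that partial-copy case, exact fingerprints won't match, so I imagine comparing the sets of file-content hashes under each candidate directory instead; something like this sketch (Jaccard similarity, which is just my own guess at a sensible measure):

```python
import hashlib
import os

def hash_set(root):
    """Set of content hashes of every file under root."""
    hashes = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                hashes.add(h.hexdigest())
            except OSError:
                pass  # skip unreadable files
    return hashes

def similarity(dir_a, dir_b):
    """Jaccard similarity of file contents under two directories:
    1.0 means identical content, 0.0 means nothing in common."""
    a, b = hash_set(dir_a), hash_set(dir_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```

If one directory's hash set turns out to be a superset of another's, I'd keep that one as the most "complete" copy and delete the rest, but I don't know if this scales.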
If anyone has other ideas on how to solve this problem, or how to efficiently clean up cluttered structures like this without going over every individual file, I'd be very grateful!
Thanks, B