
I have a directory of images, currently at ~117k files totalling about 200 GB. My backup solution vomits on directories of that size, so I wish to split them into subdirectories of 1,000 files each. Name sorting or type discrimination is not required; I just want my backups to not go nuts.

In another answer, someone provided a way to move files into the split-up configuration. However, that was a move, not a copy, and since this is a backup, I need a copy.
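
For context, this is roughly the copy-based split I have in mind (just a sketch; the paths and the chunk_ naming are placeholders, not settled names):

#!/bin/bash
# Rough sketch of a copy-based split: copy files from the big directory
# into numbered subdirectories of 1,000 files each. The paths and the
# "chunk_" prefix are placeholders.
src=/path/to/big/dir
dst=/path/to/backup/tree/root
count=0
dirnum=0
mkdir -p "$dst/chunk_$dirnum"
for f in "$src"/*; do
    [ -f "$f" ] || continue
    if [ "$count" -eq 1000 ]; then
        count=0
        dirnum=$((dirnum + 1))
        mkdir -p "$dst/chunk_$dirnum"
    fi
    cp -p "$f" "$dst/chunk_$dirnum/"
    count=$((count + 1))
done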

I have three thoughts:

1. Files are added to the large directory with random filenames, so alpha sorts aren't a practical way to figure out deltas. Even using a tool like rsync, adding a couple hundred files at the beginning of the list could cause a significant reshuffle and lots of file movement on the backup side.

2. The solution to this problem is to reverse the process: do an initial file split, add new files to the backup in the newest subdirectory, manually create a new subdirectory at the 1,000-file mark, and then use rsync to pull files from the backup directories into the work area, e.g. rsync -trvh <backupdir>/<subdir>/ <masterdir> (a rough sketch follows this list).

3. While some answers to similar questions suggest that rsync is a poor choice for this, I may need to do multiple passes, one of which would go over a slower link to an offsite location. The cost of rsync's startup scan is far preferable to the time it would take to re-upload the entire backup every day.
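
To make point 2 concrete, here is a rough sketch of the pull-and-rotate step (the paths, the chunk_42 example, and the chunk_ naming are all assumptions):

#!/bin/bash
# Sketch of point 2: pull the contents of one backup subdirectory back into
# the flat work area, and start a new subdirectory by hand once the newest
# one reaches 1,000 files. Paths and the "chunk_" naming are placeholders.
backup=/path/to/backup/tree/root
work=/path/to/work/dir

# Pull one subdirectory's contents into the working directory.
rsync -trvh "$backup/chunk_42/" "$work/"

# Find the highest-numbered subdirectory (GNU sort -V for numeric ordering).
newest=$(printf '%s\n' "$backup"/chunk_* | sort -V | tail -n 1)

# If it is full, create the next one.
if [ "$(find "$newest" -maxdepth 1 -type f | wc -l)" -ge 1000 ]; then
    mkdir -p "$backup/chunk_$(( ${newest##*_} + 1 ))"
fi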

My question is:

How do I create a script that will recurse into all 117+ subdirectories and dump the contained files into my large working directory, without a lot of unnecessary copying?

My initial research produces something like this:

#!/bin/bash
cd /path/to/backup/tree/root || exit 1
# Run one rsync per first-level subdirectory; the trailing slash on {}
# (which GNU find substitutes inside the argument) copies the directory's
# contents into the work dir rather than recreating the subdirectory there.
find . -mindepth 1 -maxdepth 1 -type d -exec rsync -trvh {}/ /path/to/work/dir/ \;

Am I on the right track here?

It's safe to assume modern versions of bash, find, and rsync.
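
For comparison, if the backup tree is only ever one level deep, the same copy could presumably be done in a single rsync run over a glob of the subdirectories (again just a sketch):

#!/bin/bash
# Sketch: one rsync invocation instead of one per subdirectory. Each
# trailing-slash source dumps that directory's contents straight into the
# destination, so the working directory stays flat.
cd /path/to/backup/tree/root || exit 1
rsync -trvh ./*/ /path/to/work/dir/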

Thanks!

RainbowW
    What _is_ that backup solution that can't handle directories with a lot of large files? I'd look for a new backup solution. – Ted Lyngmo Mar 31 '21 at 21:09
  • That isn't a constraint I can modify. – RainbowW Mar 31 '21 at 21:34
  • Does the backup solution have options? Can it be set to follow symlinks to copy the content rather than the link? – Ted Lyngmo Mar 31 '21 at 21:35
  • The only modification I can make on the backup solution is the arrangement of the files, and that will be sufficient to solve this issue, once I learn how to script it. – RainbowW Mar 31 '21 at 21:47
  • Would it be feasible to require a special tool for putting new files into the big directory? – Ted Lyngmo Mar 31 '21 at 21:55
  • Depends on the tool; I don't really see the point when the processing time for rsync is negligible, and it's already installed, pretty familiar to me, and free. – RainbowW Mar 31 '21 at 22:33
  • Can the directory to be backed up and the working directory be put on the same filesystem? Then you can use hard links to provide a "window" into the hierarchy which would be much faster than copying the files. Although I admit I'm not sure what using rsync is intended to accomplish...? – Vercingatorix Apr 01 '21 at 00:39
  • The arrangement currently is: 1) Working dir on a desktop machine, 2) local backup on a server available by smb or rsyncd over ssh, 3) off-site under my control backup by rsync/ssh, 4) off-site cloud solution specified by management. I can use the local backup as the source for 4, and I can script the xfer from local backup to 3. 3 can't be the source for 4 because 3 is not AMD64 and the provider doesn't support armhf. I'm trying to go 2->1 via rsync on a cron job to make the dump to a single directory easy. A simple looping script will work; I just don't know the syntax. That's the question. – RainbowW Apr 01 '21 at 04:09
  • @RainbowW I was thinking of a command that stores the file in your big directory and also in a hash-based directory structure, with a hard link between the two. The backup would use the hash-based directory structure and would be happy, while you as a user could use the big directory. – Ted Lyngmo Apr 01 '21 at 05:05

0 Answers