
I have a fairly large dataset (~160TB) that needs to be delivered to a client every so often. This dataset consists of fairly large files, usually between 2GB and 20GB each. They exist on a BeeGFS filesystem running on a RAID cluster with a total capacity of 1.1PB. Currently, when it's time to deliver the data, it is done the following way:

  1. Create a main index of the files and their sizes
  2. Tally up file sizes until 4TB, and make a sub-index of those files from the main index
  3. Copy the files over to a 4TB USB drive
  4. Repeat steps 2 and 3 until the entire dataset has been copied
  5. Give a cardboard box of USB drives to the client

What I would like to do is to just rsync this over to a mounted filesystem, so I was wondering if there is a filesystem available that can spread its storage space over multiple disks? The obvious candidates are LVM and RAID, but the problem is that the client needs to be able to read each disk on its own, which rules those out (as far as I know, at least). Is there a way of emulating LVM or something similar that still allows individual disks to be read in a fairly standard way? In effect, allowing me to run a single rsync operation that spreads the data over multiple individual disks/filesystems.

The data comes from a Red Hat machine, so I've simply used ext4 on the USB drives so far. However, if possible, it would be very beneficial (although not strictly necessary) for everyone if I could use a filesystem that plays nicely with Windows 10.

PS: I have no limitations when it comes to the number of USB drives attached at the same time. The only real constraint I have is that the data must be accessible one disk/filesystem at a time.

Jarmund

1 Answer

  1. Create the full list of files and their sizes, something like:

    find /path -type f -printf "%s %p\n" > all_files.txt
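
A quick sanity check at this point is to sum the first column of the index and confirm it matches the expected ~160TB:

```shell
# Total up the size field (first column) of the index file.
awk '{sum += $1} END {print sum}' all_files.txt
```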

  2. Run an awk script that splits all_files.txt into parts, based on the total size of each part (MAXSIZE here is a placeholder for the maximum size in bytes):

    BEGIN {total=0; part=0}
            {total += $1
            if (total > MAXSIZE) {part++; total=$1}
            $1=""; print substr($0,2) >> ("partial-" part)}
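
The split can also be run as a single command, with MAXSIZE passed via `-v` (here 4TB as 4*10^12 bytes, an assumption that leaves a little headroom on a "4TB" drive) and the size of the file that triggers the rollover carried into the new part:

```shell
# Split all_files.txt into part lists, each capped at MAXSIZE bytes.
# Output files are named partial-0, partial-1, ...
awk -v MAXSIZE=4000000000000 '
BEGIN {total=0; part=0}
{
    total += $1
    if (total > MAXSIZE) {part++; total=$1}
    $1=""; print substr($0, 2) >> ("partial-" part)
}' all_files.txt
```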
  3. You can now mount all the disks at different mount points (something like /mnt/partial-0, /mnt/partial-1, ...), using whichever filesystem you want on each one.

  4. Within a loop, rsync with --files-from=FILE to the right mount point. Something along these lines:

    for f in partial*
    do
        rsync -a --files-from="$f" / "/mnt/$f/"
    done
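
Before plugging the drives in, it can be worth double-checking that each part really fits on a 4TB disk. A sketch, assuming GNU coreutils (`du --files0-from` and `-b` for apparent size in bytes are GNU options):

```shell
# Sum the apparent size of every file listed in each part file.
# The part files contain one absolute path per line.
for f in partial*
do
    printf '%s: ' "$f"
    tr '\n' '\0' < "$f" | du --files0-from=- -cb | tail -n1
done
```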
Eduardo Trápani
  • That's kind of what I am doing now with a bit of perl, but I am looking for a filesystem-oriented solution to allow rsync to just mirror/update the dataset on the USB, as doing filesystem-wide operations on them as a whole before disconnecting the mirror is also useful. – Jarmund Apr 14 '20 at 21:43
  • Ok, if it has to be a filesystem then you are probably looking for something like [mhddfs](https://packages.debian.org/buster/mhddfs). – Eduardo Trápani Apr 14 '20 at 22:40
  • mhddfs did the trick! – Jarmund Jun 12 '20 at 09:19
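
For anyone landing here later, an mhddfs setup along the lines of the accepted comment looks roughly like this (the mount points and the mlimit threshold below are assumptions, not details from the thread):

```shell
# Pool several already-formatted USB drives into one FUSE mount.
# mhddfs writes each file to the first member with enough free space,
# so every disk stays an ordinary, individually readable filesystem.
# /mnt/usb1../mnt/usb3 and mlimit=20G are illustrative values.
mhddfs /mnt/usb1,/mnt/usb2,/mnt/usb3 /mnt/pool -o mlimit=20G

# A single rsync can then target the pool:
# rsync -a /data/ /mnt/pool/
```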