I'm trying to move 15 million files (4TB) from various buckets using gsutil mv. Unfortunately, the existing file names do not share any common prefixes; instead, they are all suffixed with our "filetype" identifiers.
During this transfer, I'm aiming to rename the files as well, to prevent this mayhem in the future.
This is our current file format:
gs://{bucket}/{hash}-{filetype}.{extn}
This is the format I'm renaming them to:
gs://{bucket}/{filetype}/{hash}.{extn}
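In other words, the per-object rename is pure string manipulation. Here's a minimal sketch, assuming hashes never contain "-" and filetypes never contain "." (the bucket, hash, and filetype in the example are made-up placeholders):

// Hypothetical helper: rewrites one old-style URL into the new layout
function rename(url) {
  const cut = url.lastIndexOf('/');
  const bucket = url.slice(0, cut);              // "gs://{bucket}"
  const name = url.slice(cut + 1);               // "{hash}-{filetype}.{extn}"
  const [hash, rest] = name.split(/-(.+)/);      // split on the first "-"
  const [filetype, extn] = rest.split(/\.(.+)/); // split on the first "."
  return `${bucket}/${filetype}/${hash}.${extn}`;
}
// rename('gs://my-bucket/d41d8cd9-image.png') === 'gs://my-bucket/image/d41d8cd9.png'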
Current Solution
Because the current format is not conducive to "prefix" selectors, I have to do the following:
const { spawn, spawnSync } = require('child_process');
const { createInterface } = require('readline');

// OLD_BUCKET, NEW_BUCKET, and TYPE are defined elsewhere (one TYPE per worker)
let { stdout } = spawn('gsutil', ['ls', `${OLD_BUCKET}/*-${TYPE}.*`]);

createInterface({ input: stdout }).on('line', str => {
  if (!str.length) return;
  let [hash] = str.replace(`${OLD_BUCKET}/`, '').split('-');
  let extn = str.split('.').pop(); // keep the original extension
  let nxt = `${NEW_BUCKET}/${TYPE}/${hash}.${extn}`;
  spawnSync('gsutil', ['mv', str, nxt]); // one gsutil process per object
});
Mildly redacted for brevity.
Oddly, gsutil ls is the only command that recognizes glob-based patterns. Taking advantage of this, I'm piping each line into my "format transformer" and then using gsutil mv to initiate the transfer.
This runs on a 16-core machine, with each core performing the same task but ls'ing with a different filetype.
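The fan-out itself is trivial; roughly like this (worker.js and the filetype names below are placeholders, not our real identifiers):

// Hypothetical driver: spawn one worker per filetype (13 workers on the 16-core box)
const { fork } = require('child_process');
const FILETYPES = ['image', 'video', 'audio' /* ...13 in total */];
for (const TYPE of FILETYPES) {
  fork('./worker.js', [TYPE]); // worker.js runs the ls/mv loop above for its TYPE
}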
The Problem
This is incredibly slow!
I've thrown more servers and more cores at it, and I cannot break 26 files per minute per filetype. I've also tried adding the -m flag to gsutil mv, with no difference, because mv is invoked once per line.
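Concretely, every listed object ends up in its own short-lived process, equivalent to running something like this once per file (object name invented for illustration):

gsutil mv gs://old-bucket/d41d8cd9-image.png gs://new-bucket/image/d41d8cd9.png

so -m never has more than a single object to parallelize within any one invocation.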
We have 13 filetypes, so 20,280 files are transferred per hour. Compare this to GCP's "Transfer" tool, which copied 5M files from BucketA to BackupA in less than an hour.
The Question
Is there any way to speed this up?
At the current rate, I'm looking at 15 days until the transfer(s) are complete, paying by the hour.