
I'm trying to move 15 million files (4TB) from various buckets using gsutil mv. Unfortunately, the existing file names do not share any common prefixes; instead, they're all suffixed with our "filetype" identifiers.

During this transfer, I'm shooting to rename the files as well to prevent this mayhem in the future.

This is our current file format:

gs://{bucket}/{hash}-{filetype}.{extn}

This is the format I'm renaming them to:

gs://{bucket}/{filetype}/{hash}.{extn}

Current Solution:

Because the current format is not conducive to "prefix" selectors, I have to do the following:

const { spawn, spawnSync } = require('child_process');
const { createInterface } = require('readline');

// OLD_BUCKET, NEW_BUCKET, and TYPE are set per worker process.
let { stdout } = spawn('gsutil', ['ls', `${OLD_BUCKET}/*-${TYPE}.*`]);

createInterface({ input: stdout }).on('line', str => {
  if (!str.length) return;
  // gs://{bucket}/{hash}-{filetype}.{extn} -> pull out the hash and extension
  let [hash] = str.replace(`${OLD_BUCKET}/`, '').split('-');
  let extn = str.slice(str.lastIndexOf('.') + 1);
  let nxt = `${NEW_BUCKET}/${TYPE}/${hash}.${extn}`;
  spawnSync('gsutil', ['mv', str, nxt]);
});

Mildly redacted for brevity.

Oddly, gsutil ls is the only command that recognizes glob-based patterns. Taking advantage of this, I'm piping each line into my "format transformer" and then using gsutil mv to initiate the transfer.

This action is running on a 16-core machine, with each core performing the same task -- but ls'ing with a different filetype.
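
For reference, the fan-out is just one worker process per filetype. A rough sketch of the launcher (the worker.js name and the filetype values here are placeholders, not our real identifiers):

const { spawn } = require('child_process');

// Placeholder list; in reality there are 13 filetypes.
const FILETYPES = ['avatar', 'thumbnail', 'original'];

// One long-running worker per filetype, each doing the ls -> rename -> mv loop above.
for (const TYPE of FILETYPES) {
  spawn('node', ['worker.js'], {
    env: { ...process.env, TYPE },
    stdio: 'inherit'
  });
}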

The Problem

This is incredibly slow!

I've thrown more servers and more cores at it and I cannot break 26 files per minute per filetype. I've also tried adding the -m flag to gsutil mv with no difference -- because mv is invoked once per line, there's nothing for it to parallelize.

We have 13 filetypes; so 20,280 files are transferred per hour. Compare this to GCP's "Transfer" tool, which copied 5M files from BucketA to BackupA in less than an hour.

The Question

Is there any way to speed this up?

At the current rate, I'm looking at 15 days until the transfer(s) are complete, all while paying by the hour.

lukeed
  • Since your mv command is just waiting on GCS to do the work, I think you can run many more than 1 command per core. Hopefully you can parallelize more this way. – Hitobat May 24 '18 at 20:30
  • You would think so, yeah. But running 2 per core resulted in a `EAGAIN` error. That's why I spun up another 3 machines with 16 cores each. Even WITH all 4 servers running (4 `mv` per filetype), each individual bucket was limited to the ~26 files per minute. AKA there was 0 gain at all in parallelizing the load across _machines_. – lukeed May 24 '18 at 20:41
  • That's interesting. Thanks to your comment I went looking for artificial rate limits and found these docs. https://cloud.google.com/storage/docs/request-rate#ramp-up – Hitobat May 24 '18 at 20:46

1 Answer


I'm not very familiar with Rust, but from what I know from working with GCP and C++ (for instance management), the slowest part is waiting for Google's response to each operation (like gsutil ls or gsutil mv). This way, even on a 16-core machine, most of the threads will sit idle waiting on Google's response to their operation.

So you probably want to minimize the number of requests you make and let Google do the hard work. (You also pay for the resources Google uses to perform these operations, on top of the VM instance making the requests.)

Checking the gsutil mv documentation, it says the -m flag will make the operation multi-threaded, but that only helps if the same request includes multiple files to move.

The docs also mention that gsutil mv is an extension of gsutil cp: it copies the files to the target path and then deletes them from the old path (see the gsutil cp documentation), inheriting that command's options.

So my tip would be to batch the requests using name wildcards, and to check the gsutil cp docs for the flags you can also pass to mv.
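
As a rough sketch, one wildcard move per filetype could look like the following (the bucket names and the foobar filetype are placeholders, and note that this keeps the original object names, so it does not do the renaming by itself):

const { spawnSync } = require('child_process');

// Placeholder values following the question's layout.
const OLD_BUCKET = 'gs://old-bucket';
const NEW_BUCKET = 'gs://new-bucket';
const TYPE = 'foobar';

// One gsutil invocation per filetype; the top-level -m flag parallelizes
// the copy + delete operations behind the move.
spawnSync('gsutil', ['-m', 'mv', `${OLD_BUCKET}/*-${TYPE}.*`, `${NEW_BUCKET}/${TYPE}/`], {
  stdio: 'inherit'
});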

Sorry for not posting a fuller code answer; time is a bit short for me. Hope this gives you some direction :)

  • Thank you for your help! I don't think `mv` accepts multiple files at once, only a `from` and a `to`. This limitation would be fine, except that I can't perform the filename rewrite inline. I may _move_ all files over to the new bucket with the wildcard, but I'm still stuck renaming one at a time AFAICT. – lukeed May 24 '18 at 19:46
  • gsutil does accept multiple source files, but they all have to move to one destination directory (similar to the Unix mv syntax), and since you're changing the filename for each move, that gsutil mv support doesn't help you. I also am not familiar with Rust. Does createInterface parallelize on the input lines? If not, that's your problem. You could instead use gsutil ls to create the complete list of objects, then edit the list locally (using an editor or sed) to produce a sequence of gsutil mv commands. And then you could run N of those commands at a time. – Mike Schwartz May 24 '18 at 20:07
  • Yeah, I'm fine with the single destination. `mv *-foobar.jpg foobar/*.jpg` is the desired format, but again the problem is the renaming. Not sure where you both got Rust from, but this is actually Node.js -- `createInterface` comes from the [`readline`](https://nodejs.org/api/readline.html#readline_readline_createinterface_options) module. AKA, no it doesn't parallelize :P – lukeed May 24 '18 at 20:38
  • I believe @MikeSchwartz is correct, but if you really can't run multiple files per `mv` request, you could try to get the complete list, prepare the command line to move each file, and run each `mv` command in a separate thread. You could use a thread pool with a maximum thread count (you can create thousands of threads, since most of them will just be sleeping while they wait on Google's response anyway), and the threads would pick up new commands to run as they finish, until the command queue is empty (a rough sketch of this appears after these comments). – Leonardo Trocato May 24 '18 at 20:45
  • Update: I've written a new program using Rust to make use of a much larger thread pool. The old items are preemptively written to a file (before program starts), which is then piped into the prog line-by-line, transformed, and then a `gsutil mv` instance is invoked. Even with the much larger thread pool, I'm getting ~370 files per minute, which is only +40 over the Node.js average. The evidence is stacking against a Bucket bandwidth limit. – lukeed May 29 '18 at 05:38
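
To make the thread-pool suggestion from the comments concrete, here is a rough Node.js sketch. The objects.txt listing, the bucket names, and the concurrency value are all assumptions; the idea is simply to keep a fixed number of gsutil mv processes in flight and start another whenever one exits:

const { readFileSync } = require('fs');
const { spawn } = require('child_process');

// Assumption: objects.txt holds the full `gsutil ls` output, one object per line.
const OLD_BUCKET = 'gs://old-bucket';
const NEW_BUCKET = 'gs://new-bucket';
const CONCURRENCY = 200; // most slots just wait on GCS, so go well beyond core count

const lines = readFileSync('objects.txt', 'utf8').split('\n').filter(Boolean);
let idx = 0;

function next() {
  if (idx >= lines.length) return;
  const src = lines[idx++];
  // gs://{bucket}/{hash}-{filetype}.{extn} -> gs://{bucket}/{filetype}/{hash}.{extn}
  const name = src.slice(`${OLD_BUCKET}/`.length);
  const [hash, rest] = name.split(/-(.+)/);
  const filetype = rest.slice(0, rest.lastIndexOf('.'));
  const extn = rest.slice(rest.lastIndexOf('.') + 1);
  const dst = `${NEW_BUCKET}/${filetype}/${hash}.${extn}`;
  spawn('gsutil', ['mv', src, dst]).on('exit', next);
}

// Prime the pool; each finished move pulls the next object off the queue.
for (let i = 0; i < CONCURRENCY; i++) next();

Given the update above, a per-bucket rate limit may still cap overall throughput no matter how large the pool is, so the request-rate ramp-up guidance Hitobat linked is worth checking as well.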