
I'm processing over 200,000 NetCDF files, each 17 MB. They are all in a Google Cloud Storage bucket, and I am trying to find a way to increase throughput using gcsfuse.

I am using a Google Compute Engine virtual machine with gcsfuse to access the files. I looked into gsutil, but the Google Cloud documentation states that with gcsfuse, "individual I/O streams run approximately as fast as gsutil." At the current gcsfuse throughput, the NCL script will take over 8 days, which is too long. Any suggestions on how to improve the throughput? Thank you.

yombob

1 Answer


To optimize transfer performance, consider the following:

  1. Locate your Cloud Storage bucket and your Compute Engine VM instance in the same region.
  2. Increase your VM instance's network bandwidth by creating the instance with more vCPU cores, since network egress bandwidth scales with the vCPU count (see the first sketch after this list).
  3. Increase persistent disk throughput, for example by using SSD persistent disks.
  4. Use gsutil cp -r with the -m option to run copies in parallel; you can also set the number of threads used to copy files via the parallel_thread_count setting (see the second sketch after this list).
  5. Check the Google Cloud documentation on scripting data transfers.
  6. While using gcsfuse, check that you have version 0.27.0, which is optimized for parallel transfers.
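
For points 2 and 3, a minimal sketch of creating a higher-bandwidth VM with an SSD persistent disk. The instance name, zone, machine type, and disk size are placeholders to adapt, not values from the original answer:

```
# More vCPUs raise the network egress cap; pd-ssd raises disk throughput.
gcloud compute instances create nc-transfer-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-16 \
    --boot-disk-type=pd-ssd \
    --boot-disk-size=500GB
```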
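
For point 4, a sketch of a parallel copy with gsutil; the bucket name, destination path, and process/thread counts are placeholders to tune for your VM:

```
# -m runs the copy with multiple processes and threads; -o overrides the
# .boto settings that control how many of each are used.
gsutil -m \
    -o "GSUtil:parallel_process_count=8" \
    -o "GSUtil:parallel_thread_count=16" \
    cp -r gs://my-netcdf-bucket/data /mnt/disks/scratch/
```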
Pawel Czuczwara
  • Thank you! This is very helpful. I have recently found that creating several mount points with gcsfuse helps a lot. – yombob Jul 18 '19 at 13:55
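
A sketch of the multiple-mount-point approach mentioned in the comment, assuming a bucket named my-netcdf-bucket (a placeholder); each worker reads through its own mount so the FUSE processes serve requests in parallel:

```
# Mount the same bucket at several mount points, one per worker.
BUCKET=my-netcdf-bucket
for i in 1 2 3 4; do
    mkdir -p /mnt/nc$i
    gcsfuse --implicit-dirs "$BUCKET" /mnt/nc$i
done
# Then split the file list and point one NCL job at each mount, e.g.:
#   ncl 'dir="/mnt/nc1"' process.ncl &   (and similarly for nc2..nc4)
```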