
I'm processing over 200,000 NetCDF files, each 17 MB. They are all in a Google Cloud Storage bucket, and I am trying to find a way to increase throughput using gcsfuse.

I am using a Google Compute Engine virtual machine with gcsfuse to access the files. I looked into gsutil, but the Google Cloud documentation states that with gcsfuse, "individual I/O streams run approximately as fast as gsutil." At the current gcsfuse throughput, the NCL script will take over 8 days, which is too long. Any suggestions on how to improve the throughput? Thank you.

yombob

1 Answer


To optimize transfer performance, consider the following:

  1. Locate your Cloud Storage bucket and your Compute Engine VM instance in the same region.
  2. Increase your VM instance's network bandwidth by creating the instance with more vCPU cores, since network egress bandwidth scales with the vCPU count (see the first sketch after this list).
  3. Increase persistent disk throughput, for example by using SSD persistent disks.
  4. Use gsutil cp -r with the -m option to run copies in parallel; you can also set the number of threads used to copy files via the parallel_thread_count setting (see the second sketch after this list).
  5. Check the Google Cloud documentation on scripting data transfers.
  6. While using gcsfuse, check that you have version 0.27.0, which is optimized for parallel transfers.
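
For points 2 and 3, a minimal sketch of creating a higher-bandwidth VM with an SSD persistent disk. The instance name, zone, machine type, and disk size are placeholders to adapt, not values from the original answer:

```
# More vCPUs raise the network egress cap; pd-ssd raises disk throughput.
gcloud compute instances create nc-transfer-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-16 \
    --boot-disk-type=pd-ssd \
    --boot-disk-size=500GB
```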
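
For point 4, a sketch of a parallel copy with gsutil; the bucket name, destination path, and process/thread counts are placeholders to tune for your VM:

```
# -m runs the copy with multiple processes and threads; -o overrides the
# .boto settings that control how many of each are used.
gsutil -m \
    -o "GSUtil:parallel_process_count=8" \
    -o "GSUtil:parallel_thread_count=16" \
    cp -r gs://my-netcdf-bucket/data /mnt/disks/scratch/
```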
Pawel Czuczwara
  • Thank you! This is very helpful. I have recently found that creating several mount points with gcsfuse helps a lot. – yombob Jul 18 '19 at 13:55
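
A sketch of the multiple-mount-point approach mentioned in the comment, assuming a bucket named my-netcdf-bucket (a placeholder); each worker reads through its own mount so the FUSE processes serve requests in parallel:

```
# Mount the same bucket at several mount points, one per worker.
BUCKET=my-netcdf-bucket
for i in 1 2 3 4; do
    mkdir -p /mnt/nc$i
    gcsfuse --implicit-dirs "$BUCKET" /mnt/nc$i
done
# Then split the file list and point one NCL job at each mount, e.g.:
#   ncl 'dir="/mnt/nc1"' process.ncl &   (and similarly for nc2..nc4)
```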