
Right now I am storing ~300 GB of images in Google Cloud Storage (GCS). I have software that runs on a Google Compute Engine (GCE) virtual machine (VM) and needs to read all of these images and process them sequentially. The images do not need to be loaded into memory and can be streamed as input to the program. I'm having a lot of trouble finding an efficient way to do this.

I have tried:

1) GCSfuse. With GCSfuse I can mount the GCS bucket on my VM and access the data directly. This seemed ideal at first, but the I/O is prohibitively slow.

2) gsutil. This allows me to stream data into my program using "gsutil cp gs://my-gcs-bucket/training_data/*.jpg - |" (full pipeline sketched after this list). This works much better than GCSfuse but is still quite slow.
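
For reference, the full pipeline I'm running looks roughly like this, where process_images is a placeholder for my actual program:

    # stream every matching object to stdout and pipe the bytes into the program;
    # quoting the wildcard keeps the local shell from trying to expand it
    gsutil cp "gs://my-gcs-bucket/training_data/*.jpg" - | process_images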

I guess I have two main questions:

1) What is the fastest way to access data stored in a GCS bucket and stream it as input to a script on a GCE VM? I will need to do this once a day, and the demand could increase over time.

2) If there is no quick and clever way to do this, what alternatives do I have in terms of storage? Should I be using a different Google Cloud product? I want to avoid having to load all of the data directly onto the VM.

Thanks!

Tyler Garvin

1 Answer


gsutil should be the fastest way to fetch items from Google Cloud Storage. GCS can generally deliver high throughput, but with a long latency to the first byte of each object.
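
One knob worth knowing about: gsutil's top-level -m flag runs multi-file operations in parallel, which helps hide that per-object first-byte latency. A rough sketch, assuming you can spare some local scratch space (paths and the consumer program are placeholders):

    # -m parallelizes the per-object copies; throughput goes up because
    # many first-byte waits overlap instead of happening one at a time
    gsutil -m cp "gs://my-gcs-bucket/training_data/*.jpg" /mnt/scratch/images/

    # then feed the files to the program from fast local disk
    cat /mnt/scratch/images/*.jpg | process_images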

If you have a large number of small files (JPEG training data probably falls into that category), you might want to tar/zip them up into a larger archive, as sketched below.
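
A sketch of what that could look like, with a one-time packing step and a streaming daily job; the archive path and the process_images consumer are placeholders:

    # one-time (or whenever the data changes): pack the small objects
    # into a single large archive and put it back in GCS
    gsutil -m cp "gs://my-gcs-bucket/training_data/*.jpg" ./staging/
    tar -cf training_data.tar -C ./staging .
    gsutil cp training_data.tar gs://my-gcs-bucket/archives/

    # daily job: stream the one large object and unpack on the fly;
    # tar's -O flag writes the extracted file contents to stdout
    gsutil cp gs://my-gcs-bucket/archives/training_data.tar - | tar -xOf - | process_images

Fetching one big object avoids paying the first-byte latency once per image, which is usually where pipelines over many small files lose their time.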

If that doesn't work for you, and all of your files are less than 1 MB, you could use Google Cloud Datastore, which is more expensive but offers much lower latency.