Right now I am storing ~300 GB of images in Google Cloud Storage (GCS). I have software that runs on a Google Compute Engine (GCE) virtual machine (VM) and needs to read all of these images and process them sequentially. The images do not need to be loaded into memory and can be streamed as input to the program. I'm having a lot of trouble finding an efficient way to do this.
I have tried:
1) gcsfuse. With gcsfuse I can mount the GCS bucket on my VM and access the data directly. This seemed ideal at first, but the I/O is prohibitively slow.
2) gsutil. This lets me stream data into my program with "gsutil cp "gs://my-gcs-bucket/training_data/*.jpg" - | ..." (quoting the wildcard so the shell doesn't expand it locally). This works much better than gcsfuse but is still quite slow.
I guess I have two main questions:

1) What is the fastest way to access data stored in a GCS bucket and stream it as input to a script on a GCE VM? I only need to do this once a day right now, but the demand could increase over time.

2) If there is no quick and clever way to do this, what alternatives do I have in terms of storage? Should I be using a different Google Cloud product? I want to avoid having to load all of the data directly onto the VM.
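For reference, here is a minimal sketch of the kind of streaming loop I have in mind, using the official google-cloud-storage Python client instead of shelling out to gsutil. The bucket and prefix names are placeholders, and process_image is a stand-in for my real per-image processing:

```python
import io

def process_image(stream):
    # Stand-in for the real per-image processing; here it just
    # counts the bytes in the stream.
    data = stream.read()
    return len(data)

def run(bucket_name, prefix):
    # Imported here so process_image stays usable without the package.
    # Requires the google-cloud-storage package and VM credentials.
    from google.cloud import storage

    client = storage.Client()
    total = 0
    # list_blobs pages through objects lazily, so the full listing
    # never has to be held in memory at once.
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if not blob.name.endswith(".jpg"):
            continue
        # blob.open("rb") returns a file-like object that downloads
        # the object in chunks as it is read, so nothing touches disk.
        with blob.open("rb") as f:
            total += process_image(f)
    return total

# Example invocation (names are placeholders):
# run("my-gcs-bucket", "training_data/")
```

This avoids the subprocess and pipe overhead of gsutil, but each object is still fetched one at a time, so I'm not sure it would actually beat the gsutil pipeline without adding some parallelism.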
Thanks!