Here is how I normally download a GCS file to local:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
blob.download_to_filename('myBigFile.txt')
The files I am working with are much, much larger than the memory and local disk available to a Cloud Function (for example, several GBs to several TBs), so the above would not work for these large files.
Is there a simpler, "streaming" (see example 1 below) or "direct-access" (see example 2 below) way to work with GCS files in a Cloud Function?
Two examples of what I'd be looking to do would be (my closest attempts with the actual client library follow after them):
# 1. Load it in chunks of 5GB -- "Streaming"
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
while True:
    data = blob.download_to_filename('myBigFile.txt', chunk_size=5 * 1024**3)  # pseudocode: not the real signature
    do_something(data)
    if not data: break
Or:
# 2. Read the data from GCS without downloading it locally -- "Direct Access"
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
with blob.read_filename('myBigFile.txt') as f:  # pseudocode: not a real method
    do_something(f)
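For example 1, the closest thing I could find in the client library is the start/end byte-range arguments that recent versions of download_as_bytes accept, so something like the sketch below might work, with chunks far smaller than 5GB so they fit in the function's memory (do_something is just a placeholder for the real processing):

# Chunked ("streaming") reads via byte ranges -- a rough sketch, not tested at TB scale
from google.cloud import storage

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, small enough to hold in memory

storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.get_blob('myBigFile.txt')  # get_blob() loads metadata, so blob.size is populated

offset = 0
while offset < blob.size:
    end = min(offset + CHUNK_SIZE, blob.size) - 1  # 'end' is an inclusive byte offset
    data = blob.download_as_bytes(start=offset, end=end)
    do_something(data)  # placeholder for the real processing
    offset = end + 1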
I'm not sure if either of these is possible, but I'm listing a few options of how this could work. It seems like the streaming option is supported, but I wasn't sure how to apply it to the above case.
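The closest match to example 2 that I've come across is blob.open(), which recent versions of google-cloud-storage provide; it returns a file-like object that fetches bytes from GCS as you read, so nothing has to land on local disk. I'm not sure whether this is the intended approach inside a Cloud Function, but a sketch of what I have in mind is:

# "Direct access" via a file-like reader -- a sketch, assuming a recent
# google-cloud-storage release that includes Blob.open()
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')

with blob.open('rb') as f:  # file-like object backed by GCS
    for chunk in iter(lambda: f.read(64 * 1024 * 1024), b''):
        do_something(chunk)  # placeholder for the real processing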