
please help!

[+] What I have: A lot of blobs in every bucket. Blobs can vary in size from less than a kilobyte to many gigabytes.

[+] What I'm trying to do: I need to be able to either stream the data in those blobs (with a buffer of, say, 1024 bytes) or read them in chunks of a certain size in Python; the access pattern I mean is sketched just after this list. The point is that I don't think I can just do a bucket.get_blob(), because if the blob were a terabyte I wouldn't be able to hold it in physical memory.

[+] What I'm really trying to do: parse the information inside the blobs to identify keywords

[+] What I've read: A lot of documentation on how to write to Google Cloud in chunks and then use compose to stitch the pieces together (not helpful at all)

A lot of documentation on Java's pre-fetch functions (it needs to be Python)

The Google Cloud APIs
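To make it concrete, this is roughly the access pattern I'm after, sketched against an ordinary local file (the function name, chunk size and keyword are placeholders, and a keyword split across a chunk boundary would need overlapping reads to be caught):

def scan_for_keyword(path, keyword=b'keyword', chunk_size=1024):
    # Read a fixed-size chunk at a time so the whole object never sits in memory.
    # Note: a keyword split across a chunk boundary is missed by this naive scan.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if keyword in chunk:
                return True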

If anyone could point me in the right direction I would be really grateful! Thanks

1 Answer


So a way I have found of doing this is to create a file-like object in Python and then use the Google Cloud API call .download_to_file() with that file-like object.

This in essence streams the data. The Python code looks something like this:

import os

def getStream(blob):
    # O_NONBLOCK lets the file be read while the download is still writing to it.
    fd = os.open('myStream', os.O_WRONLY | os.O_CREAT | os.O_NONBLOCK)
    stream = os.fdopen(fd, 'wb')
    blob.download_to_file(stream)

The os.O_NONBLOCK flag is there so I can read from the file while the download is still writing to it. I still haven't tested this with really big files, so if anyone knows a better implementation or sees a potential failure with this, please comment. Thanks!
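A possible alternative I haven't tried, sketched below on the assumption that the installed google-cloud-storage version accepts start and end byte offsets in download_to_file() (the readInChunks name and the 1 MiB chunk size are just placeholders), is to pull the blob down one byte range at a time instead of going through a local file:

import io

def readInChunks(blob, chunk_size=1024 * 1024):
    # Fetch metadata so blob.size is populated.
    blob.reload()
    for start in range(0, blob.size, chunk_size):
        buf = io.BytesIO()
        # end is an inclusive byte offset, so subtract 1 from the exclusive bound.
        blob.download_to_file(buf, start=start, end=min(start + chunk_size, blob.size) - 1)
        yield buf.getvalue()

Each yielded chunk can then be scanned for keywords without the rest of the blob ever touching memory or disk.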

  • I think you found the right method. Check [the code](https://github.com/GoogleCloudPlatform/google-cloud-python/blob/d4d0abcab27b3f1aba9b56db4d643d5692230bd5/storage/google/cloud/storage/blob.py#L293) for further insight. – Rubén C. Jul 12 '18 at 08:17