
I'm working with large data files stored in Google Cloud. I'm using a Python script which first downloads a blob containing JSON lines, and then opens it to analyze the data line by line. This method is very slow, and I'd like to know if there is a faster way to do it. From the command line I can use `gsutil cat` to stream data to stdout; is there a similar way to do this in Python?

This is what I currently do to read the data:

from google.cloud import storage

myClient = storage.Client()
bucket = myClient.get_bucket(bucketname)
blob = storage.blob.Blob(blobname, bucket)
blob.download_to_filename("filename.txt")

file = open("filename.txt", "r")
data = file.readlines()

for line in data:
    # Do stuff

I want to read the blob line by line, without waiting for the download to finish.

Edit: I found this answer, but the function isn't clear to me. I don't know how to read the streaming lines.

alcor

2 Answers


In the answer you found, `stream` is a file-like object, so you should be able to use it instead of opening a specific filename. Something like this (untested):

import os
from google.cloud import storage

myClient = storage.Client()
bucket = myClient.get_bucket(bucketname)
blob = storage.blob.Blob(blobname, bucket)
stream = open('myStream','wb', os.O_NONBLOCK)
streaming = blob.download_to_file(stream)

for line in stream.readlines():
    # Do stuff
Dustin Ingram
  • I tried that, it gives me the error `for line in stream.readlines(): io.UnsupportedOperation: read`. How can I make the object readable? – alcor Oct 01 '19 at 13:36
  • Opening the file as `wb` makes it write-only, so `stream.readlines()` fails with the error @alcor mentioned. The solution is to open it with `a+b` (read+write mode, creating the file if it doesn't exist). – CoatedMoose Jul 24 '23 at 19:06
  • Additionally, this doesn't process the file in a stream-like way as written: `stream.readlines()` starts at the end of the file, so it doesn't yield any lines. – CoatedMoose Jul 24 '23 at 19:12
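
Putting the two comments together, a version of this approach that actually yields lines would open the scratch file in `a+b` mode and rewind it before reading. A minimal sketch (untested; note it still waits for the full download before iterating, unlike the `BlobReader` answer below):

from google.cloud import storage

myClient = storage.Client()
bucket = myClient.get_bucket(bucketname)
blob = storage.blob.Blob(blobname, bucket)

# a+b opens the file read+write and creates it if it doesn't exist
with open('myStream', 'a+b') as stream:
    blob.download_to_file(stream)
    stream.seek(0)  # the download left the file position at the end
    for line in stream:
        pass  # Do stuff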

Use the `BlobReader`.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(bucketname)
blob = bucket.blob(blobname)
reader = storage.fileio.BlobReader(blob)  # file-like wrapper; downloads chunks on demand

for line in reader:
    # Do stuff with each line (note: iterating a BlobReader yields bytes)
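
Newer versions of google-cloud-storage also expose this through `Blob.open`, which returns the same kind of file-like reader; text mode decodes the bytes for you, so the loop yields `str` lines. A sketch of the equivalent:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(bucketname)
blob = bucket.blob(blobname)

# "r" opens a text-mode reader; chunks are fetched lazily as the loop consumes them
with blob.open("r") as reader:
    for line in reader:
        pass  # Do stuff with each line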
CoatedMoose