I am currently trying to read PDF files stored in my Google Cloud Storage bucket. So far I have figured out how to read one file at a time, but I want to loop through multiple files in the bucket without reading them one by one manually. How can I do this? I have attached my code below.

[images of code: part 1 and part 2]
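For reference, a minimal sketch of reading a single file, assuming the `google-cloud-storage` Python client (the bucket and object names below are placeholders, not the values from the screenshots):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('my-bucket')  # placeholder bucket name
blob = bucket.blob('folder/document.pdf')    # placeholder object path

pdf_bytes = blob.download_as_bytes()  # raw bytes of one PDF object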

    [please do not upload images of code. instead, post them as code blocks. thanks!](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question). Also, what have you tried? what's not working? – Michael Delgado Aug 11 '21 at 22:45

1 Answer


To iterate over all the files in your bucket, move your code for downloading and parsing inside the for loop. I also changed the loop to `for blob in blob_list[1:]:`, since GCS always lists the top folder as the first element and you do not want to parse that, as it will result in an error. The folder structure used for testing is "gs://my-bucket/output/file.json....file_n.json".

Output when looping through the whole folder (`for blob in blob_list:`):

Output files:
output/
output/copy_1_output-1-to-1.json
output/copy_2_output-1-to-1.json
output/output-1-to-1.json 

Output when skipping the first element (`for blob in blob_list[1:]:`):

Output files:
output/copy_1_output-1-to-1.json
output/copy_2_output-1-to-1.json
output/output-1-to-1.json 

Loop through the files, skipping the first element:

import json
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('my-bucket')  # your bucket name
prefix = 'output/'  # folder that holds the JSON output files

blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
# Skip the first element, which is the folder placeholder itself
for blob in blob_list[1:]:
    json_string = blob.download_as_string()
    response = json.loads(json_string)

    # Each file holds a Vision API-style response; take the first page's annotation
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    print('Full text:\n')
    print(annotation['text'])
    print('END OF FILE')
    print('##########################')

NOTE: If your folder structure differs from the one used for testing, just adjust the index in the for loop.
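If you would rather not depend on the listing order at all, a common alternative (a sketch, not part of the test above) is to skip folder placeholders by name instead of by index:

# Drop-in replacement for the loop above
for blob in blob_list:
    if blob.name.endswith('/'):  # folder placeholder objects end with '/'
        continue
    json_string = blob.download_as_string()
    response = json.loads(json_string)
    print(response['responses'][0]['fullTextAnnotation']['text'])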

Ricco D