
I am trying to open files stored in a Google Cloud Storage bucket from a Google Colab notebook running on a TPU runtime. However, I always get the error:

FileNotFoundError: [Errno 2] No such file or directory: 'gs://vocab_jb/merges.txt'

My question is very simple: how do I make a Google Cloud Storage bucket readable from Google Colab? I have tried everything:

  1. Making the bucket public using IAM
  2. Assigning a special e-mail address to the owner
  3. Making the file public through the ACL options
  4. Following several different tutorials
  5. Referring to the bucket each time through either "gs://bucket" or "https://..."

But none of these options worked reliably. What confuses me even more is that making the bucket public did work for a limited time. I have also read this post, but the answers didn't help. Also, I am not concerned about restricting read or write permissions.

I am initializing my TPU in the following way:

import os 

use_tpu = True #@param {type:"boolean"}
bucket = 'vocab_jb'

if use_tpu:
    assert 'COLAB_TPU_ADDR' in os.environ, 'Missing TPU; did you request a TPU in Notebook Settings?'

from google.colab import auth
auth.authenticate_user()
%tensorflow_version 2.x
import tensorflow as tf
print("Tensorflow version " + tf.__version__)

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
  raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
with open("gs://vocab_jb/merges.txt", 'rb') as f:
  a = f.read()

FileNotFoundError: [Errno 2] No such file or directory: 'gs://vocab_jb/merges.txt'
Joachim
  • If you made the object publicly readable in your bucket, I don't see the part of the code where you actually download the file. Use a module such as requests or urllib to download the file (e.g. see this [post](https://stackoverflow.com/questions/49576657/is-there-anyway-i-can-download-the-file-in-google-colaboratory)), and only once the file is downloaded try to open it. Additionally, I strongly advise you to remove your bucket name and any other PII from the post, as it could lead to privacy issues. – Daniel Ocando Jan 26 '21 at 21:41
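To illustrate the download-first approach from the comment above, here is a minimal sketch using only the standard library. It assumes the object really is publicly readable and that it is served at https://storage.googleapis.com/BUCKET/OBJECT; the helper names public_gcs_url and download_public_object are made up for this example.

```python
import urllib.parse
import urllib.request


def public_gcs_url(gs_uri: str) -> str:
    """Turn gs://bucket/object into the HTTPS URL of a public object."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    bucket, _, object_name = gs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"expected gs://bucket/object, got {gs_uri!r}")
    # Percent-encode the object name; "/" is left as-is so nested paths survive.
    return f"https://storage.googleapis.com/{bucket}/{urllib.parse.quote(object_name)}"


def download_public_object(gs_uri: str, local_path: str) -> str:
    """Download a publicly readable object to local_path and return that path."""
    urllib.request.urlretrieve(public_gcs_url(gs_uri), local_path)
    return local_path


# Usage in a notebook cell (needs network access and a public object):
# path = download_public_object("gs://vocab_jb/merges.txt", "merges.txt")
# with open(path, "rb") as f:
#     a = f.read()
```

Once the file is on the Colab VM's local disk, the built-in open() works on it as usual. This only helps for public objects; a private bucket needs an authenticated client.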

2 Answers

1

You cannot open a file on GCS with Python's built-in open(), which only sees the local filesystem. It would work if you mounted the GCS bucket into your filesystem, for example through FUSE, so that the files become available to the OS. To keep things simple, though, you should use a GCS client library: import it with import cloudstorage as gcs (the App Engine client library linked below) and then use gcs_file = gcs.open(filename)

For more examples see Google Documentation for GCS https://cloud.google.com/storage/docs/downloading-objects#code-samples or example for app engine https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage
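As a rough sketch of what the first linked page describes, the google-cloud-storage package can fetch an object as bytes. This is not the cloudstorage module mentioned above but the client used in the general GCS code samples. The helper names split_gs_uri and read_gcs_bytes are made up here, and "my-project" is a placeholder project ID.

```python
def split_gs_uri(gs_uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    bucket, _, object_name = gs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"expected gs://bucket/object, got {gs_uri!r}")
    return bucket, object_name


def read_gcs_bytes(gs_uri, project=None):
    """Return the full contents of a GCS object as bytes."""
    # Imported lazily so split_gs_uri stays usable without the package installed.
    from google.cloud import storage

    bucket_name, object_name = split_gs_uri(gs_uri)
    client = storage.Client(project=project)
    blob = client.bucket(bucket_name).blob(object_name)
    return blob.download_as_bytes()


# Usage in Colab, after auth.authenticate_user() has run
# ("my-project" is a placeholder, not a value from the question):
# a = read_gcs_bytes("gs://vocab_jb/merges.txt", project="my-project")
```

The returned bytes play the same role as f.read() in the question's snippet, so the rest of the code should not need to change.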

I hope this solves your problem.

1

I found an article that uses the gcsfs library to read from a Cloud Storage bucket in Colab. I looked up GCSFS: the library is in beta and is not an official Google library. Its description reads:

GCSFS is a pythonic file-system interface to Google Cloud Storage. This software is beta, use at your own risk.

Just make sure to install the library in Colab first (the leading "!" runs the command as a shell command from a notebook cell):

!pip install gcsfs

Below is the implementation in your code:

import os 
import gcsfs
import google.auth
from google.colab import auth
auth.authenticate_user()

credentials, project_id = google.auth.default()
fs = gcsfs.GCSFileSystem(project=project_id, token=credentials)

use_tpu = True #@param {type:"boolean"}
bucket = 'vocab_jb'

if use_tpu:
    assert 'COLAB_TPU_ADDR' in os.environ, 'Missing TPU; did you request a TPU in Notebook Settings?'

%tensorflow_version 2.x
import tensorflow as tf
print("Tensorflow version " + tf.__version__)

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
  raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# The with-block closes the remote file once reading is done.
with fs.open("gs://your-bucket-here/kinglear_on_roids.txt") as reader:
  for text in reader:
    print(text)

Here is a snippet of the output when reading my sample file: [image: output screenshot]

Ricco D
  • It seems to work! I can't thank you enough; I would never have found this library by myself. If I may ask, how do you save data to a file using this library? It is the next step in my code. I am aware this question is out of scope here and will post a separate question if necessary. But again, thanks a lot, this really helps me. – Joachim Jan 27 '21 at 07:56
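On the follow-up about saving data: gcsfs file objects can also be opened for writing, so a sketch along the following lines may be enough. The helper name write_lines and the object name output.txt are made up for illustration, and fs is assumed to be the GCSFileSystem created in the answer above.

```python
def write_lines(fs, path, lines):
    """Write an iterable of strings to path, one per line; return the byte count."""
    payload = "".join(f"{line}\n" for line in lines).encode("utf-8")
    # With a gcsfs.GCSFileSystem the upload is finalized when the file is
    # closed, which the with-block does on exit.
    with fs.open(path, "wb") as f:
        f.write(payload)
    return len(payload)


# Usage with the filesystem object from the answer
# ("output.txt" is a hypothetical object name):
# write_lines(fs, "gs://your-bucket-here/output.txt", ["first line", "second line"])
```

The function only relies on fs.open(path, "wb"), so the same code can be tried against a local path before pointing it at the bucket. Writing needs credentials with write access to the bucket, which a public-read setting alone does not grant.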