I am using Google Cloud to train a neural network on the cloud like in the following example:

https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow

To start, I set the following two environment variables:

PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET_NAME=${PROJECT_ID}-mlengine

I then uploaded my training and evaluation data, two CSV files named train_set.csv and eval_set.csv, to Google Cloud Storage with the following command:

gsutil cp -r data gs://$BUCKET_NAME

I then verified that these two CSV files were in the polar-terminal-160506-mlengine/data directory on Google Cloud Storage.

I then set the following environment variables:

# Assign appropriate values.
PROJECT=$(gcloud config list project --format "value(core.project)")
JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
GCS_PATH="${BUCKET}/${USER}/${JOB_ID}"
DICT_FILE=gs://cloud-ml-data/img/flower_photos/dict.txt

I then tried to preprocess my evaluation data like so:

# Preprocess the eval set.
python trainer/preprocess.py \
  --input_dict "$DICT_FILE" \
  --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
  --output_path "${GCS_PATH}/preproc/eval" \
  --cloud

Sadly, this runs for a bit and then crashes with the following error:

ValueError: Unable to get the Filesystem for path gs://polar-terminal-160506-mlengine/data/eval_set.csv

This doesn't seem possible, as I have confirmed in the Google Cloud Storage console that eval_set.csv is stored at this location. Is this perhaps a permissions issue, or something I am not seeing?

Edit:

I have traced this runtime error to a specific line in trainer/preprocess.py:

read_input_source = beam.io.ReadFromText(
      opt.input_path, strip_trailing_newlines=True)

This seems like a pretty good clue, but I am still not sure what is going on. When I google "beam.io.ReadFromText ValueError: Unable to get the Filesystem for path", nothing relevant appears, which is a bit odd. Thoughts?

sometimesiwritecode

3 Answers

It looks like your apache-beam installation might be incomplete.

Try running pip install apache-beam[gcp]

This installs the extra dependencies that allow Apache Beam to access files stored on Google Cloud Storage.

Apache Beam package available here
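
To confirm whether this is the cause before reinstalling anything, a quick check along these lines may help. It is only a sketch: the function name check_gcs_support, the status strings, and the example-bucket path are made up for illustration, and it assumes that Beam raises a ValueError from FileSystems.get_filesystem when no handler is registered for the gs:// scheme, which matches the error reported in the question.

```python
# Hypothetical helper: reports whether the installed apache-beam can handle gs:// paths.
NOT_INSTALLED = "apache-beam is not installed"
NO_GCS = "apache-beam is installed without GCS support"
GCS_OK = "GCS filesystem is registered"


def check_gcs_support(path="gs://example-bucket/eval_set.csv"):
    """Return a status string; the path is a placeholder and is never opened."""
    try:
        from apache_beam.io.filesystems import FileSystems
    except ImportError:
        return NOT_INSTALLED
    try:
        # Only resolves the filesystem class for the URI scheme; no data is read.
        FileSystems.get_filesystem(path)
    except ValueError:
        # Assumed to be the same failure as in the question: no gs:// handler.
        return NO_GCS
    return GCS_OK


if __name__ == "__main__":
    print(check_gcs_support())
```

If this reports missing GCS support, reinstalling with the gcp extra should fix it. In shells that treat square brackets as glob patterns (zsh, for example), the package spec needs quoting: pip install 'apache-beam[gcp]'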

  • hi, trying to solve the same problem here. I couldn't find this library in PyPi: `no matches found: apache-beam[gcp]` – Lucas Shen Jan 24 '18 at 16:02
  • @LucasShen It appears it is available on Pypi [here](https://pypi.python.org/pypi/apache-beam). Perhaps your python version is not compatible with the package? – Jean-Christophe Rodrigue Mar 08 '18 at 18:42

Just as Jean-Christophe described, I believe your installation is incomplete.

The base apache-beam package doesn't include everything needed to read from and write to GCP. To get that, along with the runner for deploying your pipeline to Cloud Dataflow (the DataflowRunner), install the google-cloud-dataflow package via pip:

pip install google-cloud-dataflow

This is how I was able to resolve the same issue.

adityajones

Try pip install apache_beam[gcp]. This will help you.

New_Coder