
I'm trying to write to a GCS bucket via Beam (and TF Transform), but I keep getting the following error:

ValueError: Unable to get the Filesystem for path [...]

The answer here and some other sources suggest that I need to pip install apache-beam[gcp] to get a different variant of Apache Beam that works with GCP.

So, I tried changing the setup.py of my training package as:

REQUIRED_PACKAGES = ['apache_beam[gcp]==2.14.0', 'tensorflow-ranking', 'tensorflow_transform==0.14.0']

which didn't help. I also tried adding the following to the beginning of my code:

subprocess.check_call('pip uninstall apache-beam'.split())
subprocess.check_call('pip install apache-beam[gcp]'.split())

which didn't work either.

The logs of the failed GCP job are here. The traceback and the error message appear on row 276.

I should mention that running the same code using Beam's DirectRunner and writing the outputs to local disk runs fine. But I'm now trying to switch to DataflowRunner.

Thanks.

Milad Shahidi

1 Answer


It turns out that you need to uninstall google-cloud-dataflow in addition to installing apache-beam with the gcp option. I guess this happens because google-cloud-dataflow is installed on GCP instances by default. Not sure if the same would be true on other platforms like AWS. But anyway, here are the commands I used:

pip uninstall -y google-cloud-dataflow
pip install apache-beam[gcp]
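If the same two commands have to run from inside Python (for example at the top of a script), a minimal sketch could look like the following. The helper names `pip_commands` and `reinstall_beam_with_gcp` are hypothetical, not part of Beam. The sketch calls pip through `sys.executable -m pip` so that the packages are changed for the interpreter running the code, and it passes `-y` because `pip uninstall` otherwise asks for confirmation.

```python
import subprocess
import sys


def pip_commands():
    """Build the two pip invocations as argument lists (hypothetical helper)."""
    # Use the running interpreter's pip rather than whatever "pip" is on PATH.
    pip = [sys.executable, "-m", "pip"]
    return [
        pip + ["uninstall", "-y", "google-cloud-dataflow"],
        pip + ["install", "apache-beam[gcp]"],
    ]


def reinstall_beam_with_gcp():
    """Run the commands in order; check_call raises if pip exits non-zero."""
    for cmd in pip_commands():
        subprocess.check_call(cmd)
```

Calling `reinstall_beam_with_gcp()` before the first `import apache_beam` is probably the safest ordering, since a module that is already imported is not replaced by a later install.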

I noticed this in the very first cell of [this notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/10_recommend/wals_tft.ipynb).
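To see whether google-cloud-dataflow is actually present in a given environment, a quick check along these lines may help. It is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`), the function names are hypothetical, and it reports which distributions are installed, not whether the gcp extras of apache-beam were pulled in.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(dist_names=("google-cloud-dataflow", "apache-beam")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in dist_names}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

If google-cloud-dataflow shows a version here, that would match the situation described above, where it has to be uninstalled first.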

Milad Shahidi