
Having skimmed the Google Cloud Dataflow documentation, my impression is that worker VMs run a specific predefined Python 2.7 environment without any option to change it. Is it possible to provide a custom VM image for the workers (built with the libraries and external commands that the particular application needs)? Is it possible to run Python 3 on Gcloud Dataflow?

sandris
4 Answers


2021 Update

As of today, the answer to both of these questions is YES.

  1. Python 3 is supported on Dataflow.
  2. Custom container images are supported on Dataflow; see this SO answer and this docs page. A minimal sketch follows this list.
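
As a minimal sketch of the custom-container approach (the base image tag, packages, and registry path are illustrative placeholders, not from this answer), a custom image is typically derived from one of the official Apache Beam SDK images so the worker entrypoint stays intact:

    # Hypothetical Dockerfile: the base image tag and packages are placeholders.
    # Deriving FROM an official Beam SDK image keeps the /opt/apache/beam/boot
    # entrypoint that Dataflow workers expect.
    FROM apache/beam_python3.9_sdk:2.48.0

    # OS-level tools the pipeline shells out to.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends ffmpeg \
        && rm -rf /var/lib/apt/lists/*

    # Extra Python libraries baked into the worker environment.
    RUN pip install --no-cache-dir pandas

Build and push the image to a registry the job can read (e.g. Artifact Registry), then pass it at launch time as shown in the last answer below.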

Original answer:

Is it possible to provide a custom VM image for the workers (built with the libraries and external commands that the particular application needs)? Is it possible to run Python 3 on Gcloud Dataflow?

No to both questions. You can configure the Compute Engine machine type and disk size for a Dataflow job, but you cannot configure things like installed applications. At the time of this original answer, Apache Beam did not support Python 3.x (see the 2021 update above).

References:

  1. https://cloud.google.com/dataflow/pipelines/specifying-exec-params
  2. https://issues.apache.org/jira/browse/BEAM-1251
  3. https://beam.apache.org/get-started/quickstart-py/
Pablo
Andrew Nguonly
  • Just an update: I know it's a very old answer, but the current version of Apache Beam for Python supports Python 3. – Deepak Verma Jul 17 '19 at 17:59
  • Apologies for bumping an old thread. I understand that Google Dataflow currently supports custom containers, but does it support custom VM images for the workers' Compute Engine instances? I read through the documentation and couldn't find any mention of it. Thanks for responding. – Sathish May 18 '23 at 12:31
  • The custom container images only apply to the Docker containers running inside the worker VM instances. – Sathish May 18 '23 at 14:40

The status of Python 3 support in Apache Beam is tracked here: https://beam.apache.org/roadmap/python-sdk/#python-3-support

pjesa

You cannot provide a custom VM image for the workers, but you can provide a setup.py file to run custom commands and install libraries.

You can find more info about the setup.py file here: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
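
A minimal sketch of such a setup.py, assuming a hypothetical pipeline package (the package, library, and command names are placeholders, not from this answer):

    # setup.py, staged to workers via the --setup_file pipeline option.
    # Hypothetical example: package, library, and command names are placeholders.
    import subprocess

    import setuptools
    from setuptools.command.build_py import build_py


    class CustomBuild(build_py):
        """Runs extra commands (e.g. installing an OS package) during install."""

        def run(self):
            # Executed when pip builds and installs the package on each worker.
            subprocess.check_call(['apt-get', 'update'])
            subprocess.check_call(['apt-get', 'install', '-y', 'libsndfile1'])
            super().run()


    setuptools.setup(
        name='my_pipeline',
        version='0.0.1',
        packages=setuptools.find_packages(),
        install_requires=['numpy'],  # PyPI dependencies installed on workers
        cmdclass={'build_py': CustomBuild},
    )

Launching the job with --setup_file=./setup.py makes Dataflow stage the package and run this on every worker.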

Robbe

Custom containers are now supported on Dataflow.
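
For example (the script, project, bucket, and image names are placeholders), a job can be pointed at a custom image, such as one built from the Dockerfile sketched in the first answer, via the --sdk_container_image pipeline option; on older SDK versions the equivalent flag was --worker_harness_container_image, and --experiments=use_runner_v2 could also be required:

    # Hypothetical launch command: all resource names are placeholders.
    python my_pipeline.py \
      --runner DataflowRunner \
      --project my-project \
      --region us-central1 \
      --temp_location gs://my-bucket/tmp \
      --sdk_container_image gcr.io/my-project/my-beam-worker:latest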

robertwb