
My Python code for the Dataflow job is below:

import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

topic1 = "topic1"
conf = {'bootstrap.servers': 'gcp_instance_public_ip:9092'}

pipeline = beam.Pipeline(options=PipelineOptions())

# ReadFromKafka is a cross-language transform backed by the Java KafkaIO.
(pipeline
        | ReadFromKafka(consumer_config=conf, topics=[topic1])
)
pipeline.run()

As I am using KafkaIO in Python code, someone suggested that I use Dataflow Runner v2 (I think v1 doesn't support KafkaIO from Python).

As per the Dataflow documentation, I am using this parameter to use Runner v2: --experiments=use_runner_v2 (I have not made any change at the code level for switching from v1 to v2).
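For illustration, here is a minimal sketch of one way the options can be passed from Python (the project, region and bucket names are placeholders, not my real values):

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values -- substitute your own.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--experiments=use_runner_v2',
])

I am getting the error below: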

http_response, method_config=method_config, request=request)
    apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/metal-voyaasfger-23424/locations/us-central1/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Wed, 08 Jul 2020 07:23:21 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '544', '-content-encoding': 'gzip'}>, content <{
      "error": {
        "code": 400,
        "message": "(5fd1bf4d41e8b7e): The workflow could not be created. Causes: (5fd1bf4d41e8018): The workflow could not be created due to misconfiguration. If you are trying any experimental feature, make sure your project and the specified region support that feature. Contact Google Cloud Support for further help. Experiments enabled for project: [enable_streaming_engine, enable_windmill_service, shuffle_mode=service], experiments requested for job: [use_runner_v2]",
        "status": "INVALID_ARGUMENT"
      }
    }

I have already set the service account credentials (the account has project owner permission) using the export GOOGLE_APPLICATION_CREDENTIALS command. Can someone help me find where my mistake is? Am I using Runner v2 incorrectly?
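For completeness, the same credentials setup can also be done from Python before the job is submitted; a minimal sketch with a hypothetical key path:

import os

# Hypothetical key path; equivalent to the export command mentioned above.
# This must be set before the pipeline is constructed and submitted.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'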

I would be really thankful if someone could briefly explain the difference between Runner v1 and Runner v2.

Thanks ... :)

Joseph N
  • Yes, these transforms require Dataflow Runner v2. Is it possible that you are not running in one of the supported regions mentioned here? https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2 – chamikara Jul 09 '20 at 05:55
  • Yes, I am running it on us-central1 (which is supported) and on a free-trial GCP account. – Joseph N Jul 09 '20 at 06:36
  • Are you using the SDK v2.21 or above? https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2 – Peter Kim Jul 10 '20 at 17:52
  • You also need to specify the `--enable_streaming_engine` flag, if you haven't already – Peter Kim Jul 10 '20 at 17:55
  • @Peter Kim, I have installed the apache_beam package and am running the Dataflow job from the terminal. What do you mean by SDK here? – Joseph N Jul 10 '20 at 17:56
  • What is your package version? – Peter Kim Jul 10 '20 at 17:57
  • It must be the latest version, as I installed it using the latest pip3 (a quick version check is sketched after these comments). – Joseph N Jul 10 '20 at 17:59
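For reference, a quick way to check the installed Beam SDK version the comments ask about:

import apache_beam

# Runner v2 requires Beam SDK 2.21.0 or newer (per the docs linked above).
print(apache_beam.__version__)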

1 Answer


I was able to reproduce your issue. The error message is complaining that the use_runner_v2 experiment could not be applied because Runner v2 is not enabled for batch jobs.

Experiments enabled for project: [enable_streaming_engine, enable_windmill_service, shuffle_mode=service], experiments requested for job: [use_runner_v2]",

Please try running your job with the --streaming flag added.
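In Python, the flag can be passed on the command line or set on the pipeline options; a minimal sketch (only the streaming and experiment settings here come from this thread):

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(['--experiments=use_runner_v2'])
# Mark the job as streaming so the use_runner_v2 experiment is accepted.
options.view_as(StandardOptions).streaming = True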

Peter Kim
  • What if one is not running a streaming job? Basically, my case is that I want to play with the number of harness threads on the machines, but it won't let me do so unless I specify runner_v2, and runner_v2 is not for batch jobs? – SpiXel Aug 10 '20 at 11:30
  • @SpiXel Until it becomes officially supported, your project needs to be allowlisted. – Peter Kim Aug 10 '20 at 13:51