
I’m trying a very simple pipeline on Dataflow using a custom worker_harness_container_image (and experiment=beam_fn_api):

main.py:

import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
import logging


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    p = beam.Pipeline(options=pipeline_options)

    (
        p
        | "Read from BigQuery" >> beam.io.Read(beam.io.BigQuerySource(query="SELECT 1", use_standard_sql=True))
        | "ParDo" >> beam.ParDo(Dummy())
    )

    p.run().wait_until_finish()


class Dummy(beam.DoFn):
    def process(self, element):
        pass


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()

The Dockerfile is just:

FROM apachebeam/python3.7_sdk

Launched like so:

python3.7 -m main \
--runner DataflowRunner \
--project project_id \
--temp_location gs://bucket/tmp/ \
--region europe-west1 \
--zone europe-north1-c \
--worker_harness_container_image eu.gcr.io/project_id/image:latest \
--experiment=beam_fn_api

This is failing with

Caused by: org.apache.beam.runners.dataflow.util.Structs$ParameterNotFoundException: 
didn't find required parameter serialized_source in {@type=BigQueryAvroSource, 
bigquery_export_schema={value={"fields":[{"mode":"NULLABLE","name":"f0_","type":"INTEGER"}]}, @type=http://schema.org/Text}, 
filename={value=gs://bucket/000000000000.avro, @type=http://schema.org/Text}}

Note that reading the temporary Avro file produced by the BigQuery export job with AvroIO works just fine, i.e.:

    (
        p
        | "Read from Avro" >> beam.io.Read(beam.io.avroio.ReadFromAvro("gs://bucket/000000000000.avro"))
        | "ParDo" >> beam.ParDo(Dummy())
    )
salient
  • Can you provide some more information about your environment? I would like to reproduce the error – rmesteves Jan 21 '20 at 13:47
  • Does your environment have something special to make your query (query="SELECT 1") run? It seems that it's missing the "FROM" clause – rmesteves Jan 21 '20 at 13:51
  • @rmesteves — I updated the question to provide more details on the reproducer. The `SELECT 1` is just to create a minimal reproducer (it fails in similar ways with "real" queries). – salient Jan 21 '20 at 13:56
  • @rmesteves are you managing to reproduce this same issue? I have been trying without success. – Javier Bóbeda Jan 23 '20 at 15:57

2 Answers


As per what I'm reading in this thread, the Docker containers used for the Dataflow workers are currently private and can't be modified or customized. That's precisely what worker_harness_container_image does: it selects a different container. Also, I can't reproduce your issue and I can't find documentation for this approach, so it seems to be unsupported. My advice would be to run your pipeline without worker_harness_container_image and see if it works as expected.
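
For reference, here is a minimal sketch of what that could look like with the options set programmatically (using the same placeholder project, bucket and region from the question). The point is just that neither worker_harness_container_image nor the beam_fn_api experiment is passed, so Dataflow falls back to its default worker container:

from apache_beam.options.pipeline_options import PipelineOptions

# Same placeholder values as the question's launch command, but without the
# custom worker image or the beam_fn_api experiment, so the default Dataflow
# worker container is used.
pipeline_options = PipelineOptions(
    runner="DataflowRunner",
    project="project_id",
    temp_location="gs://bucket/tmp/",
    region="europe-west1",
)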

Javier Bóbeda
  • If this solution doesn't work, I would suggest opening a ticket with [Google Platform Cloud Support](https://cloud.google.com/support/docs/), as it seems this problem will need to be inspected with their internal tools, which will help gather much more information about what is happening. – Javier Bóbeda Feb 03 '20 at 10:07

Which version of the Beam SDK are you using?

When using version:

apache-beam[gcp]==2.22.0

The following method seems to be newer, and it works correctly here when using custom worker images:

from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery

ReadFromBigQuery(
    table=config.TABLE,
    dataset=config.DATASET,
    project=config.BQ_INPUT_PROJECT,
)
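
For a query like the one in the question, a minimal sketch could look like this (assuming the pipeline_options and Dummy DoFn from the question; the gcs_location bucket is just a placeholder for where the export files should land):

import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Read from BigQuery" >> ReadFromBigQuery(
            query="SELECT 1",
            use_standard_sql=True,
            gcs_location="gs://bucket/tmp/",  # placeholder bucket for the export files
        )
        | "ParDo" >> beam.ParDo(Dummy())
    )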
TDehaene