
I have a Google App Engine app that triggers a Cloud Dataflow pipeline. This pipeline is supposed to write its final PCollection to Google BigQuery, but I can't find a way to install the right apache_beam.io dependency.

I'm running Apache Beam version 2.2.0 locally.

The project structure follows the code from this blog post.

This is the relevant piece of code:

"WriteToBigQuery" >> beam.io.WriteToBigQuery(
            ("%s:%s.%s" % (PROJECT, DATASET, TABLE)),
            schema=TABLE_SCHEMA,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
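
For completeness, this is roughly how that write fits into the pipeline (a minimal sketch, not my actual code; the stand-in input and the PROJECT/DATASET/TABLE/TABLE_SCHEMA values are placeholders):

    import apache_beam as beam

    # Placeholder values for illustration only
    PROJECT = "my-project"
    DATASET = "my_dataset"
    TABLE = "my_table"
    TABLE_SCHEMA = "name:STRING,value:INTEGER"

    p = beam.Pipeline()
    (p
     | "CreateInput" >> beam.Create([{"name": "a", "value": 1}])  # stand-in for the real source
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "%s:%s.%s" % (PROJECT, DATASET, TABLE),
           schema=TABLE_SCHEMA,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    p.run().wait_until_finish()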

When I run this code locally, beam.io.WriteToBigQuery() is called correctly: it is loaded from apache_beam/io/gcp/bigquery.py in my virtual environment.

But I can't get this dependency installed into the lib folder that is shipped with the app on deploy.

Even though my requirements file lists apache-beam[gcp]==2.2.0, when I run pip install -r requirements.txt -t lib, the apache_beam/io/gcp/bigquery.py downloaded into my lib folder does not contain the class WriteToBigQuery. As a result, I get the error 'module' object has no attribute 'WriteToBigQuery' when running the app on Google App Engine.

Does anyone have any idea how I can get the right bigquery.py?


1 Answer

It is not immediately obvious, but to run on App Engine, as mentioned in the blog post, it is necessary to create a setup.py (even if you already have a requirements.txt) and point to it via the --setup_file ./setup.py command-line option when running the pipeline.
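
For reference, a minimal setup.py for this approach might look like the sketch below (the package name and version are placeholders; the important part is declaring apache-beam[gcp] in install_requires so the Dataflow workers install the full GCP extras):

    import setuptools

    setuptools.setup(
        name="dataflow-pipeline",  # placeholder name
        version="0.0.1",           # placeholder version
        install_requires=[
            "apache-beam[gcp]==2.2.0",
        ],
        packages=setuptools.find_packages(),
    )

The pipeline is then launched with --setup_file ./setup.py among its arguments, so the workers build and install this package instead of relying on whatever was vendored into the App Engine lib folder.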
