
I am new to GCP Dataflow.

I am trying to read text files (each line is a one-line JSON string) from GCP Cloud Storage into JSON format, then split the records based on the values of a certain field and output each split back to GCP Cloud Storage (as JSON-string text files).

Here is my code; the relevant part is a custom _JsonSink that inherits from filebasedsink.FileBasedSink.
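
In outline, it looks roughly like this (the bucket paths, the split field 'label', and its values are placeholders rather than the real ones):

import json

import apache_beam as beam
from apache_beam.coders import coders
from apache_beam.io import filebasedsink
from apache_beam.options.pipeline_options import PipelineOptions


class _JsonSink(filebasedsink.FileBasedSink):
    """A sink that writes one JSON object per line."""

    def __init__(self, file_path_prefix, file_name_suffix='',
                 num_shards=0, shard_name_template=None):
        # Note: the zero-argument super() call is Python 3 only syntax.
        super().__init__(
            file_path_prefix,
            file_name_suffix=file_name_suffix,
            num_shards=num_shards,
            shard_name_template=shard_name_template,
            coder=coders.StrUtf8Coder(),
            mime_type='text/plain')

    def write_record(self, file_handle, value):
        # Serialize each element back to a one-line JSON string.
        file_handle.write(json.dumps(value).encode('utf-8'))
        file_handle.write(b'\n')


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        records = (
            p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
            | 'Parse' >> beam.Map(json.loads))

        # One single-file output per value of the split field.
        for value in ['a', 'b']:
            (records
             | 'Filter %s' % value >> beam.Filter(
                 lambda row, v=value: row.get('label') == v)
             | 'Write %s' % value >> beam.io.Write(_JsonSink(
                 'gs://my-bucket/output/%s' % value,
                 file_name_suffix='.json',
                 shard_name_template='')))


if __name__ == '__main__':
    run()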

However, I encounter the following error on GCP Dataflow:

Traceback (most recent call last):
  File "main.py", line 169, in <module>
    run()
  File "main.py", line 163, in run
    shard_name_template='')
  File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\pipeline.py", line 426, in __exit__
    self.run().wait_until_finish()
  File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1346, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 773, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 287, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 410, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 474, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute '_JsonSink' on <module 'dataflow_worker.start' from '/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py'>

I am able to run this script locally, but it fails when I try to use DataflowRunner.

Please give me some suggestions.

P.S. apache-beam version: 2.15.0

[Update 1]

I tried @Yueyang Qiu's suggestion and added

pipeline_options.view_as(SetupOptions).save_main_session = True
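
In context, the flag is set on the options object before the pipeline is constructed (a sketch; pipeline_args stands for the arguments already being passed):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions(pipeline_args)  # pipeline_args: the existing flags
pipeline_options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=pipeline_options) as p:
    ...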

The provided link says:

one or more DoFn's in this workflow rely on global context (e.g., a module imported at module level)

This link supports the suggestion above.

However, the same error occurred.

So I am wondering whether my implementation of _JsonSink (which inherits from filebasedsink.FileBasedSink) is wrong, or whether something else needs to be added.

Any opinion would be appreciated, thank you all!

han shih

4 Answers


You have encountered a known issue: currently (as of the 2.17.0 release), Beam does not support super() calls in the main module on Python 3. Please take a look at possible solutions in BEAM-6158. Udi's answer is a good way to address this until BEAM-6158 is resolved; this way you don't have to run your pipeline on Python 2.
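
For illustration (a sketch; the class and argument names are arbitrary), the failing and working forms look like this:

from apache_beam.io import filebasedsink

# Python 3 only zero-argument form: fails to unpickle on the Dataflow
# worker when the class is defined in the main module (BEAM-6158).
class _JsonSink(filebasedsink.FileBasedSink):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

# Explicit two-argument form: works on both Python 2 and Python 3.
class _JsonSink(filebasedsink.FileBasedSink):
    def __init__(self, *args, **kwargs):
        super(_JsonSink, self).__init__(*args, **kwargs)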

Valentyn

Using the guidelines from here, I managed to get your example to run.

Directory structure:

./setup.py
./dataflow_json
./dataflow_json/dataflow_json.py  (no change from your example)
./dataflow_json/__init__.py  (empty file)
./main.py

setup.py:

import setuptools

setuptools.setup(
  name='dataflow_json',
  version='1.0',
  install_requires=[],
  packages=setuptools.find_packages(),
)

main.py:

from __future__ import absolute_import

from dataflow_json import dataflow_json

if __name__ == '__main__':
    dataflow_json.run()

and you run the pipeline with python main.py.

Basically what's happening is that the '--setup_file=./setup.py' flag tells Beam to create a package and install it on the Dataflow remote worker. The __init__.py file is required for setuptools to identify the dataflow_json/ directory as a package.
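
For reference, a sketch of how the flag fits in with the usual Dataflow options (the project, region, and bucket names are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                 # placeholder
    '--region=us-central1',                 # placeholder
    '--temp_location=gs://my-bucket/temp',  # placeholder
    '--setup_file=./setup.py',
])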

Udi Meiri

I finally found out the problem:

the class _JsonSink that I implemented uses some features from Python 3.

However, I was not aware of which Python version DataflowRunner uses. (Actually, I have not figured out how to specify the Python version for the Dataflow runner on GCP. Any suggestions?)

Hence, I rewrote my code in a Python 2 compatible way, and everything works fine!

Thanks for all of you!

han shih
• Hi! The migration of the Dataflow runner to Python 3 is currently taking place. You can see the updates on the supported features here: https://jira.apache.org/jira/browse/BEAM-1251?subTaskView=unresolved – Albert Albesa Nov 19 '19 at 10:49

Can you try setting the option save_main_session = True, as in this example: https://github.com/apache/beam/blob/a2b0ad14f1525d1a645cb26f5b8ec45692d9d54e/sdks/python/apache_beam/examples/cookbook/coders.py#L88

Yueyang Qiu
• Thanks for your suggestion. I tried it; the error mentioned above disappeared, but another error shows up: "ModuleNotFoundError: No module named 'libs'", where 'libs' is one of my local modules. Does it mean that I cannot use local modules with DataflowRunner? (Everything is fine locally.) – han shih Oct 22 '19 at 07:44
  • Check if there is an indentation problem and follow this doc: https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors – MonicaPC Oct 30 '19 at 02:02
• @MonicaPC I had read that article before, but I did not find a good solution for my case. Hence, I tried to simplify my problem: it is now a single file with a self-implemented class for data output, but it still cannot be recognized by DataflowRunner. – han shih Oct 31 '19 at 01:33