
I am new to GCP Dataflow.

I am trying to read text files (each line is a one-line JSON string) from GCP Cloud Storage into JSON format, then split the records based on the values of a certain field and output each split back to GCP Cloud Storage (as JSON-string text files).

Here is my code; the relevant part is a custom _JsonSink that inherits from filebasedsink.FileBasedSink.
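
In outline, it looks roughly like this (the bucket paths, the split field 'label', and its values are placeholders rather than the real ones):

import json

import apache_beam as beam
from apache_beam.coders import coders
from apache_beam.io import filebasedsink
from apache_beam.options.pipeline_options import PipelineOptions


class _JsonSink(filebasedsink.FileBasedSink):
    """A sink that writes one JSON object per line."""

    def __init__(self, file_path_prefix, file_name_suffix='',
                 num_shards=0, shard_name_template=None):
        # Note: the zero-argument super() call is Python 3 only syntax.
        super().__init__(
            file_path_prefix,
            file_name_suffix=file_name_suffix,
            num_shards=num_shards,
            shard_name_template=shard_name_template,
            coder=coders.StrUtf8Coder(),
            mime_type='text/plain')

    def write_record(self, file_handle, value):
        # Serialize each element back to a one-line JSON string.
        file_handle.write(json.dumps(value).encode('utf-8'))
        file_handle.write(b'\n')


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        records = (
            p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
            | 'Parse' >> beam.Map(json.loads))

        # One single-file output per value of the split field.
        for value in ['a', 'b']:
            (records
             | 'Filter %s' % value >> beam.Filter(
                 lambda row, v=value: row.get('label') == v)
             | 'Write %s' % value >> beam.io.Write(_JsonSink(
                 'gs://my-bucket/output/%s' % value,
                 file_name_suffix='.json',
                 shard_name_template='')))


if __name__ == '__main__':
    run()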

However, I encounter the following error on GCP Dataflow:

Traceback (most recent call last):
  File "main.py", line 169, in <module>
    run()
  File "main.py", line 163, in run
    shard_name_template='')
  File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\pipeline.py", line 426, in __exit__
    self.run().wait_until_finish()
  File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1346, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 773, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 287, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 410, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 474, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute '_JsonSink' on <module 'dataflow_worker.start' from '/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py'>

I am able to run this script locally, but it fails when I try to use DataflowRunner.

Please give me some suggestions.

P.S. apache-beam version: 2.15.0

[Update 1]

I tried @Yueyang Qiu's suggestion and added

pipeline_options.view_as(SetupOptions).save_main_session = True
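
In context, the flag is set on the options object before the pipeline is constructed (a sketch; pipeline_args stands for the arguments already being passed):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions(pipeline_args)  # pipeline_args: the existing flags
pipeline_options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=pipeline_options) as p:
    ...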

The provided link says:

one or more DoFn's in this workflow rely on global context (e.g., a module imported at module level)

This link supports the suggestion above.

However, the same error occurred.

So I am wondering whether my implementation of _JsonSink (which inherits from filebasedsink.FileBasedSink) is wrong, or whether something else needs to be added.

Any opinion would be appreciated, thank you all!

han shih

4 Answers


You have encountered a known issue: currently (as of the 2.17.0 release), Beam does not support super() calls in the main module on Python 3. Please take a look at possible solutions in BEAM-6158. Udi's answer is a good way to address this until BEAM-6158 is resolved; this way you don't have to run your pipeline on Python 2.
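
For illustration (a sketch; the class and argument names are arbitrary), the failing and working forms look like this:

from apache_beam.io import filebasedsink

# Python 3 only zero-argument form: fails to unpickle on the Dataflow
# worker when the class is defined in the main module (BEAM-6158).
class _JsonSink(filebasedsink.FileBasedSink):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

# Explicit two-argument form: works on both Python 2 and Python 3.
class _JsonSink(filebasedsink.FileBasedSink):
    def __init__(self, *args, **kwargs):
        super(_JsonSink, self).__init__(*args, **kwargs)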

Valentyn

Using the guidelines from here, I managed to get your example to run.

Directory structure:

./setup.py
./dataflow_json
./dataflow_json/dataflow_json.py  (no change from your example)
./dataflow_json/__init__.py  (empty file)
./main.py

setup.py:

import setuptools

setuptools.setup(
  name='dataflow_json',
  version='1.0',
  install_requires=[],
  packages=setuptools.find_packages(),
)

main.py:

from __future__ import absolute_import

from dataflow_json import dataflow_json

if __name__ == '__main__':
    dataflow_json.run()

and you run the pipeline with python main.py.

Basically what's happening is that the '--setup_file=./setup.py' flag tells Beam to create a package and install it on the Dataflow remote worker. The __init__.py file is required for setuptools to identify the dataflow_json/ directory as a package.
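
For reference, a sketch of how the flag fits in with the usual Dataflow options (the project, region, and bucket names are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                 # placeholder
    '--region=us-central1',                 # placeholder
    '--temp_location=gs://my-bucket/temp',  # placeholder
    '--setup_file=./setup.py',
])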

Udi Meiri

I finally found out the problem:

the class _JsonSink that I implemented uses some features from Python 3.

However, I was not aware of which Python version DataflowRunner uses. (Actually, I have not figured out how to specify the Python version for the Dataflow runner on GCP. Any suggestions?)

Hence, I rewrote my code in a Python 2 compatible way, and everything works fine!

Thanks for all of you!

han shih
• Hi! The migration of the Dataflow runner to Python 3 is currently taking place. You can see the updates on the supported features here: https://jira.apache.org/jira/browse/BEAM-1251?subTaskView=unresolved – Albert Albesa Nov 19 '19 at 10:49

Can you try setting the option save_main_session = True, as in this example: https://github.com/apache/beam/blob/a2b0ad14f1525d1a645cb26f5b8ec45692d9d54e/sdks/python/apache_beam/examples/cookbook/coders.py#L88

Yueyang Qiu
• Thanks for your suggestion. I tried it; the error mentioned above disappeared, but another error shows up: "ModuleNotFoundError: No module named 'libs'", where 'libs' is one of my local modules. Does it mean that I cannot use local modules with DataflowRunner? (Everything is fine locally.) – han shih Oct 22 '19 at 07:44
  • Check if there is an indentation problem and follow this doc: https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors – MonicaPC Oct 30 '19 at 02:02
• @MonicaPC I had read that article before, but I did not find a good solution for my case. Hence, I tried to simplify my problem: it is now a single file with a self-implemented class for data output, but it still cannot be recognized by DataflowRunner. – han shih Oct 31 '19 at 01:33