I am new to GCP Dataflow.
I am trying to read text files (one-line JSON strings) from GCP Cloud Storage, parse them as JSON, split the records based on the value of a certain field, and write the results back to GCP Cloud Storage (as text files of JSON strings).
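To make the intended transformation concrete, here is a minimal plain-Python sketch of the splitting step only, with no Beam or Cloud Storage involved. It assumes one JSON object per line; the function name `split_by_field` and the field name `"type"` are hypothetical and used only for illustration.

```python
import json
from collections import defaultdict


def split_by_field(lines, field):
    """Group one-line JSON strings by the value of `field` (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records that lack the field end up under the key None.
        groups[record.get(field)].append(json.dumps(record))
    return dict(groups)


if __name__ == "__main__":
    sample = [
        '{"type": "a", "id": 1}',
        '{"type": "b", "id": 2}',
        '{"type": "a", "id": 3}',
    ]
    for value, rows in split_by_field(sample, "type").items():
        print(value, rows)
```

In the real pipeline each group would be written to its own output file rather than kept in memory.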
Here is my code
However, I get the following error on GCP Dataflow:
Traceback (most recent call last):
  File "main.py", line 169, in <module>
    run()
  File "main.py", line 163, in run
    shard_name_template='')
  File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\pipeline.py", line 426, in __exit__
    self.run().wait_until_finish()
  File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1346, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 773, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 287, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 410, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 474, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute '_JsonSink' on <module 'dataflow_worker.start' from '/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py'>
I am able to run this script locally, but it fails when I use the DataflowRunner.
Any suggestions would be appreciated.
PS. apache-beam version: 2.15.0
[Update1]
I tried @Yueyang Qiu's suggestion and added
pipeline_options.view_as(SetupOptions).save_main_session = True
The provided link says:
DoFn's in this workflow relies on global context (e.g., a module imported at module level)
This link supports the suggestion above.
However, the same error still occurs.
So I am wondering whether my implementation of _JsonSink (which inherits from filebasedsink.FileBasedSink) is wrong, or whether something else needs to be added.
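For reference, the same kind of AttributeError can be reproduced with the standard pickle module alone, which may help narrow down where the problem lies. This is only a sketch of the general mechanism (a class pickled by reference to a module that, at load time, does not define it), not a confirmed diagnosis of the Dataflow failure. The module name `fake_main` and the function `reproduce_missing_attribute` are hypothetical, and the `_JsonSink` class here is an empty stand-in, not my real sink.

```python
import pickle
import sys
import types


class _JsonSink:
    # Empty stand-in class. Pretend it was defined in a module named
    # "fake_main" (hypothetical) so pickle records that module name.
    __module__ = "fake_main"


def reproduce_missing_attribute():
    """Pickle an instance, then unpickle where the module lacks the class."""
    module = types.ModuleType("fake_main")
    module._JsonSink = _JsonSink
    sys.modules["fake_main"] = module
    try:
        # The payload stores only the names "fake_main" and "_JsonSink",
        # not the class body itself.
        payload = pickle.dumps(_JsonSink())
        # Simulate an environment where the module exists but does not
        # define the class.
        del module._JsonSink
        try:
            pickle.loads(payload)
        except AttributeError as exc:
            return str(exc)
        return None
    finally:
        sys.modules.pop("fake_main", None)


if __name__ == "__main__":
    print(reproduce_missing_attribute())
```

If this is what happens on the worker, the class name is being looked up in a module that does not contain it, which would explain why the script works locally.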
Any opinion would be appreciated, thank you all!