
I'm using the apache-beam Python 2.12.0 SDK. I'm having issues when using the wildcard character `*` with `beam.io.ReadFromText`:

gs://mybucket/learning/pack_operation/20190524_1_0_/extracted-*.json

This is the input to my Beam job:

with beam.Pipeline(args.runner, pipeline_options) as pipeline:
    outputs = (
        pipeline
        | 'ReadFromFile' >> beam.io.ReadFromText(options['input_filebase'])
        | 'DecodeLine' >> beam.Map(Utils.decode_input(ids))
        | 'Batch' >> beam.ParDo(BatchDoFn(options['batch_size']))
        | 'Predict' >> beam.ParDo(PredictDoFn(model_file, fields))
        | 'Unbatch' >> beam.ParDo(UnBatchDoFn())
        | 'FormatOutput' >> beam.Map(Utils.format_output)
    )
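
For context, the trailing `*` is a plain glob. As a sanity check, the same pattern semantics can be reproduced locally with the Python stdlib (illustrative only; for `gs://` paths the actual matching is done by Beam's GCS filesystem, which is where my job fails):

```python
import glob
import os
import tempfile

# Create local files mimicking the extracted-*.json naming scheme.
tmp = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(tmp, 'extracted-%012d.json' % i), 'w').close()
open(os.path.join(tmp, 'unrelated.txt'), 'w').close()

# The same wildcard as in the GCS path, applied locally:
matches = sorted(glob.glob(os.path.join(tmp, 'extracted-*.json')))
print(len(matches))  # 3 -- only the json shards match
```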

The thing is, I get the same exception even when hardcoding the input file path:

gs://mybucket/learning/pack_operation/20190524_1_0_/extracted-000000000000.json.json

Here is the exception:

Output:

/usr/local/lib/python2.7/site-packages/oauth2client/contrib/gce.py:99: UserWarning: You have requested explicit scopes to be used with a GCE service account.
Using this argument will have no effect on the actual scopes for tokens
requested. These scopes are set at VM instance creation time and
can't be overridden in the request.

  warnings.warn(_SCOPES_WARNING)
Traceback (most recent call last):
  File "/home/airflow/gcs/dags/data_learning_tools/inference/model_predict/sklearn_api/predictor.py", line 87, in <module>
    main()
  File "/home/airflow/gcs/dags/data_learning_tools/inference/model_predict/sklearn_api/predictor.py", line 73, in main
    | 'FormatOutput' >> beam.Map(Utils.format_output)
  File "/usr/local/lib/python2.7/site-packages/apache_beam/io/textio.py", line 536, in __init__
    skip_header_lines=skip_header_lines)
  File "/usr/local/lib/python2.7/site-packages/apache_beam/io/textio.py", line 120, in __init__
    validate=validate)
  File "/usr/local/lib/python2.7/site-packages/apache_beam/io/filebasedsource.py", line 121, in __init__
    self._validate()
  File "/usr/local/lib/python2.7/site-packages/apache_beam/options/value_provider.py", line 137, in _f
    return fnc(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/apache_beam/io/filebasedsource.py", line 178, in _validate
    match_result = FileSystems.match([pattern], limits=[1])[0]
  File "/usr/local/lib/python2.7/site-packages/apache_beam/io/filesystems.py", line 187, in match
    return filesystem.match(patterns, limits)
  File "/usr/local/lib/python2.7/site-packages/apache_beam/io/filesystem.py", line 723, in match
    raise BeamIOError("Match operation failed", exceptions)
apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'gs://mybucket/learning/pack_operation/20190524_1_0_0/extracted-000000000000.json': TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)}
Command exited with return code 1

So I'm getting a BeamIOError:

apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'gs://mybucket/learning/pack_operation/20190524_1_0_0/extracted-000000000000.json': TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)}
Command exited with return code 1

Any help would be very much appreciated. The whole code was working great before upgrading to apache-beam 2.12.0 (from apache-beam 2.5.0). The most interesting part is

TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)

I read the code of the first function raising the exception, but I couldn't find any clue about the issue.

Comments:

- Try running a python program that directly calls filesystems.match(['gs://1e42-analytics_data/learning/pack_operation/20190524_1_0_0/extracted-000000000000.json'], limits=[1]). If that fails, could you please print out and provide the stacktraces for the exceptions that are part of the BeamIOError raised here: https://github.com/apache/beam/blob/fab12c772d461fc8db4b3c361d38fe2781926fff/sdks/python/apache_beam/io/filesystem.py#L723 ? – Lukasz Cwik May 24 '19 at 18:36
- I just did a test with [this code](https://gist.github.com/gxercavins/0d336c0f47ee9156e141e8f13c98b682) and I can match the public files in `gs://dataflow-samples/shakespeare/*.txt` with no problem using `apache-beam[gcp]==2.12.0` – Guillem Xercavins May 24 '19 at 18:41
- Printing out the exceptions in the BeamIOError would help locate who the caller is and what is being called that has the problematic `__init__` – Lukasz Cwik May 24 '19 at 18:47
- @LukaszCwik I upgraded the google-apitools library and I could read the file (and start the job); here is the post that helped me: https://stackoverflow.com/questions/55630755/input-of-apache-beam-examples-wordcount. However, my apache-beam job is now stuck in the worker instantiation step, with this error message: `Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support` – MassyB May 27 '19 at 14:32
- This is really odd. It looks like it could be a bug. Would you consider using `apache_beam.io.fileio.MatchFiles` to find files? That may help you work around your issue – Pablo May 28 '19 at 21:46

0 Answers