
I am using Dataflow to process a Shapefile with about 4 million features (about 2GB total) and load the geometries into BigQuery. Before my pipeline starts, I extract the shapefile features into a list and initialize the pipeline using beam.Create(features). There are two ways I can create the initial feature list:

  1. Export each feature as a JSON string that subsequent DoFns will need to parse into a dict:
     features = [f.ExportToJson() for f in layer]
  2. Export a Python dict pre-parsed from the JSON string:
     features = [json.loads(f.ExportToJson()) for f in layer]

With option 1, beam.Create(features) takes a minute or so and the pipeline continues. With option 2, beam.Create(features) takes 3+ hours on a 6-core i7, and seems to spend a lot of that time in here:

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/typehints/trivial_inference.py", line 88, in <listcomp>
    typehints.Union[[instance_to_type(v) for k, v in o.items()]],
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/typehints/trivial_inference.py", line 88, in instance_to_type
    typehints.Union[[instance_to_type(v) for k, v in o.items()]],
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/typehints/trivial_inference.py", line 88, in <listcomp>

Is this trivial_inference step what is slowing down beam.Create when it is passed a list of dicts? Can I configure beam.Create to skip whatever it is doing in there, or otherwise speed it up so that a list of dicts isn't 100x slower than a list of strings?
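For intuition about why the traceback keeps re-entering the same line, here is a toy sketch of a recursive walk over one feature. The `count_visited` helper and the 1,000-vertex polygon are made up for illustration and are not Beam's code; the sketch only assumes that some per-value inspection descends into dicts and lists while treating a string as a single leaf.

```python
import json


def count_visited(obj):
    """Count how many values a naive recursive walk touches.

    Hypothetical stand-in for per-value type inspection; not Beam's code.
    """
    if isinstance(obj, dict):
        return 1 + sum(count_visited(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return 1 + sum(count_visited(v) for v in obj)
    return 1  # str, int, float, None: a leaf, inspected once


# Made-up feature with a 1,000-vertex polygon ring (illustrative data only).
ring = [[float(i), float(i) + 0.5] for i in range(1000)]
feature_dict = {
    "type": "Feature",
    "geometry": {"type": "Polygon", "coordinates": [ring]},
    "properties": {"id": 1},
}
feature_str = json.dumps(feature_dict)

print(count_visited(feature_str))   # 1: the whole JSON string is one leaf
print(count_visited(feature_dict))  # 3008: every nested value is visited
```

If something like this runs for each of 4 million features, the dict form does thousands of times more work per element than the string form, which would be consistent with the gap described above.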

Travis Webb
  • You could try running the pipeline with the `--no_pipeline_type_check` option to see if it makes a difference. – Yichi Zhang Dec 30 '20 at 21:25

1 Answer


Very interesting outcome!

My guess is that this happens because Create needs to pickle all of the data that it receives. The pickled size of the dictionaries may be large because they're pickled as Python objects, while strings are pickled as Python strings.

You could do:

(p
 | beam.Create([f.ExportToJson() for f in layer])
 | beam.Map(json.loads))

This way Create only sees strings, and the parsing into dicts happens in the Map step, which should avoid the extra pickling. Does that help?
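One rough way to check the pickling guess without running Beam is to compare the pickled size of the same feature in both forms. This is only a sketch: it uses the standard-library `pickle` module as a proxy (Beam's own serialization may behave differently), and the feature below is made-up illustrative data, not taken from the shapefile.

```python
import json
import pickle

# Made-up feature with a 1,000-vertex polygon ring (illustrative data only).
ring = [[float(i), float(i) + 0.5] for i in range(1000)]
feature_dict = {
    "type": "Feature",
    "geometry": {"type": "Polygon", "coordinates": [ring]},
    "properties": {"id": 1},
}
feature_str = json.dumps(feature_dict)

# Serialize both forms with the standard-library pickler.
pickled_str = pickle.dumps(feature_str)
pickled_dict = pickle.dumps(feature_dict)
str_bytes = len(pickled_str)
dict_bytes = len(pickled_dict)

print(f"pickled str:  {str_bytes} bytes")
print(f"pickled dict: {dict_bytes} bytes")

# Both forms carry the same data once the string is parsed again.
restored_from_str = json.loads(pickle.loads(pickled_str))
restored_from_dict = pickle.loads(pickled_dict)
print(restored_from_str == restored_from_dict)
```

If the two sizes come out similar, the slowdown is more likely to be in the type inference shown in the question's traceback than in pickling. Either way, the string form keeps the elements opaque until the Map step.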

Pablo