I am using Dataflow to process a Shapefile with about 4 million features (about 2 GB total) and load the geometries into BigQuery. Before my pipeline starts, I extract the shapefile features into a list and initialize the pipeline using beam.Create(features). There are two ways I can create the initial feature list:
- Export each feature as a JSON string that subsequent DoFns will need to parse into a dict:
    features = [f.ExportToJson() for f in layer]
- Export a Python dict pre-parsed from the JSON string:
    features = [json.loads(f.ExportToJson()) for f in layer]
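To make the two options concrete without needing the shapefile, here is a minimal sketch. The sample feature, `export_as_strings`, `export_as_dicts`, and `parse_feature` are hypothetical stand-ins I made up for illustration; the real strings come from f.ExportToJson() on the OGR layer and are much larger.

```python
import json

# Hypothetical stand-in for what f.ExportToJson() returns for one feature.
SAMPLE_FEATURE_JSON = json.dumps({
    "type": "Feature",
    "properties": {"id": 1},
    "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
})


def export_as_strings(json_features):
    """Option 1: keep each feature as an unparsed JSON string."""
    return list(json_features)


def export_as_dicts(json_features):
    """Option 2: parse every JSON string into a dict up front."""
    return [json.loads(s) for s in json_features]


def parse_feature(feature_json):
    """Under option 1, a downstream DoFn would do this parsing instead."""
    return json.loads(feature_json)


raw = [SAMPLE_FEATURE_JSON] * 3
features_option_1 = export_as_strings(raw)  # list of str
features_option_2 = export_as_dicts(raw)    # list of dict
```

Either list is what gets passed to beam.Create(features); the only difference is whether the parsing happens before the pipeline is built or inside it.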
With option 1, beam.Create(features) takes a minute or so and the pipeline continues. With option 2, beam.Create(features) takes 3+ hours on a 6-core i7 and seems to spend much of that time in here:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/typehints/trivial_inference.py", line 88, in <listcomp>
    typehints.Union[[instance_to_type(v) for k, v in o.items()]],
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/typehints/trivial_inference.py", line 88, in instance_to_type
    typehints.Union[[instance_to_type(v) for k, v in o.items()]],
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/typehints/trivial_inference.py", line 88, in <listcomp>
Is this trivial_inference what is slowing down beam.Create when passing in a list of dicts? Can I configure beam.Create to skip whatever it is doing in there, or otherwise speed it up so that a list of dicts isn't 100x slower than a list of strings?
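My guess at why the dict form is so much more work, based only on the traceback: instance_to_type appears to recurse into every value of each dict. The sketch below is not Beam's code. `count_inspected_values` is a hypothetical helper that counts how many values such a recursive walk would touch, and its handling of lists is my assumption, since the traceback only shows the dict case.

```python
import json


def count_inspected_values(obj):
    """Count the values a recursive, per-element walk over obj would visit."""
    if isinstance(obj, dict):
        return 1 + sum(count_inspected_values(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return 1 + sum(count_inspected_values(v) for v in obj)
    # Strings, numbers, and other leaves count as a single value.
    return 1


# Hypothetical small feature; real geometries have far more coordinates.
feature_dict = {
    "type": "Feature",
    "properties": {"id": 1, "name": "example"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]],
    },
}
feature_string = json.dumps(feature_dict)

visits_for_string = count_inspected_values(feature_string)
visits_for_dict = count_inspected_values(feature_dict)
```

A string counts as one value no matter how long it is, while the dict's count grows with every property and every coordinate, which would add up quickly across 4 million features if this is indeed what is happening.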