I'm using the Google Cloud Dataflow Python SDK to read 200k+ entities from Datastore with the ReadFromDatastore() connector, using a query that has no filters.
    # Protobuf definitions used by the v1 Datastore connector
    # (module path may differ slightly depending on the SDK version installed).
    from google.cloud.proto.datastore.v1 import query_pb2

    def make_example_entity_query():
        """
        Make an unfiltered query on the `ExampleEntity` kind.
        """
        query = query_pb2.Query()
        query.kind.add().name = "ExampleEntity"
        return query
I then do some work in the pipeline with this query:
    # apache_beam is imported as beam; ReadFromDatastore is the v1 connector
    # (apache_beam.io.gcp.datastore.v1.datastoreio) and PipelineOptions comes from
    # apache_beam.options.pipeline_options, as of the SDK version I'm on.
    query = make_example_entity_query()
    p = beam.Pipeline(options=PipelineOptions.from_dictionary(pipeline_options))
    (
        p
        | 'read in the new donations from Datastore'
            >> ReadFromDatastore(project, query, None)
        | 'protobuf2entity transformation'
            >> beam.Map(entity_from_protobuf)
        | 'do some work or something'
            >> beam.Map(lambda item: item[0] + item[1])
    )
    return p.run()
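For context, entity_from_protobuf converts the raw protobuf entities emitted by ReadFromDatastore into dict-like entities; it is essentially the Datastore client library's helper (a sketch, my actual helper may differ slightly):

    # Sketch of where the conversion step comes from: the google-cloud-datastore
    # client library can turn a Datastore Entity protobuf into a dict-like Entity.
    from google.cloud.datastore.helpers import entity_from_protobuf

    # This is what the 'protobuf2entity transformation' step maps over each element:
    # beam.Map(entity_from_protobuf)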
This runs fine locally against test data on the order of a few thousand entities, but when I deploy it to the cloud and run it against our production database with 200k+ items, it simply times out after an hour or so without making any progress. It seems to be stuck entirely on the read step.
The monitoring UI also shows that zero items were read, and it appears that only a single worker was ever spun up.
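For reference, the job is submitted with a plain options dictionary along these lines (the values here are placeholders, not the real configuration):

    # Placeholder options dict fed to PipelineOptions.from_dictionary() above.
    pipeline_options = {
        'runner': 'DataflowRunner',
        'project': 'my-gcp-project',                   # placeholder
        'job_name': 'read-example-entities',           # placeholder
        'staging_location': 'gs://my-bucket/staging',  # placeholder
        'temp_location': 'gs://my-bucket/temp',        # placeholder
        'save_main_session': True,
    }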
So I'm not really sure what's going on here. My questions are:
- Is there a practical limit to the amount of data that can be read from Datastore as an input to a pipeline?
- Why does no data seem to make it into the pipeline at all? When I run this locally, I can see data making it through, although quite slowly.
- Why does only a single worker ever spin up? I know that inequality filters on the read query force the read to be done from a single node, but this query has no filters of any kind (the read call is sketched again just below this list).
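For completeness, the only splitting knob I'm aware of on the read step is the connector's num_splits argument, which I have left at its default. A sketch of the call with the arguments spelled out, assuming the v1 connector's ReadFromDatastore(project, query, namespace=None, num_splits=0) signature:

    # The read step with its arguments written explicitly. num_splits=0 is the
    # default and means "let the connector choose the number of splits"; this is
    # what the pipeline above effectively runs.
    ReadFromDatastore(project, make_example_entity_query(), namespace=None, num_splits=0)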