I'm using the Google Cloud Dataflow Python SDK to read 200k+ entities from Datastore with the ReadFromDatastore() connector, using a query that has no filters.
    # Protobuf definitions used by the v1 Datastore connector
    # (module path may differ slightly depending on the SDK version installed).
    from google.cloud.proto.datastore.v1 import query_pb2

    def make_example_entity_query():
        """
        Make an unfiltered query on the `ExampleEntity` kind.
        """
        query = query_pb2.Query()
        query.kind.add().name = "ExampleEntity"
        return query
I then do some work in the pipeline with this query:
    # apache_beam is imported as beam; ReadFromDatastore is the v1 connector
    # (apache_beam.io.gcp.datastore.v1.datastoreio) and PipelineOptions comes from
    # apache_beam.options.pipeline_options, as of the SDK version I'm on.
    query = make_example_entity_query()
    p = beam.Pipeline(options=PipelineOptions.from_dictionary(pipeline_options))
    (
        p
        | 'read in the new donations from Datastore'
            >> ReadFromDatastore(project, query, None)
        | 'protobuf2entity transformation'
            >> beam.Map(entity_from_protobuf)
        | 'do some work or something'
            >> beam.Map(lambda item: item[0] + item[1])
    )
    return p.run()
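For context, entity_from_protobuf converts the raw protobuf entities emitted by ReadFromDatastore into dict-like entities; it is essentially the Datastore client library's helper (a sketch, my actual helper may differ slightly):

    # Sketch of where the conversion step comes from: the google-cloud-datastore
    # client library can turn a Datastore Entity protobuf into a dict-like Entity.
    from google.cloud.datastore.helpers import entity_from_protobuf

    # This is what the 'protobuf2entity transformation' step maps over each element:
    # beam.Map(entity_from_protobuf)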
This runs fine locally against test data on the order of a few thousand entities, but when I deploy it to the cloud and run it against our production database with 200k+ items, it simply times out after an hour or so without making any progress. It seems to be stuck entirely on the read step.
The monitoring UI also shows that zero items were read, and it appears that only a single worker was ever spun up.
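For reference, the job is submitted with a plain options dictionary along these lines (the values here are placeholders, not the real configuration):

    # Placeholder options dict fed to PipelineOptions.from_dictionary() above.
    pipeline_options = {
        'runner': 'DataflowRunner',
        'project': 'my-gcp-project',                   # placeholder
        'job_name': 'read-example-entities',           # placeholder
        'staging_location': 'gs://my-bucket/staging',  # placeholder
        'temp_location': 'gs://my-bucket/temp',        # placeholder
        'save_main_session': True,
    }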
So I'm not really sure what's going on here. My questions are:
- Is there a practical limit to the amount of data that can be read from Datastore as an input to a pipeline?
- Why does no data seem to make it into the pipeline at all? When I run this locally, I can see data making it through, although quite slowly.
- Why does only a single worker ever spin up? I know that inequality filters on the read query force the read to be done from a single node, but this query has no filters of any kind (the read call is sketched again just below this list).
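For completeness, the only splitting knob I'm aware of on the read step is the connector's num_splits argument, which I have left at its default. A sketch of the call with the arguments spelled out, assuming the v1 connector's ReadFromDatastore(project, query, namespace=None, num_splits=0) signature:

    # The read step with its arguments written explicitly. num_splits=0 is the
    # default and means "let the connector choose the number of splits"; this is
    # what the pipeline above effectively runs.
    ReadFromDatastore(project, make_example_entity_query(), namespace=None, num_splits=0)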