I'm using the Google Cloud Dataflow Python SDK to read 200k+ entities from Datastore, using the ReadFromDatastore() transform on a query without any filters.

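# Import path assumed from the v1 Datastore connector; adjust to your environment.
from google.cloud.proto.datastore.v1 import query_pb2

def make_example_entity_query():
    """
    Make an unfiltered query on the `ExampleEntity` kind.
    """
    query = query_pb2.Query()
    query.kind.add().name = "ExampleEntity"
    return query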

I then use this query in the pipeline and do some work on the results:

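# Imports assumed from the names used below; adjust to your environment.
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.datastore.helpers import entity_from_protobuf

# Excerpt from inside a function; `project`, `query` and `pipeline_options` are defined elsewhere.
p = beam.Pipeline(options=PipelineOptions.from_dictionary(pipeline_options))
(
    p
    | 'read in the new donations from Datastore'
    >> ReadFromDatastore(project, query, None)  # third argument is the namespace
    | 'protobuf2entity transformation'
    >> beam.Map(entity_from_protobuf)
    | 'do some work or something'
    >> beam.Map(lambda item: item[0] + item[1])
)
return p.run()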

This runs fine locally on test data of a few thousand entities, but when I deploy it to the cloud and run it against our production database with 200k+ entities, it times out after about an hour without making any progress. It appears to be stuck entirely on the read step.

The job also reports that zero items were read.

It also appears that only a single worker was ever spun up.

I'm not sure what's going on here. My questions are:

  1. Is there a practical limit on the amount of data that can be read from Datastore as input to a pipeline?
  2. Why does no data seem to reach the pipeline at all? When I run this locally I can see data passing through, although quite slowly.
  3. Why does only a single worker spin up? I know that an inequality filter on the query forces the read to run on a single node, but this query has no inequality filters.
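
To make question 3 concrete, here is a rough sketch of how a splittable read might choose its number of sub-queries. It is not taken from the SDK source: the 64 MiB target bundle size, the lower and upper bounds, the function name `estimate_num_splits`, and the 1 KiB-per-entity sizing are all assumptions for illustration only.

```python
# Illustrative only: every constant here is an assumption, not a confirmed SDK value.
DEFAULT_BUNDLE_SIZE_BYTES = 64 * 1024 * 1024  # assumed target size per sub-query
MIN_SPLITS = 12                               # assumed lower bound
MAX_SPLITS = 50000                            # assumed upper bound


def estimate_num_splits(estimated_size_bytes, has_inequality_filter=False):
    """Return how many sub-queries a read might be split into (hypothetical helper)."""
    if has_inequality_filter:
        # A query with an inequality filter is treated as unsplittable: one query, one reader.
        return 1
    wanted = round(estimated_size_bytes / DEFAULT_BUNDLE_SIZE_BYTES)
    # Clamp the size-based estimate between the assumed bounds.
    return max(MIN_SPLITS, min(MAX_SPLITS, wanted))


# Hypothetical sizing: 200,000 entities at roughly 1 KiB each.
size_bytes = 200_000 * 1024
print(estimate_num_splits(size_bytes))
print(estimate_num_splits(size_bytes, has_inequality_filter=True))
```

Under these assumptions, an unfiltered query over 200k small entities would still be divided into more than one sub-query, which is why a single worker and zero items read look like something other than a data-volume limit.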
John Allard
  • It's hard to know exactly why the job is stuck without looking at it closely. Can you try contacting Google Cloud Support with your job ID? https://cloud.google.com/support/ – chamikara Apr 08 '19 at 14:51

1 Answer

This is being addressed in a GitHub issue, so please refer to that.

chamikara