I'm trying to read multiple Datastore kinds from the default namespace in my Python pipeline and want to work on them. The functions I wrote work fine locally with DirectRunner, but when I run the pipeline on the cloud with DataflowRunner, one of the kinds (which contains 1500 records) is read very fast, while the other one (which contains millions of records) is read very, very slowly.
For reference: when I read just the large kind (millions of records) on its own, the pipeline took 10 minutes, but when reading both kinds together it ran for almost an hour and had still only processed about 1/10th of the records.
I'm not able to figure out what the problem is.
This is my code:
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.pvalue import AsIter
from google.cloud.proto.datastore.v1 import query_pb2

def read_from_datastore(project, user_options, pipeline_options):
    p = beam.Pipeline(options=pipeline_options)

    # 1st kind: the one with millions of records
    query = query_pb2.Query()
    query.kind.add().name = user_options.kind
    students = p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)

    # 2nd kind: the one with 1500 records
    query = query_pb2.Query()
    query.kind.add().name = user_options.kind2
    courses = p | 'ReadFromDatastore2' >> ReadFromDatastore(project=project, query=query)

    open_courses = courses | 'closed' >> beam.FlatMap(filter_closed_courses)
    enrolled_students = students | beam.ParDo(ProfileDataDumpDataFlow(), AsIter(open_courses))
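For context, here is a plain-Python sketch of what the side-input pattern above does: every element of the main collection (students) is processed with the entire side collection (open courses) available. All of the data and the filtering logic below are made up for illustration; the real filter_closed_courses and ProfileDataDumpDataFlow are not shown in this post.

```python
# Hypothetical stand-ins for the two Datastore kinds.
courses = [
    {"id": 1, "closed": False},
    {"id": 2, "closed": True},
    {"id": 3, "closed": False},
]
students = [
    {"name": "a", "course_id": 1},
    {"name": "b", "course_id": 2},
]

# Mirrors the FlatMap(filter_closed_courses) step: keep only open courses.
open_courses = [c for c in courses if not c["closed"]]

# Mirrors ParDo(..., AsIter(open_courses)): each student element is
# processed with the full open_courses collection in hand.
open_ids = {c["id"] for c in open_courses}
enrolled = [s for s in students if s["course_id"] in open_ids]

print(enrolled)  # only student "a", whose course is open
```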
Let me know if anyone has any idea why this happens.