
I'm trying to read multiple Datastore kinds from the default namespace in my Python pipeline and work on them. The functions I wrote work fine locally with the DirectRunner, but when I run the pipeline on the cloud with the DataflowRunner, one of the kinds (which contains 1,500 records) is read very fast, while the other one (which contains millions of records) is read very, very slowly.

For reference, when I read just the one kind with millions of records, the pipeline took 10 minutes; but when I execute both of them together, it has been running for almost 1 hour and has still processed only 1/10th of the records.

I'm not able to figure out what the problem is.

This is my code:

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.pvalue import AsIter
from google.cloud.proto.datastore.v1 import query_pb2

def read_from_datastore(project, user_options, pipeline_options):
    p = beam.Pipeline(options=pipeline_options)

    # 1st kind: the one with millions of records
    query = query_pb2.Query()
    query.kind.add().name = user_options.kind
    students = p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)

    # 2nd kind: the one with ~1,500 records
    query2 = query_pb2.Query()
    query2.kind.add().name = user_options.kind2
    courses = p | 'ReadFromDatastore2' >> ReadFromDatastore(project=project, query=query2)

    # filter_closed_courses and ProfileDataDumpDataFlow are defined elsewhere
    open_courses = courses | 'closed' >> beam.FlatMap(filter_closed_courses)
    enrolled_students = students | beam.ParDo(ProfileDataDumpDataFlow(), AsIter(open_courses))

    p.run().wait_until_finish()
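
The job is launched with the stock Dataflow options. A rough sketch of how they are built is below; the project, bucket, and job name are placeholders, not the values used in the real job:

# Rough sketch only; project, bucket, and job name are placeholders.
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'

gcloud_options = pipeline_options.view_as(GoogleCloudOptions)
gcloud_options.project = 'my-project'                        # placeholder
gcloud_options.job_name = 'read-datastore-kinds'             # placeholder
gcloud_options.staging_location = 'gs://my-bucket/staging'   # placeholder
gcloud_options.temp_location = 'gs://my-bucket/temp'         # placeholder
# num_workers, max_num_workers and machine_type are left at their defaults

These options, together with the custom options object that carries the kind names, are what get passed into read_from_datastore above.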

Let me know if anyone has any idea why this happens.

  • Can you share the [pipeline options](https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options) you specified? Especially num_workers, max_num_workers and machine_type. – Yurci Aug 23 '18 at 13:50
  • You can look at [this](https://cloud.google.com/dataflow/examples/cookbook#joinexamples) example of how to do relational joins using Dataflow. – Yurci Aug 31 '18 at 10:28
  • Hey @Yurci, the pipeline options are the default Google Dataflow options, with no modifications to them. – sagar kothari Sep 02 '18 at 05:22

1 Answer


I see that you are performing a join operation on two kinds. For this purpose it would be more suitable, and quicker, to export the entities to a bucket and then load them into BigQuery. Perform the required join operation within BigQuery.

It is not reading the entities that is taking time in your job; it is the join operation.
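
As a rough illustration: assuming the two kinds have already been exported to Cloud Storage (e.g. with gcloud datastore export) and loaded into BigQuery tables, the join itself could then be run with the BigQuery client. The dataset, table, field names, and filter below are placeholders, not your actual schema:

# Sketch only: assumes the Students and Courses kinds were exported and
# loaded into BigQuery as my_dataset.students / my_dataset.courses.
# Dataset, table, and field names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # placeholder project

sql = """
    SELECT s.*
    FROM `my-project.my_dataset.students` AS s
    JOIN `my-project.my_dataset.courses`  AS c
      ON s.course_id = c.course_id        -- placeholder join key
    WHERE c.status = 'open'               -- placeholder filter for open courses
"""

for row in client.query(sql).result():
    print(row)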

Yurci
  • I don't think the join is the problem here. When I just read both kinds together, after commenting out the ParDo, and print them directly, the situation is still the same: the Courses, which number around 1,500, are read almost instantly, while the Students, which have millions of records, take a lot of time. Alternatively, when I hardcode some 200-300 courses and run it, the join works perfectly and I get the output in about 15-20 minutes for all the Student data. – sagar kothari Sep 02 '18 at 05:26