
I am building a Google Cloud Dataflow pipeline with the Apache Beam Java SDK. After a few transforms I end up with a collection of entities (PCollection<Entity>). I need to write this collection to Google Cloud Datastore and then run another transform AFTER all entities have been written (for example, broadcasting the IDs of the saved objects in a Pub/Sub message to multiple subscribers).

Now, the way to store a PCollection is: entities.apply(DatastoreIO.v1().write().withProjectId("abc"))

This returns a PDone object, and I am not sure how to chain another transform so that it runs after the write has completed. Since the DatastoreIO write does not return a PCollection, I cannot extend the pipeline any further. I have two questions:

  1. How can I get the IDs of the objects written to Datastore?

  2. How can I attach another transform that runs only after all entities have been saved?

rjdkolb
Venky

1 Answer


We don't have a good way to do either of these things (returning IDs of written Datastore entities, or waiting until entities have been written), though this is far from the first similar request (people have asked for this for BigQuery, for example) and we're thinking about it.

Right now your only option is to wait until the entire pipeline finishes, e.g. via pipeline.run().waitUntilFinish(), and then do what you wanted in your main program (e.g. you can run another pipeline).
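As a rough illustration of that workaround, here is a minimal sketch of a main program that runs a write pipeline, blocks until it finishes, and only then starts a second pipeline. It assumes the Beam GCP IO and protobuf extension modules are on the classpath. The class name SequentialPipelines, the helper succeeded, the kind "Task", the key name "task-1", and the topic "projects/abc/topics/saved-entities" are all made up for the example. Note that the DatastoreIO documentation requires entities to have complete keys before they are written, so the sketch assumes the IDs are already known in the main program.

```java
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.client.DatastoreHelper;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class SequentialPipelines {

  // True only when the runner reports that the job completed successfully.
  static boolean succeeded(PipelineResult.State state) {
    return state == PipelineResult.State.DONE;
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Hypothetical entity with a complete key (kind "Task", name "task-1").
    Entity entity = Entity.newBuilder()
        .setKey(DatastoreHelper.makeKey("Task", "task-1"))
        .build();

    // Pipeline 1: write the entities to Datastore. The write returns PDone,
    // so nothing further can be chained inside this pipeline.
    Pipeline p1 = Pipeline.create(options);
    p1.apply(Create.of(entity).withCoder(ProtoCoder.of(Entity.class)))
        .apply(DatastoreIO.v1().write().withProjectId("abc"));

    // Block in the main program until the whole write pipeline has ended.
    PipelineResult.State state = p1.run().waitUntilFinish();
    if (!succeeded(state)) {
      throw new IllegalStateException("Write pipeline ended in state " + state);
    }

    // Pipeline 2: starts only after pipeline 1 has finished. Here it publishes
    // the (already known) key name to a hypothetical Pub/Sub topic.
    Pipeline p2 = Pipeline.create(options);
    p2.apply(Create.of("task-1"))
        .apply(PubsubIO.writeStrings().to("projects/abc/topics/saved-entities"));
    p2.run().waitUntilFinish();
  }
}
```

Checking the returned state before starting the second pipeline avoids announcing IDs when the write pipeline failed or was cancelled.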

jkff
  • Thanks for your answer. I shall try it out. Just to confirm my understanding: we can have p1.run().waitUntilFinish(), and then p2.run(), in which case pipeline p2 will start after p1 has finished. Is that correct? – Venky Sep 19 '17 at 09:22
  • I tried it and it worked when I used the BlockingDataflowPipelineRunner. But when I use templates (Ref: https://cloud.google.com/dataflow/docs/templates/overview), I am not sure how to make it work. I guess one template file is associated with one pipeline only? How can I create a single template that creates a pipeline, executes it, waits for it to finish, and then starts another pipeline? – Venky Sep 19 '17 at 10:30
  • This is not possible - 1 template is 1 pipeline. – jkff Sep 19 '17 at 15:32
  • Ok. Thanks for the confirmation, Eugene – Venky Sep 20 '17 at 04:02