8

After reading the Cloud Dataflow docs, I am still not sure how I can run my Dataflow job from App Engine. Is it possible? Does it matter whether my backend is written in Python or in Java? Thanks!

deemson
  • 367
  • 4
  • 7

3 Answers

4

Yes, it is possible; you need to use "Streaming execution", as mentioned here.

Using Google Cloud Pub/Sub as a streaming source, you can use it as the "trigger" of your pipeline.

From App Engine you can publish to the Pub/Sub topic with the REST API.
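As a rough sketch of that publish step (the payload and attribute names here are made up, and authentication via an OAuth token is omitted), the JSON body for a Pub/Sub `projects.topics.publish` REST call can be assembled like this:

```python
import base64
import json


def build_publish_body(payload: bytes, attributes=None) -> str:
    """Assemble the JSON body for a Pub/Sub publish REST call.

    Pub/Sub requires the message data to be base64-encoded.
    """
    message = {"data": base64.b64encode(payload).decode("ascii")}
    if attributes:
        message["attributes"] = attributes
    return json.dumps({"messages": [message]})


# Hypothetical event payload; your App Engine handler would POST this
# body to the topic's publish URL with an authorized HTTP client.
body = build_publish_body(b'{"event": "new-data"}', {"source": "gae"})
```

The Dataflow streaming pipeline then reads from the corresponding Pub/Sub subscription and is effectively "triggered" by each published message.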

aqquadro
  • 1,027
  • 10
  • 17
  • thanks! When using streaming execution, a Compute Engine instance with my dataflow job will need to be up and running 24/7, right? – deemson Apr 14 '15 at 14:35
  • 1
If you use the "cloud execution" I think it starts and stops instances when needed :) – aqquadro Apr 14 '15 at 14:49
1

One way would indeed be to use Pub/Sub from within App Engine to let Cloud Dataflow know when new data is available. The Cloud Dataflow job would then run continuously and App Engine would provide the data for processing.

A different approach would be to add the code that sets up the Cloud Dataflow pipeline to a class in App Engine (including the Dataflow SDK in your GAE project) and set the job options programmatically as explained here:

https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Make sure to set the 'runner' option to DataflowPipelineRunner, so the job executes asynchronously on the Google Cloud Platform. Since the pipeline runner (which actually runs your pipeline) does not have to run in the same environment as the code that initiates it, this setup code (up until pipeline.run()) could live in App Engine.
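A minimal sketch of assembling those execution parameters (the project ID, staging bucket, and helper name are placeholders; the real code would hand these options to the Dataflow SDK's option parser before calling pipeline.run()):

```python
def build_dataflow_args(project: str, staging_location: str,
                        runner: str = "DataflowPipelineRunner") -> list:
    """Assemble command-line-style execution options for a Dataflow job.

    With the DataflowPipelineRunner, submitting the pipeline hands the
    job off to the Dataflow service rather than executing it locally,
    which is why the submitting code can sit in an App Engine handler.
    """
    return [
        "--project=" + project,
        "--stagingLocation=" + staging_location,
        "--runner=" + runner,
    ]


args = build_dataflow_args("my-gcp-project", "gs://my-bucket/staging")
```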

You can then add an endpoint or servlet to GAE that when called, runs the code that sets up the pipeline.

To go further with scheduling, you could have a cron job in GAE that calls the endpoint that initiates the pipeline...
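For example, a GAE cron.yaml entry along these lines (the endpoint path and schedule are just illustrative) would hit that endpoint periodically:

```yaml
cron:
- description: kick off the Dataflow pipeline
  url: /start-pipeline
  schedule: every 24 hours
```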

GavinG
  • 121
  • 3
  • Hi Gavin, our backend is written in Python and we were hoping to use an endpoint to trigger our Dataflow jobs. To use the second option that you are presenting, is it best if we make our pipeline a Dataflow template? This may be my misunderstanding, but if I have all of the pipeline setup including pipeline.run() in App Engine, would it still run asynchronously on Google Cloud Platform and not time out in App Engine? Thank you – T.Okahara Mar 02 '17 at 18:46
0

There might be a way to submit your Dataflow job from App Engine, but this is not something that's actively supported, as suggested by the lack of docs. App Engine's runtime environment makes some of the operations required to submit Dataflow jobs, such as obtaining credentials, more difficult.

Jeremy Lewi
  • 6,386
  • 6
  • 22
  • 37
  • thanks for the response! I didn't quite understand what you meant by "is not something that's actively supported". Is it supported, but badly? Or is it not supported at all in a "clean and officially suggested way"? – deemson Apr 14 '15 at 14:20
  • I meant it might be possible but it is not supported and I can't guarantee it will work. – Jeremy Lewi Apr 14 '15 at 19:17