
I've created a Pipeline using the Scio wrapper for Apache Beam. I want to deploy it in Google Dataflow.

I want there to be a specific button, endpoint, or Cloud Function that executes this job regularly.

All of the instructions I can find involve running sbt runMain/pack, which builds the artifacts and uploads them each and every time.

How can I upload the artifacts once, and then create a job based on the pipeline as easily as possible?

Gabriel Curio

3 Answers


At Spotify, the way we dealt with this was to create a Docker image for the Scio pipeline and execute that image via Styx, which is basically a Kubernetes-based cron, but you could execute it via your good old cron too (or Airflow/Luigi/GCP Composer), whatever fits your use case best. Beam has a built-in caching mechanism for dependencies, so consecutive runs just reuse previously uploaded files. Scio also supports the Dataflow templates mentioned in the other answer.
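
For illustration, here is a minimal sketch of the cron half of this setup; the image name, schedule, and date argument are assumptions, not details from the answer:

```sh
# Hypothetical crontab entry: run the pipeline image daily at 06:00,
# passing the current date (crontab requires escaping "%" as "\%").
0 6 * * * docker run --rm my-registry/my-scio-pipeline:latest --date=$(date +\%Y-\%m-\%d)
```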

rav
  • What would the cron job do? Execute MyPipeline.main every time? My hope is that I can deploy the Pipeline to a given Runner and then create new jobs without touching the local MyPipeline.main again. It seems Beam itself doesn't support that; I'd need to use Dataflow templates or a Docker image. There's no concept of a Job in Beam? – Gabriel Curio Oct 24 '18 at 18:04
  • @GabrielCurio yep, probably with an argument that is the current date (or whatever you need). So if the pipeline is in a Docker image with an entry point that executes the pipeline's jar, you would have a cron that runs "docker run <image>". – rav Oct 24 '18 at 18:06
  • @GabrielCurio regarding "templating" a job, please take a look at the Dataflow template doc in https://stackoverflow.com/a/52968062/1661491; other than that, executing main might be the way to go. – rav Oct 24 '18 at 18:18

I don't know exactly how this would work with Scio but, generally, you can create a custom Dataflow template and then execute it through the console, API calls, gcloud commands, or the client libraries.

If you want to execute it periodically, you could create a cron job that launches it using the client libraries.
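
As a sketch of the gcloud route, assuming the template has already been staged to GCS (the job, bucket, and parameter names are placeholders):

```sh
# Launch a Dataflow job from a staged template.
gcloud dataflow jobs run my-nightly-job \
  --gcs-location gs://mybucket/templateName.json \
  --parameters input=gs://mybucket/input.txt,output=gs://mybucket/output
```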

Iñigo
  • When I create a template, the field "filesToStage" lists files on my local desktop. Will the job described in the template be able to run without access to those files? Wouldn't they have to be uploaded, and the location in the cloud specified? – Gabriel Curio Oct 24 '18 at 13:18
  • You can store them in [Google Cloud Storage](https://cloud.google.com/storage/). In the documentation I sent about creating templates, you can see an example where the text of King Lear is stored in a GCS bucket. – Iñigo Oct 24 '18 at 13:23

My issue is resolved. Here's my advice to my previous self:

  1. No, Apache Beam does not have any built-in deployment features beyond running a Pipeline.
  2. Any sort of Job (a Pipeline run within a specific context) has to be provided by the system the Runner operates on.
  3. Dataflow offers such a thing: the Template. Templates let you turn a Pipeline into a Job with the click of a button.
  4. The template itself is a JSON document.
  5. You can provide the Template with args via a user interface (if you use ValueProvider objects), or assign the args directly in the Template JSON file (see the sketch after this list).
  6. You can autogenerate a template file by adding --templateLocation=gs://mybucket/templateName.json to the program args (a staging command is sketched below).
  7. The Template JSON file contains lots of scary things, like "filesToStage".
  8. Don't worry about the stuff you don't understand. "filesToStage" probably exists to make sure the artifacts are properly deployed, hence the references to your local drive.
  9. Permissions can be a problem the first time around.
  10. There's a nasty bug in Beam/Scio that causes Beam to "forget" about the Google filesystem "gs://" scheme. Fix it by running FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create) before staging (also shown in the sketch below).
  11. Use Google Cloud Functions to activate the job; there's a very nice tutorial on how to do that on Google's website (the underlying REST call is sketched below).
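
To make steps 5 and 10 concrete, here is a minimal Scala sketch using plain Beam APIs; the trait, object, and option names are invented for illustration, not taken from the original pipeline:

```scala
import org.apache.beam.sdk.io.FileSystems
import org.apache.beam.sdk.options.{Description, PipelineOptions, PipelineOptionsFactory, ValueProvider}

// Hypothetical options trait: ValueProvider fields are resolved when the
// template is executed, so each job can supply its own values via UI or API.
trait MyTemplateOptions extends PipelineOptions {
  @Description("Input file pattern")
  def getInput: ValueProvider[String]
  def setInput(value: ValueProvider[String]): Unit
}

object MyPipeline {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Step 10 workaround: re-register file systems so Beam doesn't
    // "forget" the gs:// scheme before the template is staged.
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create())

    val options = PipelineOptionsFactory
      .fromArgs(cmdlineArgs: _*)
      .withValidation()
      .as(classOf[MyTemplateOptions])

    // ... build and run the pipeline here, reading from options.getInput ...
  }
}
```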
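
Staging the template (step 6) is then a one-time command along these lines; the class, project, and bucket names are made up:

```sh
# One-time template creation; afterwards jobs are launched from the
# staged JSON, with no local build or upload needed.
sbt "runMain com.example.MyPipeline \
  --runner=DataflowRunner \
  --project=my-project \
  --tempLocation=gs://mybucket/temp \
  --templateLocation=gs://mybucket/templateName.json"
```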
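
And for step 11, the REST call a Cloud Function would make to launch a job from the staged template looks roughly like this (the project, job, and parameter names are placeholders):

```sh
# Launch a job from the template via the Dataflow REST API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"jobName": "my-templated-job", "parameters": {"input": "gs://mybucket/input.txt"}}' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/templates:launch?gcsPath=gs://mybucket/templateName.json"
```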

@ravwojdyla and @Iñigo - thank you both for your help.

Gabriel Curio