
I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it, and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and doesn't tell you how to migrate from Mapreduce.

In particular, the issues are these:

  1. In Mapreduce, jobs would run on your existing deployed application. In Beam, however, you have to create and deploy a custom Docker image to build the environment for Dataflow. Is this right?
  2. To create a new job template in Mapreduce, you just edit a YAML file and deploy it. To create one in Apache Beam, you need to write custom runner code, deploy a template file to Google Cloud Storage, and link it up with the Docker image. Is this right?

Is the above accurate? If so, is it generally the case that working with Dataflow is much more difficult than Mapreduce? Are there any libraries or tips for making this easier?

speedplane

1 Answer

  1. In technical terms that's what is happening, but unless you have a specific advanced use case, you won't need to build a custom Docker image yourself. Behind the scenes, Dataflow packages your user code and its dependencies into a container so it can execute them on its worker VMs (a minimal sketch follows after this list).

  2. In Dataflow, writing a job template mainly means writing pipeline code in your chosen language (Java or Python), plus possibly some metadata. Once the code is written, creating and staging the template itself isn't much different from running a normal Dataflow job (a second sketch follows below). There's a page documenting the process.
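
To make point 1 concrete, here is a minimal sketch of running a pipeline on Dataflow with no custom Docker image at all; the project ID and bucket names are hypothetical placeholders. Dataflow builds the worker container for you, and extra pip dependencies, if any, can be listed in a requirements file passed via the standard requirements_file option:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical project and bucket; no Docker image is built by hand.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        requirements_file="requirements.txt",  # extra pip deps, if any
    )

    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["a", "b", "a"])
         | beam.Map(lambda x: (x, 1))
         | beam.CombinePerKey(sum)   # count occurrences per key
         | beam.Map(print))          # shows up in worker logs on Dataflow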
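
And for point 2, staging a classic template is the same pipeline code run once with template_location set: instead of executing the job, Dataflow serializes the job graph to Cloud Storage. Again, all paths here are hypothetical:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        # Setting template_location stages a classic template to GCS
        # instead of launching a job.
        template_location="gs://my-bucket/templates/my-template",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText("gs://my-bucket/input/*.txt")
         | beam.Map(str.upper)
         | beam.io.WriteToText("gs://my-bucket/output/result"))

You can then launch jobs from the staged template with, for example, gcloud dataflow jobs run my-job --gcs-location gs://my-bucket/templates/my-template --region us-central1.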

I agree the page on Mapreduce to Beam migration is very sparse and unhelpful, although I think I understand why. Migrating from Mapreduce to Beam isn't a straightforward 1:1 migration where only the syntax changes: Beam is a different pipeline model, and the move will most likely require rewriting some of your code. A migration guide that fully covered everything would end up repeating most of the existing documentation. That said, a rough sketch of how the pieces correspond follows below.
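
The general shape of the rewrite is that the map and reduce callbacks from a Mapreduce job become ordinary Beam transforms, and the implicit shuffle between them becomes an explicit grouping or combining step. This is only a rough sketch with made-up data, not a drop-in recipe:

    import apache_beam as beam

    def map_fn(record):
        # Old Mapreduce "map" phase: emit (key, value) pairs.
        yield (record["kind"], 1)

    # Runs locally on the DirectRunner; on Dataflow you'd pass the
    # pipeline options shown above instead.
    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.Create([{"kind": "User"},
                                  {"kind": "Post"},
                                  {"kind": "User"}])
         | "Map" >> beam.FlatMap(map_fn)        # map phase
         | "Reduce" >> beam.CombinePerKey(sum)  # shuffle + reduce phase
         | "Print" >> beam.Map(print))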

Since it sounds like most of your questions are about setting up and executing Beam pipelines, I encourage you to begin with the Dataflow quickstart in your chosen language. It won't teach you how to write pipelines, but it will show you how to set up your environment to write and run them. The quickstarts link to Apache Beam tutorials that teach the Beam API and how to write your own pipelines; those will be useful for rewriting your Mapreduce code in Beam.

Daniel Oliveira
  • Beam is, sadly, significantly more complex to set up and use than Mapreduce. I know Beam provides much more power and flexibility, but I'm going to miss Mapreduce. – speedplane Jan 26 '21 at 04:49