I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it, and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and doesn't explain how to migrate.
In particular, my issues are these:
- In Mapreduce, jobs would run on your existing deployed application. In Beam, however, you have to create and deploy a custom Docker image to build the environment that Dataflow runs in. Is this right?
- To create a new job template in Mapreduce, you just edit a yaml file and deploy it. To create one in Apache Beam, you have to write custom runner code, deploy a template file to Google Cloud Storage, and link it up with the Docker image (see my sketch below). Is this right?
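For concreteness, here is a minimal sketch of what I believe the Beam equivalent of one of our Datastore jobs would look like, based on my reading of the Beam docs. `MyModel`, `my-project`, and `ProcessEntity` are placeholders for our actual entity kind, project, and handler logic:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query


class ProcessEntity(beam.DoFn):
    """Stand-in for what used to be the Mapreduce handler function."""

    def process(self, entity):
        # Transform or act on each Datastore entity here.
        yield entity


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'ReadFromDatastore' >> ReadFromDatastore(
             query=Query(kind='MyModel', project='my-project'))
         | 'ProcessEntities' >> beam.ParDo(ProcessEntity()))


if __name__ == '__main__':
    run()
```

And then, as far as I can tell, turning this into a reusable template means running the script once with `--runner DataflowRunner --template_location gs://my-bucket/templates/my_job` (plus the usual `--project`, `--region`, and `--temp_location` flags, with a placeholder bucket here) to stage the template to Cloud Storage. That is a lot more moving parts than editing `mapreduce.yaml`.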
Is the above accurate? If so, is working with Dataflow generally much more difficult than working with Mapreduce? Are there any libraries or tips for making this easier?