
I have some complex Oozie workflows to migrate from on-prem Hadoop to GCP Dataproc. The workflows consist of shell scripts, Python scripts, Spark-Scala jobs, Sqoop jobs, etc.

I have come across some potential solutions that could cover my workflow scheduling needs:

  1. Cloud Composer
  2. Dataproc Workflow Templates with Cloud Scheduler
  3. Install Oozie on a Dataproc auto-scaling cluster

Please let me know which option would be most efficient in terms of performance, cost, and migration complexity.

Balajee Venkatesh

1 Answer


All three are reasonable options (though #2, Cloud Scheduler + Dataproc, is the clunkiest). A few questions to consider: how often do your workflows run, how tolerant are you of unused VMs, how complex are your Oozie workflows, and how much time are you willing to invest in the migration?

Dataproc Workflow Templates support branch/join, but they lack other Oozie features such as error handling on job failure, decision nodes, etc. If you rely on any of these, I would not even consider a direct migration to Workflow Templates; choose either #3 or the hybrid migration below.
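To make the branch/join point concrete, here is a minimal sketch using the `google-cloud-dataproc` Python client; the project, region, bucket, class, and job names are placeholders. Parallel steps are expressed with `prerequisite_step_ids`, and note there is no equivalent of Oozie's error transitions or decision nodes anywhere in the template:

```python
# Minimal sketch: a Dataproc Workflow Template whose steps fan out after a
# prep job and join again, expressed via prerequisite_step_ids.
# Project ID, region, cluster name, and jar/script URIs are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

template = {
    "id": "branch-join-demo",
    "placement": {
        "managed_cluster": {
            "cluster_name": "wf-cluster",
            "config": {},  # cluster shape omitted for brevity, defaults apply
        }
    },
    "jobs": [
        # Step with no prerequisites runs first.
        {"step_id": "prep",
         "pyspark_job": {"main_python_file_uri": "gs://my-bucket/prep.py"}},
        # Two branches fan out from "prep" and run in parallel.
        {"step_id": "branch-a", "prerequisite_step_ids": ["prep"],
         "spark_job": {"main_class": "com.example.JobA",
                       "jar_file_uris": ["gs://my-bucket/jobs.jar"]}},
        {"step_id": "branch-b", "prerequisite_step_ids": ["prep"],
         "spark_job": {"main_class": "com.example.JobB",
                       "jar_file_uris": ["gs://my-bucket/jobs.jar"]}},
        # The join step waits on both branches.
        {"step_id": "join", "prerequisite_step_ids": ["branch-a", "branch-b"],
         "pyspark_job": {"main_python_file_uri": "gs://my-bucket/publish.py"}},
    ],
}

client.create_workflow_template(
    parent=f"projects/{project_id}/regions/{region}", template=template
)
```

If any step fails, the workflow simply stops; there is no "on error, go to step X" hook to map an Oozie `<error>` transition onto.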

A good place to start would be a hybrid migration (this assumes your clusters are sparsely used): keep your Oozie workflows as-is, have Composer + Workflow Templates create a cluster with Oozie installed, use an initialization action to stage your Oozie XML files plus job jars/artifacts, and add a single Pig `sh` job to the workflow that triggers Oozie via its CLI.
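As a rough illustration of the Composer side of that hybrid setup, something like the DAG below could work. This is only a sketch: the operator import paths depend on your Composer/Airflow version, and the project, region, bucket paths, init-action script, and `job.properties` location are placeholders you would replace with your own.

```python
# Hypothetical Composer (Airflow) DAG for the hybrid approach: create a
# cluster whose init action installs Oozie and stages the Oozie XML +
# jars, then use a Pig "sh" job to kick off the Oozie workflow via CLI.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-project"          # placeholder
REGION = "us-central1"             # placeholder
CLUSTER_NAME = "oozie-ephemeral"   # placeholder

CLUSTER_CONFIG = {
    # Hypothetical script that installs Oozie and copies the Oozie XML
    # definitions + job jars from GCS onto the cluster.
    "initialization_actions": [
        {"executable_file": "gs://my-bucket/init/install-oozie.sh"}
    ],
    # Machine types, disk sizes, etc. omitted for brevity.
}

# A Pig job whose only statement shells out to the Oozie CLI.
TRIGGER_OOZIE_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pig_job": {
        "query_list": {
            "queries": [
                "sh oozie job -oozie http://localhost:11000/oozie "
                "-config /tmp/oozie/job.properties -run"
            ]
        }
    },
}

with DAG("oozie_hybrid_migration", start_date=datetime(2019, 12, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )
    trigger_oozie = DataprocSubmitJobOperator(
        task_id="trigger_oozie",
        project_id=PROJECT_ID,
        region=REGION,
        job=TRIGGER_OOZIE_JOB,
    )
    create >> trigger_oozie
```

One caveat: `oozie job ... -run` only submits the workflow and returns, so if you also want Composer to tear the ephemeral cluster down you would need some way to wait for the Oozie run to actually finish first.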

tix
  • Does Google recommend installing Oozie on Dataproc for handling workflow challenges? We are looking for a lift-and-shift kind of approach for the migration. These are high-frequency, complicated workflows. Also, we would need to have the clusters up and running 24×7 due to business needs. – Balajee Venkatesh Dec 02 '19 at 17:42
  • Could you explain `workflow challenges` a bit? If your goal is lift and shift (from DC to cloud) then #3 is a no-brainer. My only advice would be to examine ways you can avoid long-running persistent clusters (you won't be getting OS or OSS updates, for instance). Eventually you will have to upgrade the Dataproc image version, which will be very painful without automation in place. – tix Dec 02 '19 at 20:26
  • Sometimes we see a very high rate of resource consumption while running our existing Oozie workflows, which slows our processing down. We also see lots of jobs stuck for a long time, which forces us to kill those jobs and rerun them. We just want to make sure this doesn't repeat when we migrate to the cloud. Dataproc's **auto-scaling** feature is generally available now; I believe we can create an auto-scaling cluster so that, when it is heavily utilised by Oozie workflows, it scales up automatically to accommodate the running jobs (see the autoscaling sketch after these comments). Please let me know your thoughts. – Balajee Venkatesh Dec 03 '19 at 05:21
  • This makes sense. – tix Dec 03 '19 at 16:12
  • There's also an Oozie-to-Airflow converter: https://cdn.oreillystatic.com/en/assets/1/event/292/Migrating%20Apache%20Oozie%20workflows%20to%20Apache%20Airflow%20Presentation.pdf – tix Dec 04 '19 at 00:23
  • Thank you so much for all your suggestions and references. – Balajee Venkatesh Dec 04 '19 at 04:05
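On the autoscaling point raised in the comments, here is a minimal sketch of an autoscaling policy using the `google-cloud-dataproc` Python client; the project, region, policy name, timeouts, and instance counts are placeholders, not recommendations. The policy is then referenced from the cluster's `autoscaling_config.policy_uri` when the cluster is created.

```python
# Minimal sketch of an autoscaling policy for an Oozie-hosting cluster.
# Project, region, policy name, and instance counts are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = {
    "id": "oozie-autoscaling",
    "basic_algorithm": {
        "yarn_config": {
            # Wait up to an hour for running containers before removing a node.
            "graceful_decommission_timeout": {"seconds": 3600},
            "scale_up_factor": 1.0,    # add all of the capacity YARN asks for
            "scale_down_factor": 0.5,  # release idle capacity more conservatively
        }
    },
    "worker_config": {"min_instances": 2, "max_instances": 10},
    "secondary_worker_config": {"min_instances": 0, "max_instances": 20},
}

client.create_autoscaling_policy(
    parent=f"projects/{project_id}/regions/{region}", policy=policy
)
```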