
I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as CSV and JSON files. The directory structure of my Google Cloud Storage bucket will look something like this:

root_dir

  • web_logs
      • \input (subdirectory)
      • \output (subdirectory)
  • network_logs (same subdirectories as web_logs)
  • system_logs (same subdirectories as web_logs)

The directory structure under the \input directories is arbitrary. Spark jobs pick up all of their data from the \input directory and place it in the \output directory. There is an arbitrary number of *_logs directories.
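For concreteness, a single job in this scheme might look like the following minimal PySpark sketch. The paths mirror the layout above, but the plain-text read, the parsing step, and the Parquet output format are all assumptions, since the actual wrangling logic isn't specified:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangle-web-logs").getOrCreate()

# Read every file under the arbitrary \input tree; recursiveFileLookup
# (Spark 3.x) walks subdirectories of any depth.
raw = (spark.read
       .option("recursiveFileLookup", "true")
       .text("gs://root_dir/web_logs/input/"))

# ... parse/clean the log lines here ...

# Write the wrangled result to the matching \output directory.
raw.write.mode("overwrite").parquet("gs://root_dir/web_logs/output/")
```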

My current plan is to split the entire wrangling task into about 2000 jobs and use the Cloud Dataproc API to spin up a cluster, run the job, and shut the cluster down. Another option would be to create a smaller number of very large clusters and just send jobs to those larger clusters instead.
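A rough sketch of the first approach with the google-cloud-dataproc Python client follows; the project ID, region, machine types, worker counts, and driver script URI are all placeholders:

```python
from google.cloud import dataproc_v1

project_id = "my-project"         # placeholder
region = "us-central1"            # placeholder
cluster_name = "wrangle-job-001"  # one ephemeral cluster per job
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up a small ephemeral cluster and wait for it to be ready.
cluster_client.create_cluster(request={
    "project_id": project_id,
    "region": region,
    "cluster": {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-4"},
        },
    },
}).result()

# 2. Submit the wrangling job and block until it finishes.
job_client.submit_job_as_operation(request={
    "project_id": project_id,
    "region": region,
    "job": {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": "gs://root_dir/jobs/wrangle.py"},
    },
}).result()

# 3. Tear the cluster down.
cluster_client.delete_cluster(request={
    "project_id": project_id,
    "region": region,
    "cluster_name": cluster_name,
}).result()
```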

The first approach is being considered because each individual job takes about an hour to complete, and simply waiting for one job to finish before starting the next would take too much time.

My questions are: 1) besides the cluster startup costs, are there any downsides to taking the first approach? and 2) is there a better alternative?

Thanks so much in advance!

Mike Malloy

2 Answers


Besides startup overhead, the other main consideration when using single-use clusters per job is that some jobs are more prone to "stragglers", where data skew leaves a small number of tasks running much longer than the rest, so the cluster sits underutilized near the end of the job. In some cases this can be mitigated by explicitly downscaling, with the help of graceful decommissioning, but if a job is shaped such that many "map" partitions produce shuffle output across all the nodes while there are "reduce" stragglers, then you can't safely downscale nodes that are still responsible for serving shuffle data.
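For reference, a downscale with graceful decommissioning can be requested through the same Dataproc API; this is a minimal sketch with placeholder project, region, cluster name, and timeout values:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Shrink the cluster to 2 workers, giving running tasks up to 10 minutes
# to drain before their nodes are removed.
cluster_client.update_cluster(request={
    "project_id": "my-project",         # placeholder
    "region": region,
    "cluster_name": "wrangle-job-001",  # placeholder
    "cluster": {"config": {"worker_config": {"num_instances": 2}}},
    "update_mask": {"paths": ["config.worker_config.num_instances"]},
    "graceful_decommission_timeout": {"seconds": 600},
}).result()
```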

That said, in many cases simply tuning the size/number of partitions so the work occurs in several "waves" (i.e., if you have 100 cores working, carve the work into something like 1,000 to 10,000 partitions) helps mitigate the straggler problem even in the presence of data skew, and the remaining downside is on par with startup overhead.
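As a hedged illustration of the "waves" idea (the core and wave counts are assumptions, not recommendations, and a SparkSession named `spark` is assumed to exist):

```python
# With ~100 executor cores, carving the work into 10-100 tasks per core
# lands in the 1,000-10,000 partition range suggested above.
total_cores = 100   # assumption: cluster-wide executor cores
waves = 20          # assumption: tasks per core
num_partitions = total_cores * waves  # 2,000 partitions

df = spark.read.parquet("gs://root_dir/web_logs/output/")  # placeholder input
df = df.repartition(num_partitions)

# For DataFrame shuffles (joins/aggregations), the equivalent knob is:
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```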

Despite the overhead of startup and stragglers, though, the pros of using a fresh ephemeral cluster per job usually vastly outweigh the cons; maintaining perfect utilization of a large shared cluster isn't easy either, and the benefits of ephemeral clusters include vastly improved agility and scalability: you can adopt new software versions, switch regions, switch machine types, and incorporate brand-new hardware features (like GPUs) as they become needed. Here's a blog post by Thumbtack discussing the benefits of such "job-scoped clusters" on Dataproc.

A slightly different architecture, if your jobs are very short (i.e., each one runs for only a couple of minutes, which amplifies the downside of startup overhead) or the straggler problem proves unsolvable, is to use "pools" of clusters. This blog post touches on using "labels" to easily maintain pools of larger clusters, where you still tear down and recreate clusters regularly to keep the agility of version upgrades, new hardware adoption, etc.
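A hedged sketch of that pool idea, assuming the clusters were created with a label such as pool=wrangling (the label name and the random scheduling policy are illustrative only):

```python
import random
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

# Find the active members of the pool by label.
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
pool = list(cluster_client.list_clusters(request={
    "project_id": project_id,
    "region": region,
    "filter": "status.state = ACTIVE AND labels.pool = wrangling",
}))

# Naive policy: send the job to a random pool member.
target = random.choice(pool).cluster_name

job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job_client.submit_job(request={
    "project_id": project_id,
    "region": region,
    "job": {
        "placement": {"cluster_name": target},
        "pyspark_job": {"main_python_file_uri": "gs://root_dir/jobs/wrangle.py"},
    },
})
```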

Dennis Huo
How can you monitor (for example with Prometheus) the clusters when you spin them up on demand? (When you use a pull monitoring approach.) – idan ahal Jul 05 '22 at 08:46

You might want to explore my solution for Autoscaling Google Dataproc Clusters; the source code can be found here.

avivl