I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as CSV and JSON files. The directory structure of my Google Cloud Storage bucket will look something like this:
root_dir
- web_logs
  - input/ (subdirectory)
  - output/ (subdirectory)
- network_logs (same subdirectories as web_logs)
- system_logs (same subdirectories as web_logs)
The directory structure under the input/ directories is arbitrary. Each Spark job picks up all of its data from an input/ directory and writes its results to the corresponding output/ directory. There is an arbitrary number of *_logs directories.
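For concreteness, here is a minimal sketch of what one of these jobs might look like in PySpark. The bucket name, paths, and app name are placeholders, and it assumes Spark 3.x (for the `recursiveFileLookup` option, since the nesting under input/ is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangle-web-logs").getOrCreate()

# Read everything under this log type's input/ prefix, however deeply nested.
raw = (spark.read
       .option("recursiveFileLookup", "true")
       .text("gs://my-bucket/root_dir/web_logs/input/"))

# ... parsing / cleaning transformations would go here ...
parsed = raw  # placeholder for the actual wrangling logic

# Write the result to the matching output/ prefix.
parsed.write.mode("overwrite").parquet("gs://my-bucket/root_dir/web_logs/output/")
```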
My current plan is to split the entire wrangling task into roughly 2,000 jobs and use the Cloud Dataproc API to spin up a cluster for each job, run the job, and tear the cluster down. Another option would be to create a smaller number of very large clusters and just send the jobs to those instead.
I'm leaning toward the first approach because each individual job takes about an hour to complete, and simply waiting for one job to finish before starting the next would take far too long.
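To show what I mean by the first approach, here is a rough sketch of the create / submit / delete cycle for a single job, assuming the google-cloud-dataproc Python client (v2.x); the project ID, region, machine types, and file URIs are all placeholders:

```python
from google.cloud import dataproc_v1

project_id = "my-project"          # placeholder
region = "us-central1"             # placeholder
cluster_name = "wrangle-web-logs"  # placeholder

endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up a small, short-lived cluster for this one job.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the PySpark job and block until it finishes.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/wrangle_web_logs.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Tear the cluster down again.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```

The idea would be to run many of these cycles in parallel, one per job, rather than queueing jobs on a single cluster.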
My questions are: 1) besides the cluster startup costs, are there any downsides to the first approach? and 2) is there a better alternative?
Thanks so much in advance!