I am using Databricks for a specific workload. The workload involves roughly 10 to 200 dataframes that are read from and written to a storage location, and it benefits from parallelism.
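To give an idea of the shape of the workload, here is a minimal sketch of the per-dataframe work (the paths, the process_one body, and the thread-pool fan-out are illustrative placeholders, not my actual code):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder list of source paths; the real job only discovers its 10-200 tables at runtime.
    table_paths = [f"s3://my-bucket/raw/table_{i}" for i in range(50)]

    def process_one(path):
        # Read one dataframe, apply some transformations (omitted), write it back out.
        df = spark.read.format("delta").load(path)
        df.write.format("delta").mode("overwrite").save(path.replace("/raw/", "/clean/"))

    # Fan the per-dataframe jobs out from the driver so Spark can run several of them at once.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(process_one, table_paths))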
My constraint is cost optimization. Instances are billed for a minimum of 1 hour, so if a workload finishes in less than 1 hour I am losing money. Also, every job must complete in under 1 hour.
The Databricks answer to cost optimization is autoscaling, but here is what happens (assume a constant instance type):
If I don't use autoscaling and use only 1 worker, a 50-dataframe job takes 30 minutes, but a 200-dataframe job takes 2 hours. 2 hours is not acceptable, so it makes sense to increase the number of workers. If I increase the worker count to 3, the 200-dataframe job now takes 45 minutes, but the 50-dataframe job takes only 12 minutes to run. That is a problem because instances are billed for a minimum of 1 hour, so I am losing a lot of money on those 50-dataframe jobs.
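To put numbers on it, a rough comparison of billed vs. used instance hours under the 1-hour minimum (the hourly rate is a made-up placeholder; 4 nodes = 1 driver + 3 workers):

    # Billed vs. actually-used instance hours under 1-hour minimum billing.
    price_per_hour = 1.0   # placeholder per-instance hourly rate
    nodes = 4              # 1 driver + 3 workers

    def billed_hours(runtime_min):
        # Every instance is billed for at least one full hour.
        return max(runtime_min / 60.0, 1.0)

    for runtime_min in (12, 45):   # 50-dataframe job vs. 200-dataframe job
        billed = nodes * billed_hours(runtime_min) * price_per_hour
        used = nodes * (runtime_min / 60.0) * price_per_hour
        print(f"{runtime_min} min job: billed {billed:.2f}, used {used:.2f}")

For the 12-minute job that is 4.00 billed against 0.80 actually used, so most of the spend on the small jobs is wasted.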
To overcome that, the usual advice is autoscaling. But what happens is: when a 50-dataframe job starts, after about 5 minutes Databricks autoscales the cluster from 1 worker (I set the minimum workers to 1) to 5 workers (Databricks scales up in steps of 4 workers), and the job then finishes in under 15 minutes, so again I am losing money. This works like a charm for the larger workloads, but most of the jobs are small. And the 1-hour limit is a hard limit, so a job must never run longer than 1 hour.
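For reference, the job cluster is configured along these lines (a sketch; the node type and runtime version are placeholders, and the 1-to-5 jump is behaviour I observe, not something I configure):

    # Sketch of the job cluster spec (Databricks Jobs API style, as a Python dict).
    new_cluster = {
        "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
        "node_type_id": "i3.xlarge",           # placeholder instance type
        "autoscale": {
            "min_workers": 1,   # the job starts on a single worker
            "max_workers": 5,   # Databricks jumps from 1 to 5 once it decides to scale up
        },
    }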
Any thoughts on how to overcome this?
Here are some things I have tried or searched for:
Setting the number of instances beforehand -> Not possible, because the size of the workload is only known after the job starts.
Manipulating the number of executors and cores per executor using
.config("spark.executor.cores", "1").config("spark.executor.instances", "1")
-> Didn't work (the full attempt is written out after this list).
Controlling autoscaling from the driver code -> Not possible on Databricks.
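For completeness, the executor-config attempt from the list above, written out as the full session setup I tried (a sketch; on Databricks the cluster and session already exist when the job runs, which is presumably why these settings had no effect):

    from pyspark.sql import SparkSession

    # On Databricks the cluster is already provisioned, so setting executor cores/instances
    # here does not resize the existing executors; getOrCreate() returns the running session.
    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "1")
        .config("spark.executor.instances", "1")
        .getOrCreate()
    )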
P.S. -> I am scheduling the job in Databricks and the driver uses PySpark to run the workload.