
I have a workflow in Databricks called "score-customer", which I can run with a parameter called "--start_date". I want to run it for each date this month, so I manually create 30 runs, passing a different date parameter to each run. However, after 5 concurrent runs, the rest of the runs fail with:

Unexpected failure while waiting for the cluster (1128-195616-z656sbvv) to be ready.

I want my runs to wait for the cluster to become available instead of failing. How can this be achieved?
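For context, this is effectively what I am doing to launch the runs, expressed as a script against the Jobs REST API (a simplified sketch; the workspace URL, token, and job ID are placeholders, and I am assuming a Python task that takes --start_date on the command line — for a notebook task, notebook_params would be used instead):

```python
import requests
from datetime import date, timedelta

HOST = "https://<my-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                      # placeholder token
JOB_ID = 123                                           # placeholder job ID of "score-customer"

headers = {"Authorization": f"Bearer {TOKEN}"}

# One run per day of the month, each with its own --start_date value.
first_day = date(2022, 11, 1)
for offset in range(30):
    run_date = first_day + timedelta(days=offset)
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers=headers,
        json={
            "job_id": JOB_ID,
            "python_params": ["--start_date", run_date.isoformat()],
        },
    )
    resp.raise_for_status()
    print(run_date, "-> run_id", resp.json()["run_id"])
```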

Average_guy
  • Not sure if it is related, but are all 30 runs made at the same time? Are they all using the same cluster? Is the number of concurrent runs set to match the number of runs, say 30? Do they all run in sequence? Would logic that waits for the cluster, or simply a sleep command, work out? – rainingdistros Nov 30 '22 at 04:31
  • All 30 runs are made right after each other. They all use the same deployment file, in which I have defined a static cluster. On the job page, I have specified the maximum concurrent runs to be 30. Further investigation of the error message tells me my subnet "does not have enough capacity for 1 IP addresses", so after a certain number of runs I reach a limit, which results in the rest of the runs failing. Perhaps I could use a sleep command (see the sketch after these comments), but I would like to keep the production code as is. – Average_guy Nov 30 '22 at 10:29
  • Maybe I should rephrase my question to "What is the best practice for creating multiple jobs with the objective of backfilling data?" Do I need to change the production code to, e.g., take a list of dates instead of a single date, or does Databricks have some smart command/UI for this purpose? – Average_guy Nov 30 '22 at 10:29
  • Thank you for your response. Considering that you are running all the jobs one after the other, and that you are hitting an IP limit, can I assume that you are using job clusters? Would it be possible for all 30 jobs to use the same job cluster? Job cluster reuse was introduced recently, as far as I know. Alternatively, your idea of taking the dates as a list and looping through them should also work; either way you will not hit the IP limit. – rainingdistros Dec 01 '22 at 05:20
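A rough sketch of the throttling idea discussed in the comments, which leaves the production code untouched: submit the runs from a small driver script that polls the Jobs runs/get endpoint and only launches the next run once fewer than an assumed limit of concurrent runs are still active. The workspace URL, token, job ID, and the limit of 5 are placeholders/assumptions.

```python
import time
import requests
from datetime import date, timedelta

HOST = "https://<my-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                      # placeholder token
JOB_ID = 123                                           # placeholder job ID of "score-customer"
MAX_IN_FLIGHT = 5                                      # assumed number of runs the subnet/cluster can serve at once

headers = {"Authorization": f"Bearer {TOKEN}"}
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def active_runs(run_ids):
    """Count submitted runs that have not yet reached a terminal life-cycle state."""
    count = 0
    for run_id in run_ids:
        resp = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get",
            headers=headers,
            params={"run_id": run_id},
        )
        resp.raise_for_status()
        if resp.json()["state"]["life_cycle_state"] not in TERMINAL_STATES:
            count += 1
    return count

submitted = []
first_day = date(2022, 11, 1)
for offset in range(30):
    # Wait for capacity instead of letting the new run fail on cluster creation.
    while active_runs(submitted) >= MAX_IN_FLIGHT:
        time.sleep(60)
    run_date = first_day + timedelta(days=offset)
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers=headers,
        json={"job_id": JOB_ID, "python_params": ["--start_date", run_date.isoformat()]},
    )
    resp.raise_for_status()
    submitted.append(resp.json()["run_id"])
```

The other option raised in the comments, changing the production code to accept a list of dates and loop over them in a single run, avoids the extra clusters entirely.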

0 Answers