
We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package, even on a node that has been idling in a Pool, still takes 90 seconds.

Some of these jobs are very long-running, so we would like to use Jobs Compute clusters for the lower cost in DBUs.

Some of these jobs are much shorter-running (<10 seconds), where the 90-second install time is far more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs, but we would like to avoid the extra cost of All-Purpose Compute if possible.

Reading the Databricks documentation suggests that the idle instances in the Pool are reserved for us but are not costing us DBUs. Is there a way for us to pre-install the required libraries on our idle instances so that when a job comes through we can start processing it immediately?

Is there an alternate approach that can fulfill a similar use case?

WarSame
  • Please help me understand your use case a little more. 1. Why are you trying to install packages at the node level rather than the cluster level? 2. Why can't you use notebook-level packages? Ref: https://docs.databricks.com/libraries/notebooks-python-libraries.html – Karthikeyan Rasipalay Durairaj Dec 05 '21 at 04:05
  • Thanks for your reply. 1. Installing them at the cluster level is fine as well; I just want the libraries to be installed before jobs run on the nodes, so whichever approach achieves that works for me. I didn't think the cluster level applied here, since these are nodes for a Job from a Pool. 2. I don't think notebook-level packages apply, because these are jobs, not notebooks. It also seems that install happens when the code runs rather than when the node initializes. – WarSame Dec 05 '21 at 04:10

2 Answers


You can't install libraries directly onto nodes from a pool, because the actual code is executed in a Docker container corresponding to the Databricks Runtime. There are several ways to speed up installation of the libraries:

  • Create your own Docker image with all necessary libraries pre-installed, and preload both the Databricks Runtime version and your Docker image onto the pool. This part can't be done via the UI, so you need to use the REST API (see the description of the preloaded_docker_images attribute; a sketch of such a call follows this list), databricks-cli, or the Databricks Terraform provider. The main disadvantage of custom Docker images is that some functionality isn't available out of the box, for example arbitrary files in Repos, the web terminal, etc. (I don't remember the full list).
  • Put all necessary libraries and their dependencies onto DBFS and install them via a cluster init script (a sketch of such a script also follows this list). It's very important to collect binary dependencies, not just source packages, so they won't need to be compiled during installation. This collection only has to be done once:
    • for Python this can be done with pip download --prefer-binary lib1 lib2 ...
    • for Java/Scala you can use mvn dependency:get -Dartifact=<maven_coordinates>, which downloads the dependencies into the ~/.m2/repository folder; from there you can copy the jars to DBFS and, in the init script, run cp /dbfs/.../jars/* /databricks/jars/
    • for R it's slightly more complicated, but still doable
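To make the pool option more concrete, here is a rough sketch of creating such a pool via the Instance Pools REST API. The workspace host, node type, DBR version, image URL, and credentials are placeholders, so check the API reference for the exact fields available on your platform.

    # Sketch only: create an instance pool with a preloaded DBR version and a
    # custom Docker image, so nodes taken from the pool don't have to pull them.
    # All concrete values (host, node type, DBR version, image, credentials) are placeholders.
    curl -X POST https://<databricks-instance>/api/2.0/instance-pools/create \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "instance_pool_name": "prewarmed-pool",
        "node_type_id": "i3.xlarge",
        "min_idle_instances": 2,
        "preloaded_spark_versions": ["9.1.x-scala2.12"],
        "preloaded_docker_images": [
          {
            "url": "<registry>/<repo>/custom-dbr:latest",
            "basic_auth": {"username": "<user>", "password": "<token>"}
          }
        ]
      }'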
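For the init script option, a minimal sketch is below. It assumes the wheels collected with pip download have been uploaded to a hypothetical DBFS folder (/dbfs/init/wheels); the folder, the package name, and the pip path are placeholders that may differ on your runtime.

    #!/bin/bash
    # Sketch of a cluster init script: install pre-downloaded binary wheels from DBFS
    # so nothing is fetched from PyPI or compiled at cluster start.
    # /dbfs/init/wheels and my_company_package are hypothetical; upload the output of
    # `pip download --prefer-binary ...` to that location beforehand.
    set -euo pipefail

    WHEEL_DIR=/dbfs/init/wheels

    # --no-index makes pip resolve everything from the local wheel directory only.
    /databricks/python/bin/pip install --no-index --find-links="$WHEEL_DIR" my_company_package

    # For Java/Scala dependencies: copy pre-fetched jars into the Databricks classpath.
    if [ -d /dbfs/init/jars ]; then
      cp /dbfs/init/jars/*.jar /databricks/jars/
    fi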
Alex Ott
  • For both of these options, could you verify for me that they are done when the instance starts up (i.e. enters the Idle state) rather than when a job first starts running on these instances? I had read a little about this in the documentation but couldn't find anything saying it one way or the other. – WarSame Dec 05 '21 at 16:15
  • The init script & loading of the container happen when the node goes from the idle to the running state. Nothing happens in the idle state. There is always an overhead for loading the Docker image, the DBR into it, the init script, etc. By preloading the DBR & Docker image you can make this period shorter – Alex Ott Dec 05 '21 at 16:53
  • Thanks for your response, Alex. It's interesting that it happens when moving from Idle to Running; I would have thought Docker would run when an instance first starts up. Could you give an example of the improvement in loading time? I'm trying to determine whether this will be worth the development effort – WarSame Dec 05 '21 at 17:18
  • Basically, when you use instance pools, cluster startup looks as follows: 1. get a node from the pool; 2. load the Docker image; 3. put the DBR into it; 4. execute the init script; 5. start the Spark processes. By preloading Docker & DBR you shorten steps 2 & 3. By using binary packages you shorten step 4. The real improvement depends on the number of libraries, etc., but you can measure it. – Alex Ott Dec 05 '21 at 17:22
  • I've seen that for really short tasks, some customers use interactive clusters: they pay more per DBU, but get really fast execution because nothing needs to be initialized. But there could be problems if tasks use different versions of libraries, plus potential problems from other processes running simultaneously – Alex Ott Dec 05 '21 at 17:23
  • But I really recommend checking whether you have a solution architect assigned to your account and discussing this topic with that person. – Alex Ott Dec 05 '21 at 17:24
  • Thanks for the replies! It's unfortunate that there's no way to do exactly what I hoped, but it's good to have the options laid out. We have an upcoming call with a Solution Architect and will ask for their opinion there. – WarSame Dec 06 '21 at 14:17

In addition to Alex Ott's solution, if you are using a Terraform-based setup you can keep a requirements.txt file listing all required Python libraries. If any Maven/Java libraries need to be installed on the cluster, you can add them as a list variable in your Terraform code. Then use code like the following:

    dynamic "library" {
    for_each = toset(split("\n", file("./requirements.txt")))

    content {
      pypi {
        package = library.value
        repo    = "if_there_is_any_repo"
      }
    }
  }
Saikat