
I am attempting to create a PySpark job via the Databricks UI (with spark-submit), using the spark-submit parameters below (the dependencies are in the PEX file), but I am getting an exception that the PEX file does not exist. It's my understanding that the --files option places the file in the working directory of the driver and of every executor, so I am confused as to why I am encountering this issue.

Config

[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]
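
For clarity, spark-submit should parse that parameter list into options, the application file, and application arguments as follows (a quick stdlib sketch I used to sanity-check the argument order; `parse_params` is my own helper, not a Spark API):

```python
def parse_params(params):
    """Split a spark-submit parameter list into options, the application
    file, and application arguments (the first non-option token is the app)."""
    opts, i = {}, 0
    while i < len(params) and params[i].startswith("--"):
        opts.setdefault(params[i], []).append(params[i + 1])
        i += 2
    return opts, params[i], params[i + 1:]

params = [
    "--files", "s3://some_path/my_pex.pex",
    "--conf", "spark.pyspark.python=./my_pex.pex",
    "s3://some_path/main.py",
    "--some_arg", "2022-08-01",
]
opts, app, app_args = parse_params(params)
print(opts["--files"])  # → ['s3://some_path/my_pex.pex']
print(app)              # → s3://some_path/main.py
print(app_args)         # → ['--some_arg', '2022-08-01']
```

So the PEX should be shipped via --files before main.py runs, and `--some_arg` is passed through to the application, which matches my intent.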

Standard Error

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 14 more

What I have tried

Given that the PEX file doesn't seem to be visible, I have tried making it available in the following ways:

  • Adding the PEX via the --files option in Spark submit
  • Adding the PEX via the spark.files config when starting up the actual cluster
  • Putting the PEX in DBFS (as opposed to s3)
  • Playing around with the configs (e.g. using spark.pyspark.driver.python instead of spark.pyspark.python)

Note: given the instructions at the bottom of the following page, I believe PEX should work on Databricks; I'm just not sure of the right configs: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html

Note also that the following spark-submit command works on AWS EMR:

'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                "spark-submit",
                "--deploy-mode", "cluster", 
                "--master", "yarn",
                "--files", "s3://some_path/my_pex.pex", 
                "--conf", "spark.pyspark.driver.python=./my_pex.pex",
                "--conf", "spark.executorEnv.PEX_ROOT=./tmp",
                "--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
                "s3://some_path/main.py",
                "--some_arg", "some-val"
            ]
        }
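
For comparison, here is roughly how I would expect those EMR parameters to translate into the Databricks UI parameter list (one of the variants covered by "playing around with the configs" above; I kept the executor PEX_ROOT conf, but dropped spark.yarn.appMasterEnv.* since that presumably has no effect outside YARN):

```json
[
  "--files", "s3://some_path/my_pex.pex",
  "--conf", "spark.pyspark.python=./my_pex.pex",
  "--conf", "spark.executorEnv.PEX_ROOT=./tmp",
  "s3://some_path/main.py",
  "--some_arg", "2022-08-01"
]
```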

Any help would be much appreciated, thanks.

  • did you try copying the pex file to the driver with an init script first? I don't think you can reference S3 paths in --files, and if you haven't copied the pex file to the driver on init then `"--conf","spark.pyspark.python=./my_pex.pex",` is going to be pointing to a non-existent file too – zyd Apr 07 '23 at 16:12
  • @zyd the code sample (with --files having an s3 path) works with EMR, so I'm assuming it should work in Databricks as well – r_g_s_ Sep 03 '23 at 01:25

0 Answers