
I am attempting to create a PySpark job via the Databricks UI (with spark-submit), using the spark-submit parameters below (the dependencies are in the PEX file), but I am getting an exception that the PEX file does not exist. It's my understanding that the --files option places the file in the working directory of the driver and of every executor, so I am confused as to why I am encountering this issue.

Config

[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]
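
For clarity, spark-submit should parse that parameter list into options, the application file, and application arguments as follows (a quick stdlib sketch I used to sanity-check the argument order; `parse_params` is my own helper, not a Spark API):

```python
def parse_params(params):
    """Split a spark-submit parameter list into options, the application
    file, and application arguments (the first non-option token is the app)."""
    opts, i = {}, 0
    while i < len(params) and params[i].startswith("--"):
        opts.setdefault(params[i], []).append(params[i + 1])
        i += 2
    return opts, params[i], params[i + 1:]

params = [
    "--files", "s3://some_path/my_pex.pex",
    "--conf", "spark.pyspark.python=./my_pex.pex",
    "s3://some_path/main.py",
    "--some_arg", "2022-08-01",
]
opts, app, app_args = parse_params(params)
print(opts["--files"])  # → ['s3://some_path/my_pex.pex']
print(app)              # → s3://some_path/main.py
print(app_args)         # → ['--some_arg', '2022-08-01']
```

So the PEX should be shipped via --files before main.py runs, and `--some_arg` is passed through to the application, which matches my intent.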

Standard Error

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 14 more

What I have tried

Given that the PEX file doesn't seem to be visible, I have tried making it available in the following ways:

  • Adding the PEX via the --files option in Spark submit
  • Adding the PEX via the spark.files config when starting up the actual cluster
  • Putting the PEX in DBFS (as opposed to s3)
  • Playing around with the configs (e.g. using spark.pyspark.driver.python instead of spark.pyspark.python)

Note: given the instructions at the bottom of the following page, I believe PEX should work on Databricks; I'm just not sure of the right configs: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html

Note also that the following spark-submit command works on AWS EMR:

'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                "spark-submit",
                "--deploy-mode", "cluster", 
                "--master", "yarn",
                "--files", "s3://some_path/my_pex.pex", 
                "--conf", "spark.pyspark.driver.python=./my_pex.pex",
                "--conf", "spark.executorEnv.PEX_ROOT=./tmp",
                "--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
                "s3://some_path/main.py",
                "--some_arg", "some-val"
            ]
        }
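
For comparison, here is roughly how I would expect those EMR parameters to translate into the Databricks UI parameter list (one of the variants covered by "playing around with the configs" above; I kept the executor PEX_ROOT conf, but dropped spark.yarn.appMasterEnv.* since that presumably has no effect outside YARN):

```json
[
  "--files", "s3://some_path/my_pex.pex",
  "--conf", "spark.pyspark.python=./my_pex.pex",
  "--conf", "spark.executorEnv.PEX_ROOT=./tmp",
  "s3://some_path/main.py",
  "--some_arg", "2022-08-01"
]
```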

Any help would be much appreciated, thanks.

  • did you try copying the pex file to the driver with an init script first? I don't think you can reference S3 paths in --files, and if you haven't copied the pex file to the driver on init then `"--conf","spark.pyspark.python=./my_pex.pex",` is going to be pointing to a non-existent file too – zyd Apr 07 '23 at 16:12
  • @zyd the code sample (with --files having an s3 path) works with EMR, so I'm assuming it should work in Databricks as well – r_g_s_ Sep 03 '23 at 01:25

0 Answers