
I'm trying to deploy Spark (PySpark) on Kubernetes using spark-submit, but I'm getting the following error:

Exception in thread "main" org.apache.spark.SparkException: Please specify spark.kubernetes.file.upload.path property.
    at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:330)
    at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:300)
    at

Since I'm packing my dependencies through a virtual environment, I don't need to point to a remote cluster to retrieve them, so I'm not setting the spark.kubernetes.file.upload.path parameter.
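
For reference, the pyspark_venv.tar.gz archive passed to --archives below is the kind of tarball venv-pack produces; a hedged sketch of producing it (assuming the venv-pack package is installed and the active virtual environment is the one to ship):

   # Hedged sketch: pack the active virtual environment into the archive
   # that --archives ships to the executors. Assumes `pip install venv-pack`.
   import venv_pack

   venv_pack.pack(output="pyspark_venv.tar.gz")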

I tried to include that parameter anyway with an empty value, but it doesn't work.

My spark-submit command (which I trigger from a Python script) is as follows:

   cmd = f""" {SPARK_HOME}/bin/spark-submit
                                --master {SPARK_MASTER}
                                --deploy-mode cluster
                                --name spark-policy-engine
                                --executor-memory {EXECUTOR_MEMORY} \
                                --conf spark.executor.instances={N_EXECUTORS} 
                                --conf spark.kubernetes.container.image={SPARK_IMAGE}
                                --conf spark.kubernetes.file.upload.path=''
                                --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 
                                --archives pyspark_venv.tar.gz#environment {spark_files}
                                --format_id {format_id}
                                """

As shown, I'm including the parameter within a --conf flag (as described in https://spark.apache.org/docs/3.0.0-preview/running-on-kubernetes.html), but whether it is present or not, it just doesn't work.


1 Answer


You need to specify a real path, not an empty string. Say your image has a tmp folder under /opt/spark; then the conf should be set like this:

--conf spark.kubernetes.file.upload.path='local:///opt/spark/tmp'

If you don't want to use the upload path at all, drop that conf and point spark-submit at a main application resource that already exists inside the image, using the local:// scheme, so nothing needs to be uploaded:

   cmd = f""" {SPARK_HOME}/bin/spark-submit
                                --master {SPARK_MASTER}
                                --deploy-mode cluster
                                --name spark-policy-engine
                                --executor-memory {EXECUTOR_MEMORY} \
                                --conf spark.executor.instances={N_EXECUTORS} 
                                --conf spark.kubernetes.container.image={SPARK_IMAGE}
                                --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 
                                --archives pyspark_venv.tar.gz#environment {spark_files}
                                --format_id {format_id}
                                local:///opt/spark/work-dir/xxx.jar
                                """
Abdennacer Lachiheb
  • It throws `Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "local"` – Rodrigo Alarcón Mar 01 '23 at 14:39
  • @RodrigoAlarcón this is another issue with your Hadoop configuration; check this answer: https://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file – Abdennacer Lachiheb Mar 01 '23 at 17:26
  • Did you try pointing to your local filesystem, e.g. --conf spark.kubernetes.file.upload.path='c:'? – Abdennacer Lachiheb Mar 01 '23 at 17:27
  • I'm running in a Docker container. Referencing the absolute path doesn't work either. What baffles me is that the documentation states about this parameter (assuming dependencies stored in S3): "The app jar file will be uploaded to the S3 and then when the driver is launched it will be downloaded to the driver pod and will be added to its classpath"... I'm using PySpark, so my dependencies are in a tarball. Should I provide an S3 host for my dependencies? Is there no other option? – Rodrigo Alarcón Mar 01 '23 at 19:37
  • Hi, author of this question: which version of Spark are you using? Is it 3.4.0? – user9920500 Jun 21 '23 at 14:35