Questions tagged [spark-submit]

spark-submit is a script used to launch Apache Spark applications written in, for example, Java, Scala, or Python.

More information about spark-submit can be found in the official documentation: https://spark.apache.org/docs/latest/submitting-applications.html
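A typical invocation bundles the application, its entry point, and the cluster manager URL into one command. A minimal sketch (the class name, JAR path, and master URL below are placeholders):

    # Submit a compiled application to a standalone cluster manager
    # --class: fully qualified main class; --master: cluster manager URL
    spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --executor-memory 2g \
      /path/to/my-app.jar arg1 arg2

For Python applications, the JAR is replaced by a .py file and --class is omitted.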

611 questions
4 votes · 4 answers

ClassNotFoundException when submitting JAR to Spark via spark-submit

I'm struggling to submit a JAR to Apache Spark using spark-submit. To make things easier, I've experimented using this blog post. The code is import org.apache.spark.SparkContext import org.apache.spark.SparkConf object SimpleScalaSpark { def…
dommer
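In cases like this, the usual first check is whether the JAR really contains the object passed to --class, under exactly that fully qualified name. A sketch, with a hypothetical JAR path:

    # List the JAR contents and look for the compiled object
    jar tf target/scala-2.12/simple-project.jar | grep SimpleScalaSpark
    # Submit using the name exactly as it appears (package included, no .class suffix)
    spark-submit --class SimpleScalaSpark --master "local[*]" \
      target/scala-2.12/simple-project.jar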
4 votes · 1 answer

Failed to delete temp files after running spark-submit on Windows 7

I'm using the code in this example to run a Scala program with Spark. The program executes fine, but when the StreamingContext tries to stop I get this error: java.io.IOException: Failed to delete:…
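This is a long-standing Spark-on-Windows quirk: the shutdown hook cannot delete temp directories that are still held open. A commonly suggested mitigation (not a real fix) is to give Spark a dedicated scratch directory and silence the shutdown-hook logger; a sketch with hypothetical paths:

    # Point Spark's scratch space at a short, writable path
    spark-submit \
      --conf "spark.local.dir=C:/tmp/spark" \
      --class streaming.Example \
      target/example.jar
    # Optionally, in the log4j configuration, hide the noise:
    # log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF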
3 votes · 0 answers

spark-submit sends wrong java path to driver

I'm submitting a job to a containerised Spark cluster running locally, Spark version 3.2.1, using Bitnami's Spark container images. The job is written in Scala and I've created a fat JAR. Now when I submit the JAR to the cluster (from my local,…
o_O
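When the submitting machine's Java path leaks into a containerised driver, one workaround (assuming a standalone cluster and the Bitnami image layout, both unverified here) is to pin JAVA_HOME inside the container via conf/spark-env.sh rather than relying on what spark-submit sends:

    # In the Spark container's conf/spark-env.sh (path is hypothetical)
    export JAVA_HOME=/opt/bitnami/java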
3 votes · 0 answers

Copy src code ZIP to Dataproc cluster from GCS in Spark-Submit

I am trying to run a Spark job on a Dataproc cluster in GCP. All my source code is zipped and stored in a GCS bucket, and the main Python file and additional JARs are in the same bucket. Now, when I try to do spark-submit,…
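For Dataproc, zipped sources and extra JARs stored in GCS can be attached directly at submit time; a sketch with hypothetical bucket paths:

    # Ship the zipped Python sources and extra JARs from GCS with the job
    gcloud dataproc jobs submit pyspark gs://my-bucket/main.py \
      --cluster=my-cluster \
      --region=us-central1 \
      --py-files=gs://my-bucket/src.zip \
      --jars=gs://my-bucket/extra-deps.jar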
3 votes · 0 answers

pyspark: ModuleNotFoundError: No module named 'app' because the PySpark serializer is not able to locate the Python package folder 'app'

I am reading a CSV into a DataFrame in PySpark using the code snippet below. Project structure: pyspark-debug: app __init__.py data-pipeline.py main.py .... data-pipeline.py: from pyspark.sql import SparkSession, DataFrame from…
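Errors like this typically mean the 'app' package never reached the executors. A common remedy is to zip the package directory itself (so the archive contains app/__init__.py, not loose files) and pass it via --py-files; a sketch assuming the project layout from the question:

    # Run from the project root so the zip contains the app/ directory
    zip -r app.zip app/
    spark-submit --py-files app.zip main.py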
3 votes · 1 answer

gcloud spark submit:Path does not exist: hdfs://cluster-xxxx-m/user/root/--;

I am trying to use gcloud to submit my Spark job from Airflow. This is my gcloud command: gcloud dataproc jobs submit spark --cluster=xxx --region=us-central1 --class=com.xxx --jars=gs://xxx/xxx/xxx.jar -- xxx -- xxx -- xxx -- gs://xxx/xxx/xxx I am…
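With gcloud dataproc jobs submit spark, everything after the first standalone -- is handed to the main class as arguments; repeating -- passes the literal token through, and Spark may then try to read '--' as a path, which matches the error in the title. A sketch using the question's own placeholders:

    # Use a single "--" separator; all following tokens are job arguments
    gcloud dataproc jobs submit spark \
      --cluster=xxx --region=us-central1 \
      --class=com.xxx --jars=gs://xxx/xxx/xxx.jar \
      -- xxx xxx xxx gs://xxx/xxx/xxx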
3 votes · 2 answers

How can I reach a spark cluster in a Docker container with spark-submit and a python script?

I've created a Spark cluster with one master and two slaves, each in a Docker container. I launch it with the command start-all.sh. I can reach the UI from my local machine at localhost:8080, and it shows me that the cluster launched correctly…
Colin Defever
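For a Dockerised standalone cluster, the master's RPC port (7077 by default) has to be published to the host before spark-submit can reach it; a sketch with hypothetical image and script names:

    # Publish both the UI and the master RPC port
    docker run -p 8080:8080 -p 7077:7077 my-spark-master
    # Then, from the host machine:
    spark-submit --master spark://localhost:7077 my_script.py

Note that in client mode the executors must also be able to reach the driver back, which may require setting spark.driver.host or running everything on one Docker network.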
3 votes · 0 answers

How to avoid showing some secret values in the Spark UI

I am passing some secret keys in a spark-submit command, and I am using the following to redact them: --conf 'spark.redaction.regex=secret_key'. Though it is working, the secret_key is visible in the Spark UI during job execution. The redaction takes place at the…
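spark.redaction.regex redacts matching Spark configuration properties in the environment UI and event logs, and the quoting has to keep the whole key=value pair as one --conf argument; a sketch (class and JAR are placeholders, and the pattern shown is Spark's documented default):

    spark-submit \
      --conf "spark.redaction.regex=(?i)secret|password|token|access[.]key" \
      --class com.example.Job app.jar

Values that appear outside Spark properties, e.g. in program arguments, are not covered by this setting.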
3 votes · 1 answer

Airflow: trigger Spark in different Docker container

I have both Airflow 2 (the official image) and Apache Spark running in a docker-compose pipeline. I would like to execute a DAG triggering a Spark script by means of the SparkSubmitOperator…
Requin
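SparkSubmitOperator resolves the cluster through an Airflow connection (the conn_id defaults to spark_default), so the usual sticking point is making that connection point at the Spark container's hostname on the shared docker-compose network; a sketch using the Airflow 2 CLI, with a hypothetical hostname:

    airflow connections add spark_default \
      --conn-type spark \
      --conn-host spark://spark-master \
      --conn-port 7077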
3 votes · 1 answer

Spark - Error: Failed to load class - spark-submit

I created an sbt project with IntelliJ and built the artifact into a JAR file. I put the JAR file on the server and submitted it, but I got this error: spark-submit --master spark://master:7077 --class streaming_process spark-jar/spark-streaming.jar Error: Failed to load…
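"Failed to load class" usually means the name given to --class does not match the compiled object, often because a package prefix is missing. One way to check, reusing the paths from the question:

    # Find the object's fully qualified name inside the JAR
    jar tf spark-jar/spark-streaming.jar | grep -i streaming_process
    # If it lives in a package (hypothetical example), use the full name:
    spark-submit --master spark://master:7077 \
      --class com.example.streaming_process \
      spark-jar/spark-streaming.jar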
3 votes · 0 answers

Spark giving a multiple-datasource error when saving a Parquet file

I am trying to learn Spark and Scala. When I try to write my result DataFrame to a Parquet file by calling the parquet method, I get an error. Code base that…
thickGlass
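"Multiple sources found" errors for a built-in format like Parquet usually indicate that the fat JAR bundles a second copy of Spark's datasource registrations. One way to inspect this, assuming an sbt-assembly style fat JAR with a hypothetical name:

    # Duplicate or shaded Spark entries in this service file are the usual culprit;
    # marking Spark dependencies as "provided" in the build avoids bundling them
    unzip -p target/my-assembly.jar \
      META-INF/services/org.apache.spark.sql.sources.DataSourceRegister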
3 votes · 0 answers

Spark on YARN: number of cores in an EMR cluster

I have an EMR cluster for Spark with the configuration below: 2 instances of r4.2xlarge with 8 vCores each, so my total is 16 vCores, and the same is reflected in the YARN vCores. I have submitted a Spark streaming job with parameters --num-executors 2 --executor-cores…
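A frequently cited explanation for this on EMR: YARN's capacity scheduler defaults to a memory-only resource calculator, so the vCores shown per container stay at 1 regardless of --executor-cores. Requesting cores and enabling vCore accounting would look roughly like this (the application name and property placement in capacity-scheduler.xml are assumptions):

    # Requesting executors and cores at submit time
    spark-submit --num-executors 2 --executor-cores 5 app.jar
    # In capacity-scheduler.xml, make YARN account for vCores as well as memory:
    # yarn.scheduler.capacity.resource-calculator=
    #   org.apache.hadoop.yarn.util.resource.DominantResourceCalculator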
3 votes · 1 answer

Spark-submit on Kubernetes: executor pods keep running even after the Spark job is finished, so resources are not freed for new jobs

We are submitting a Spark job to a Kubernetes cluster in cluster mode, with some extra memory configuration. My job finishes in about 5 minutes, but my executor pods are still running after 30-40 minutes. Because of this, new jobs are pending as…
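A submit along these lines (image, namespace, and class are placeholders) pairs with one common cause of lingering executors: the application never stops its SparkSession, so executor pods stay up until the driver pod itself terminates:

    spark-submit \
      --master k8s://https://kube-apiserver:6443 \
      --deploy-mode cluster \
      --conf spark.kubernetes.namespace=spark-jobs \
      --conf spark.kubernetes.container.image=my-spark:3.2.1 \
      --class com.example.Job local:///opt/app/job.jar
    # If executors outlive the job, verify the application calls
    # SparkSession.stop() (or SparkContext.stop()) when the work is done.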
3 votes · 0 answers

PySpark connecting to SQL Server using pyodbc fails in cluster mode (deploy-mode)

Question 1: I have this piece of code which works well when run in Spark deploy mode CLIENT, but it throws an exception when I run the same code in cluster mode. I went through most of the SO questions on this topic but didn't find a solution. I'm using…
Rohit Nimmala
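In cluster mode the driver runs on an arbitrary worker node, so pyodbc (and the native ODBC driver it wraps) must exist on every node, not just the submit host. One portable pattern is shipping a packed conda environment, sketched below with hypothetical names; the native msodbcsql driver still has to be installed on each node's OS:

    # conda-pack pattern for shipping Python dependencies with the job
    conda pack -n myenv -o environment.tar.gz
    spark-submit \
      --deploy-mode cluster \
      --archives environment.tar.gz#environment \
      --conf spark.pyspark.python=./environment/bin/python \
      job.py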
3 votes · 1 answer

How can I configure spark-submit (or DataProc) to download maven dependencies (jars) from GitHub packages?

I am trying to get spark-submit (via GCP DataProc) to download Maven dependencies from a GitHub Packages repository. Adding spark.jars.repositories=https://myuser:mytoken@maven.pkg.github.com/myorg/my-maven-packages-repo/ to the spark-submit command…
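The equivalent command-line flags look like this (the artifact coordinates are hypothetical; the repository URL with embedded credentials is the one from the question). Note that GitHub Packages requires authentication even for public artifacts:

    spark-submit \
      --repositories "https://myuser:mytoken@maven.pkg.github.com/myorg/my-maven-packages-repo/" \
      --packages "com.myorg:my-artifact:1.0.0" \
      app.jar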