Questions tagged [spark-submit]

spark-submit is the script used to launch apache-spark applications written in, e.g., Java, Scala, or Python.

More information about spark-submit can be found in the official Apache Spark documentation on submitting applications.

611 questions
1 vote, 1 answer

Spark-submit AWS EMR with anaconda installed python libraries

I launch an EMR cluster with boto3 from a separate ec2 instance and use a bootstrapping script that looks like this: #!/bin/bash ############################################################################ #For all nodes including master …
B_Miner
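
A bootstrap action is the usual place to put the extra libraries, since it runs on every node before the job does. A minimal sketch of such a script, assuming the cluster's Python 3 is used for PySpark and that pandas and boto3 stand in for whatever libraries the job actually needs:

    #!/bin/bash
    # Runs on every node (master and core) during cluster provisioning.
    # Package names below are placeholders for the libraries the job needs.
    sudo python3 -m pip install --upgrade pip
    sudo python3 -m pip install pandas boto3
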
1 vote, 1 answer

How to run spark-submit in virtualenv for pyspark?

Is there a way to run spark-submit (spark v2.3.2 from HDP 3.1.0) while in a virtualenv? Have situation where have python file that uses python3 (and some specific libs) in a virtualenv (to isolate lib versions from rest of system). I would like to…
lampShadesDrifter
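
One approach that generally works is to point PySpark at the virtualenv's interpreter before calling spark-submit. A minimal sketch, where ~/venvs/job-env and job.py are hypothetical names:

    # Use the virtualenv's python for the driver and the executors.
    export PYSPARK_PYTHON=~/venvs/job-env/bin/python
    export PYSPARK_DRIVER_PYTHON=~/venvs/job-env/bin/python
    spark-submit --master local[*] job.py

Running against YARN, the same interpreter path has to exist on every node, or the environment has to be shipped as an archive, so the plain exports above only cover the local case.
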
1 vote, 2 answers

Apache Airflow - Spark Submit Failing -When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment

I am new to Spark and Airflow and trying to create a DAG which runs spark submit jobs in pyspark. In my Ubuntu system I have a user created named 'hadoopusr' through which I manually run my spark submits. All the environment variables are setup in…
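
That error usually just means the process running spark-submit (here, the Airflow worker, not the interactive 'hadoopusr' shell) cannot see the Hadoop/YARN client configuration. A sketch of the environment spark-submit needs, assuming the configs live in /etc/hadoop/conf and my_job.py is a placeholder:

    # Point Spark at the YARN/HDFS client configuration.
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export YARN_CONF_DIR=/etc/hadoop/conf
    # The 'yarn-client' master string is deprecated; use --master yarn with a deploy mode.
    spark-submit --master yarn --deploy-mode client my_job.py

Because Airflow tasks do not source the interactive shell profile where those variables were exported, they have to be made visible to the Airflow process itself (for example through the worker's environment or the operator's env settings).
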
1 vote, 1 answer

Python: Pass a pandas Dataframe as an argument to subprocess

How to send a dataframe as an argument to a python script with spark-submit using subprocess? I have tried the below code but it did not work out, as we can't concatenate a string and an object. def spark_submit(self, test_cases, email): command =…
1 vote, 1 answer

Spark-submit failing to resolve --package dependency when behind a HTTP proxy

Below is my spark-submit command /usr/bin/spark-submit \ --class "" \ --master yarn \ --queue default \ --deploy-mode cluster \ --conf "spark.driver.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=
Abdul Rahman
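
The --packages coordinates are resolved by Ivy in the JVM that spark-submit itself starts, so with --deploy-mode cluster the proxy settings passed via spark.driver.extraJavaOptions arrive too late. One workaround is to hand the proxy to the launcher JVM instead; a sketch, with proxy.example.com:3128 as a placeholder for the real proxy:

    # The launcher JVM (which resolves --packages) needs the proxy settings.
    export SPARK_SUBMIT_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=3128 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=3128"
    # ...then run the existing spark-submit command unchanged.
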
1 vote, 0 answers

Using `spark-submit` to start a job in a single node standalone spark cluster

I have a single node spark cluster (4 cpu cores and 15GB of memory) configured with a single worker. I can access the web UI and see the worker node. However, I am having trouble submitting the jobs using spark-submit. I have couple of questions. I…
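
For a standalone cluster the master URL has to be the spark://host:port address shown at the top of the master's web UI rather than a local master. A minimal sketch, assuming the default ports and app.py as a placeholder application:

    # The spark:// URL is displayed on the master web UI (usually port 8080).
    spark-submit \
      --master spark://$(hostname -f):7077 \
      --executor-memory 8G \
      --total-executor-cores 4 \
      app.py
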
1 vote, 1 answer

spark2-submit throwing error with multiple packages (--packages)

I'm trying to submit following Spark2 job on CDH 5.16 cluster and it's only taking first parameter of --packages option and throwing error for second parameter spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1,…
Naga
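
--packages expects a single comma-separated argument with no whitespace between coordinates; a space after the comma makes the shell split it into two arguments, which matches the "only the first parameter is taken" symptom. A sketch, with the second coordinate as a placeholder:

    # One argument, comma-separated, no space after the comma.
    spark2-submit \
      --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-avro_2.11:4.0.0 \
      my_job.py
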
1 vote, 1 answer

How to increase AM container size in spark-submit command? ERROR: container is running beyond physical memory limits

I am trying to execute a Spark application on some data on AWS. I was able to process the whole data set using 20 m4.large machines on AWS. Now, I tried the same using c4.8xlarge machines but got the following error: AM Container for…
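
Which setting controls the AM container depends on the deploy mode, so here is a sketch of both cases (memory sizes are placeholders to be tuned to the instance type):

    # Cluster mode: the driver runs inside the AM, so size the driver container.
    spark-submit --master yarn --deploy-mode cluster \
      --driver-memory 8g \
      --conf spark.driver.memoryOverhead=2g \
      my_app.py

    # Client mode: the AM is a separate, normally small container with its own settings.
    spark-submit --master yarn --deploy-mode client \
      --conf spark.yarn.am.memory=4g \
      --conf spark.yarn.am.memoryOverhead=1g \
      my_app.py
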
1 vote, 1 answer

Uber jar not found in Kubernetes via spark-submit

I have a very simple Spark job, but I can't get it to work in Kubernetes. The error I get is: 19/10/03 14:59:51 WARN DependencyUtils: Local jar /opt/spark/work-dir/target/scala-2.11/ScalaTest-assembly-1.0.jar does not exist, skipping. …
Victor
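
On Kubernetes the jar path is resolved inside the driver container, so a path that only exists on the machine running spark-submit will be reported as missing. A sketch assuming the assembly jar has been baked into the image (image name, class name, and API server address are placeholders):

    spark-submit \
      --master k8s://https://<api-server>:6443 \
      --deploy-mode cluster \
      --name scala-test \
      --class com.example.Main \
      --conf spark.kubernetes.container.image=myrepo/spark-app:latest \
      local:///opt/spark/jars/ScalaTest-assembly-1.0.jar
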
1 vote, 0 answers

Do we have retry configuration for Spark-SQL?

Do we have retry configuration for Spark-SQL? We have 'spark.yarn.maxAppAttempts' for spark-submit. Do we have similar conf for spark-sql?
Raj
1 vote, 1 answer

spark-submit 'Unable to coerce 'startDate' to a formatted date (long)'

Getting error: error: Unable to coerce 'startDate' to a formatted date (long) when I ran spark submit as below: dse -u cassandra -p cassandra spark-submit --class com.abc.rm.Total_count \ --master dse://x.x.x.x:9042 TotalCount.jar \ …
Pramod
1 vote, 1 answer

Spark on k8s - Error: Missing application resource

I'm trying to run the SparkPi example using spark on k8s. Working with kubectl, minikube, and spark-2.4.4-bin-hadoop2.7. Running the following command: spark-submit --master k8s://https://192.168.99.100:8443 --deploy-mode cluster --name spark-pi …
LiranBo
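
"Missing application resource" generally means the application jar was not given as the final positional argument. A sketch of the SparkPi submission, with the container image name as a placeholder; the examples jar path matches the spark-2.4.4-bin-hadoop2.7 image layout:

    spark-submit \
      --master k8s://https://192.168.99.100:8443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=<spark-image> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
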
1 vote, 1 answer

spark-submit cluster with HIVE, how to debug a failure on aws EMR

I'm creating an EMR cluster through the AWS EMR interface, but this time I'm trying to use HIVE and s3. So far I'm just trying to do something very simple: creating tables from existing parquet files into hive. from pyspark.sql import…
Jay Cee
1 vote, 3 answers

Need help running spark-submit in Apache Airflow

I am a relatively new user to Python and Airflow and am having a very difficult time getting spark-submit to run in an Airflow task. My goal is to get the following DAG task to run successfully from datetime import datetime, timedelta from airflow…
mattc-7
1 vote, 0 answers

spark-submit on local windows command line is failing | java.io.IOException: Failed to connect

I am trying to run spark-submit from my local Windows box on the command line; however, it is failing to connect. It was working a couple of days ago, and I am not able to recall any changes I made to my box. spark-submit --class…