Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

Synonymous tags: elastic-map-reduce, amazon-emr

1166 questions

3 answers

How to clean up the list of Terminated AWS EMR clusters?

I have about 88 EMR clusters that are terminated in my AWS account. How do I clean up the list of terminated EMR clusters? Will AWS clean up the list? How come I don't see the terminated clusters being removed from the list of clusters just like how…

amazon-web-services emr

asked May 05 '14 at 20:14

Nicholas Key

2 answers

File already exists error writing new files from dataframe

On EMR Spark, I am writing an RDD[String] to S3 via a DataFrame: rddString.toDF().coalesce(16).write.option("compression", "gzip").mode(SaveMode.Overwrite).json(s"s3n://my-bucket/some/new/path") Save mode is Overwrite and…

apache-spark emr

asked Mar 05 '18 at 01:28

Synesso

1 answer

Optimizing GC on EMR cluster

I am running a Spark Job written in Scala on EMR and the stdout of each executor is filled with GC allocation failures. 2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234…

apache-spark garbage-collection jvm emr amazon-emr

asked Dec 07 '16 at 23:56

YetAnother

3 answers

boto EMR add step and auto terminate

Python 2.7.12, boto3==1.3.1. How can I add a step to a running EMR cluster and have the cluster terminate after the step completes, regardless of whether it fails or succeeds? Create the cluster: response = client.run_job_flow( Name=name, …

python amazon-web-services boto3 emr

asked Oct 24 '17 at 12:49

duffn

2 answers

Amazon EMR - how to set a timeout for a step

Is there a way to set a timeout for a step in Amazon AWS EMR? I'm running a batch Apache Spark job on EMR and I would like the job to stop with a timeout if it doesn't finish within 3 hours. I cannot find a way to set a timeout, neither in Spark nor in…

apache-spark hadoop-yarn emr amazon-emr

asked Apr 21 '17 at 10:41

Erica

3 answers

Livy Server on Amazon EMR hangs on Connecting to ResourceManager

I'm trying to deploy a Livy Server on Amazon EMR. First, I built the Livy master branch: mvn clean package -Pscala-2.11 -Pspark-2.0. Then I uploaded it to the EMR cluster master. I set the following…

apache-spark hadoop-yarn cloudera emr

asked Oct 28 '16 at 20:10

matheusr

2 answers

How to restart Spark service in EMR after changing conf settings?

I am using EMR-5.9.0, and after changing some configuration files I want to restart the service to see the effect. How can I achieve this? I tried to find the name of the service using initctl list, as I saw in other answers, but no luck...

apache-spark emr amazon-emr

asked Oct 12 '17 at 12:24

Dimitris Poulopoulos

2 answers

AWS connection timeout when running Spark job on EMR

I'm trying to submit a simple Spark job to an Amazon EMR cluster. My cluster has 5 m4.2xlarge instances (1 master, 4 slaves), each with 16 vCPU and 32 GB of memory. This is my code: def main(args : Array[String]): Unit = { val sparkConfig = new…

hadoop apache-spark amazon-s3 apache-spark-sql emr

asked Aug 31 '17 at 00:36

drunkenfist

1 answer

How to run Spark Scala code on Amazon EMR

I am trying to run the following piece of Spark code written in Scala on Amazon EMR: import org.apache.spark.{SparkConf, SparkContext} object TestRunner { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("Hello…

scala amazon-web-services apache-spark emr amazon-emr

asked Oct 20 '16 at 21:36

pathikrit

2 answers

Running Spark on AWS EMR, how to run driver on master node?

It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually unused. Is it possible to run the driver program on the MASTER node instead? I have experimented with the --deploy-mode…

amazon-web-services apache-spark emr

asked Feb 04 '16 at 19:40

Landon Kuhn

2 answers

Where does EMR store Spark stdout?

I am running my Spark application on EMR, and have several println() statements. Other than the console, where do these statements get logged? My S3 aws-logs directory structure for my cluster looks like: node ├── i-0031cd7a536a42g1e │ ├──…

amazon-web-services apache-spark amazon-s3 emr amazon-emr

asked Dec 07 '17 at 23:43

B. Smith

2 answers

How can I connect PySpark (local machine) to my EMR cluster?

I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH: ssh -i hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com Once ssh'd into the master node, I can…

apache-spark amazon-ec2 pyspark emr

asked Dec 01 '16 at 21:01

Soubhik

2 answers

Missing SPARK_HOME when using SparkLauncher on AWS EMR cluster

I am using EMR 5.0 with Spark 2.0.0. I am trying to run a child Spark application from a Scala Spark application using org.apache.spark.launcher.SparkLauncher. I need to set SPARK_HOME using setSparkHome: var handle = new SparkLauncher() …

amazon-web-services apache-spark pyspark emr amazon-emr

asked Sep 15 '16 at 12:30

Ulile

1 answer

YARN: What is the difference between number-of-executors and executor-cores in Spark?

I am learning Spark on AWS EMR. In the process, I am trying to understand the difference between the number of executors (--num-executors) and executor cores (--executor-cores). Can anyone please explain it? Also, when I am trying to submit the…

apache-spark hadoop-yarn emr

asked Apr 25 '16 at 23:26

AIR

0 answers

Spark Job error: YarnAllocator: Exit status: -100. Diagnostics: Container released on a lost node

I am running a job on AWS-EMR 4.1, Spark 1.5 with the following conf: spark-submit --deploy-mode cluster --master yarn-cluster --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 --conf…

amazon-web-services apache-spark hadoop-yarn emr

asked Dec 05 '15 at 06:37

Edamame
