Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
14
votes
3 answers

How to clean up the list of Terminated AWS EMR clusters?

I have about 88 EMR clusters that are terminated in my AWS account. How do I clean up the list of terminated EMR clusters? Will AWS clean up the list? How come I don't see the terminated clusters being removed from the list of clusters just like how…
Nicholas Key
  • 1,429
  • 4
  • 21
  • 24
13
votes
2 answers

File already exists error writing new files from dataframe

On EMR Spark, writing an RDD[String] to S3 via a dataframe. rddString .toDF() .coalesce(16) .write .option("compression", "gzip") .mode(SaveMode.Overwrite) .json(s"s3n://my-bucket/some/new/path") Save mode is Overwrite and…
Synesso
  • 37,610
  • 35
  • 136
  • 207
13
votes
1 answer

Optimizing GC on EMR cluster

I am running a Spark Job written in Scala on EMR and the stdout of each executor is filled with GC allocation failures. 2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234…
YetAnother
  • 1,045
  • 1
  • 9
  • 27
12
votes
3 answers

boto EMR add step and auto terminate

Python 2.7.12, boto3==1.3.1. How can I add a step to a running EMR cluster and have the cluster terminate after the step completes, regardless of whether it fails or succeeds? Create the cluster: response = client.run_job_flow( Name=name, …
duffn
  • 3,690
  • 8
  • 33
  • 68
12
votes
2 answers

Amazon EMR - how to set a timeout for a step

Is there a way to set a timeout for a step in Amazon EMR? I'm running a batch Apache Spark job on EMR and I would like the job to stop with a timeout if it doesn't finish within 3 hours. I cannot find a way to set a timeout, neither in Spark nor in…
Erica
  • 1,608
  • 2
  • 21
  • 32
12
votes
3 answers

Livy Server on Amazon EMR hangs on Connecting to ResourceManager

I'm trying to deploy a Livy Server on Amazon EMR. First I built the Livy master branch: mvn clean package -Pscala-2.11 -Pspark-2.0 Then, I uploaded it to the EMR cluster master. I set the following…
matheusr
  • 567
  • 9
  • 29
11
votes
2 answers

How to restart Spark service in EMR after changing conf settings?

I am using EMR 5.9.0 and, after changing some configuration files, I want to restart the service to see the effect. How can I achieve this? I tried to find the name of the service using initctl list, as I saw in other answers, but had no luck...
Dimitris Poulopoulos
  • 1,139
  • 2
  • 15
  • 36
11
votes
2 answers

AWS connection timeout when running Spark job on EMR

I'm trying to submit a simple Spark job to an Amazon EMR cluster. My cluster has 5 m4.2xlarge instances (1 master, 4 slaves), each with 16 vCPU and 32 GB of memory. This is my code: def main(args : Array[String]): Unit = { val sparkConfig = new…
drunkenfist
  • 2,958
  • 12
  • 39
  • 73
11
votes
1 answer

How to run Spark Scala code on Amazon EMR

I am trying to run the following piece of Spark code written in Scala on Amazon EMR: import org.apache.spark.{SparkConf, SparkContext} object TestRunner { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("Hello…
pathikrit
  • 32,469
  • 37
  • 142
  • 221
11
votes
2 answers

Running Spark on AWS EMR, how to run driver on master node?

It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually unused. Is it possible to run the driver program on the MASTER node instead? I have experimented with the --deploy-mode…
Landon Kuhn
  • 76,451
  • 45
  • 104
  • 130
10
votes
2 answers

Where does EMR store Spark stdout?

I am running my Spark application on EMR, and have several println() statements. Other than the console, where do these statements get logged? My S3 aws-logs directory structure for my cluster looks like: node ├── i-0031cd7a536a42g1e │   ├──…
B. Smith
  • 1,063
  • 4
  • 14
  • 23
10
votes
2 answers

How can I connect PySpark (local machine) to my EMR cluster?

I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH: ssh -i hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com Once ssh'd into the master node, I can…
Soubhik
  • 103
  • 4
10
votes
2 answers

Missing SPARK_HOME when using SparkLauncher on AWS EMR cluster

I am using EMR 5.0 with Spark 2.0.0. I am trying to run a child Spark application from a Scala Spark application using org.apache.spark.launcher.SparkLauncher. I need to set SPARK_HOME using setSparkHome: var handle = new SparkLauncher() …
Ulile
  • 251
  • 1
  • 3
  • 9
10
votes
1 answer

YARN: What is the difference between number-of-executors and executor-cores in Spark?

I am learning Spark on AWS EMR. In the process I am trying to understand the difference between the number of executors (--num-executors) and executor cores (--executor-cores). Can anyone please explain? Also when I am trying to submit the…
AIR
  • 817
  • 12
  • 24
10
votes
0 answers

Spark Job error: YarnAllocator: Exit status: -100. Diagnostics: Container released on a *lost* node

I am running a job on AWS-EMR 4.1, Spark 1.5 with the following conf: spark-submit --deploy-mode cluster --master yarn-cluster --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 --conf…
Edamame
  • 23,718
  • 73
  • 186
  • 320