Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
22
votes
3 answers

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an…
retnuH
  • 1,525
  • 2
  • 11
  • 18
22
votes
5 answers

Where are the Spark logs on EMR?

I'm not able to locate error logs or messages from println calls in Scala while running jobs on Spark in EMR. Where can I access these? I'm submitting the Spark job, written in Scala, to EMR using script-runner.jar with arguments --deploy-mode set…
Sean Bollin
  • 870
  • 2
  • 10
  • 17
21
votes
4 answers

Spark resources not fully allocated on Amazon EMR

I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I…
Michel Lemay
  • 2,054
  • 2
  • 17
  • 34
20
votes
2 answers

SparkUI for pyspark - corresponding line of code for each stage?

I have a PySpark program running on an AWS cluster. I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike Scala or Java Spark programs, which show which line of code each stage corresponds to, I can't find…
Edamame
  • 23,718
  • 73
  • 186
  • 320
20
votes
4 answers

Any Scala SDK or interface for AWS?

Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in the EMR jobs.
CruncherBigData
  • 1,112
  • 3
  • 14
  • 34
19
votes
2 answers

EMR Spark - TransportClient: Failed to send RPC

I'm getting this error. I tried increasing memory on the cluster instances and in the executor and driver parameters, without success. 17/05/07 23:17:07 ERROR TransportClient: Failed to send RPC 6465703946954088562 to…
Luis Sobrecueva
  • 680
  • 1
  • 6
  • 13
17
votes
2 answers

AWS EMR perform "bootstrap" script on all the already running machines in cluster

I have one EMR cluster which is running 24/7. I can't turn it off and launch a new one. What I would like to do is perform something like a bootstrap action on the already running cluster, preferably using Python and boto or the AWS CLI. I can…
ziky90
  • 2,627
  • 4
  • 33
  • 47
16
votes
3 answers

collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one "c3.8xlarge" machine. After reading several resources, I understood that I have to allow a decent amount of memory off-heap because I am using pyspark, so I have configured the cluster as follows: One…
Rami
  • 8,044
  • 18
  • 66
  • 108
15
votes
4 answers

How to set a custom environment variable in EMR to be available for a spark Application

I need to set a custom environment variable in EMR to be available when running a spark application. I have tried adding this: ... --configurations '[ …
15
votes
2 answers

terminating a spark step in aws

I want to set up a series of Spark steps on an EMR Spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs…
Daniel Imberman
  • 618
  • 1
  • 5
  • 18
15
votes
1 answer

Get a yarn configuration from commandline

In EMR, is there a way to get the value of a specific configuration key using the yarn command? For example, I would like to do something like this: yarn get-config yarn.scheduler.maximum-allocation-mb
fo_x86
  • 2,583
  • 1
  • 30
  • 41
15
votes
3 answers

How to suppress INFO messages for spark-sql running on EMR?

I'm running Spark on EMR as described in Run Spark and Spark SQL on Amazon Elastic MapReduce: This tutorial walks you through installing and operating Spark, a fast and general engine for large-scale data processing, on an Amazon EMR cluster.…
rongenre
  • 1,334
  • 11
  • 21
15
votes
5 answers

Force Server Side Encryption for S3 Bucket

I want to set an S3 bucket policy so that all requests to upload to that bucket will use server side encryption, even if it is not specified in the request header. I have seen this post (Amazon S3 Server Side Encryption Bucket Policy problems) where…
qwwqwwq
  • 6,999
  • 2
  • 26
  • 49
14
votes
2 answers

Boosting spark.yarn.executor.memoryOverhead

I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message: Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory…
masta-g3
  • 1,202
  • 4
  • 17
  • 27
14
votes
1 answer

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

The Amazon EMR documentation on adding steps to a cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, the Amazon EMR documentation for step configuration suggests that a single step can accommodate just one execution of…
verve
  • 775
  • 1
  • 9
  • 21