Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

Synonymous tags: elastic-map-reduce, amazon-emr

1166 questions

3 answers

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an…

apache-spark hadoop-yarn emr amazon-emr elastic-map-reduce

asked Nov 30 '15 at 16:51

retnuH

5 answers

Where are the Spark logs on EMR?

I'm not able to locate error logs or messages from println calls in Scala while running jobs on Spark in EMR. Where can I access these? I'm submitting the Spark job, written in Scala, to EMR using script-runner.jar with arguments --deploy-mode set…

scala apache-spark emr

asked May 27 '15 at 23:38

Sean Bollin

4 answers

Spark resources not fully allocated on Amazon EMR

I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I…

apache-spark hadoop-yarn emr

asked Jun 08 '15 at 15:47

Michel Lemay

2 answers

SparkUI for pyspark - corresponding line of code for each stage?

I have a pyspark program running on an AWS cluster. I am monitoring the job through the Spark UI (see attached). However, I noticed that, unlike Scala or Java Spark programs, which show which line of code each stage corresponds to, I can't find…

apache-spark pyspark emr

asked Jul 11 '16 at 20:08

Edamame

4 answers

Any Scala SDK or interface for AWS?

Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in the EMR jobs.

scala amazon-web-services emr amazon-emr

asked Jun 06 '13 at 21:35

CruncherBigData

2 answers

EMR Spark - TransportClient: Failed to send RPC

I'm getting this error. I tried to increase memory on the cluster instances and in the executor and driver parameters, without success. 17/05/07 23:17:07 ERROR TransportClient: Failed to send RPC 6465703946954088562 to…

apache-spark hadoop-yarn emr

asked May 24 '17 at 12:51

Luis Sobrecueva

2 answers

AWS EMR perform "bootstrap" script on all the already running machines in cluster

I have one EMR cluster which is running 24/7. I can't turn it off and launch a new one. What I would like to do is to perform something like a bootstrap action on the already running cluster, preferably using Python and boto or the AWS CLI. I can…

python amazon-web-services boto emr amazon-emr

asked Oct 26 '14 at 17:18

ziky90

3 answers

collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one "c3.8xlarge" machine. After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark, so I have configured the cluster as follows: One…

pandas apache-spark pyspark emr amazon-emr

asked Nov 28 '17 at 16:13

Rami

4 answers

How to set a custom environment variable in EMR to be available for a spark Application

I need to set a custom environment variable in EMR to be available when running a spark application. I have tried adding this: ... --configurations '[ …

amazon-web-services hadoop apache-spark environment-variables emr

asked Feb 22 '17 at 15:00

NetanelRabinowitz

2 answers

terminating a spark step in aws

I want to set up a series of Spark steps on an EMR Spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs…

hadoop amazon-web-services apache-spark emr

asked Jan 26 '16 at 17:28

Daniel Imberman

1 answer

Get a yarn configuration from commandline

In EMR, is there a way to get a specific configuration value, given the configuration key, using the yarn command? For example, I would like to do something like this: yarn get-config yarn.scheduler.maximum-allocation-mb

hadoop hadoop-yarn hadoop2 emr elastic-map-reduce

asked Jan 07 '16 at 22:31

fo_x86
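
For the question above about reading a single YARN configuration value, one possible workaround is to read the key straight out of the Hadoop-style XML configuration rather than through the yarn command. This is only a minimal sketch: the helper name `get_hadoop_property` is hypothetical, the sample XML and its value are made up for illustration, and the location of the real `yarn-site.xml` on a given cluster is an assumption you would need to check.

```python
import xml.etree.ElementTree as ET


def get_hadoop_property(xml_text, key):
    """Return the <value> for `key` in Hadoop-style configuration XML, or None."""
    root = ET.fromstring(xml_text)
    # Hadoop configuration files are a flat list of <property> elements,
    # each holding a <name> and a <value> child.
    for prop in root.iter("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None


# Made-up sample content; a real cluster would supply its own yarn-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>11520</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(get_hadoop_property(SAMPLE, "yarn.scheduler.maximum-allocation-mb"))
```

Note that this only sees values written in the file you pass in; defaults that come from elsewhere will show up as `None`.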

3 answers

How to suppress INFO messages for spark-sql running on EMR?

I'm running Spark on EMR as described in Run Spark and Spark SQL on Amazon Elastic MapReduce: This tutorial walks you through installing and operating Spark, a fast and general engine for large-scale data processing, on an Amazon EMR cluster.…

log4j apache-spark emr

asked Dec 14 '14 at 02:02

rongenre

5 answers

Force Server Side Encryption for S3 Bucket

I want to set an S3 bucket policy so that all requests to upload to that bucket will use server side encryption, even if it is not specified in the request header. I have seen this post (Amazon S3 Server Side Encryption Bucket Policy problems) where…

encryption amazon-web-services amazon-s3 emr

asked Apr 14 '14 at 20:06

qwwqwwq

2 answers

Boosting spark.yarn.executor.memoryOverhead

I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message: Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory…

amazon-web-services apache-spark pyspark emr amazon-emr

asked Jun 29 '16 at 13:58

masta-g3
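
For the question above, it may help to see roughly how the executor heap and the overhead add up to the container size YARN enforces. The sketch below assumes the default documented for Spark 1.6/2.x, where `spark.yarn.executor.memoryOverhead` is 10% of executor memory with a 384 MB minimum. The function name is hypothetical, and YARN's rounding up to its allocation increment is ignored.

```python
def yarn_container_request_mb(executor_memory_mb, overhead_mb=None,
                              factor=0.10, floor_mb=384):
    """Approximate memory (MB) requested from YARN for one Spark executor.

    Assumes the default overhead is max(floor_mb, factor * executor memory)
    unless an explicit overhead_mb is given. YARN's own rounding is ignored.
    """
    if overhead_mb is None:
        overhead_mb = max(floor_mb, int(executor_memory_mb * factor))
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    # A 5 GB executor with the assumed default overhead.
    print(yarn_container_request_mb(5120))
    # The same executor with the overhead raised explicitly to 1 GB.
    print(yarn_container_request_mb(5120, overhead_mb=1024))
```

Under these assumptions, raising the overhead grows the container without growing the JVM heap, which is usually the point when off-heap usage (for example Python workers) is what hits the limit.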

1 answer

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

The Amazon EMR documentation on adding steps to a cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, the Amazon EMR documentation for step configuration suggests that a single step can accommodate just one execution of…

hadoop amazon-web-services hadoop-streaming emr

asked Jun 14 '14 at 10:15

verve
