Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).


3368 questions

votes

3 answers

Amazon EC2 vs. Amazon EMR

I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS. I don't know anything about AWS. If I plan to deploy it, which should I choose: Amazon EC2 or Amazon EMR? I want…

asked Apr 11 '12 at 05:09

Bhavesh Shah

3,299
11
49
73

votes

2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly has allocated all the worker nodes to the Spark job…

apache-spark hadoop-yarn emr amazon-emr elastic-map-reduce

asked Nov 26 '15 at 14:16

retnuH

1,525
2
11
18

votes

4 answers

How to launch and configure an EMR cluster using boto

I'm trying to launch a cluster and run a job all using boto. I find lots of examples of creating job_flows. But I can't, for the life of me, find an example that shows: How to define the cluster to be used (by clusted_id) How to configure an launch…

python amazon-web-services boto amazon-emr

asked Oct 11 '14 at 11:50

eran

14,496
34
98
144

votes

5 answers

Does an EMR master node know its cluster ID?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify…

amazon-web-services hadoop amazon-emr

asked Nov 26 '13 at 20:16

bstempi

2,023
1
15
27

votes

6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question to clarify with the Masters: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance, cost savings by avoiding over-provisioning or under-provisioning resources, besides running…

amazon-web-services etl amazon-emr aws-glue

asked Jan 12 '18 at 09:09

Yuva

2,831
7
36
60

votes

2 answers

S3 SlowDown error in Spark on EMR

I am getting this error when writing a parquet file, this has started to happen recently com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503;…

scala apache-spark amazon-s3 amazon-emr apache-spark-dataset

asked Sep 07 '17 at 18:59

Mikel San Vicente

3,831
2
21
39

votes

2 answers

'Operation timed out' error on trying to ssh in to the Amazon EMR Spark Cluster

I'm trying to ssh into an Amazon EMR Spark cluster. Here's what I did: Get the cluster master's IP: aws emr describe-cluster --cluster-id | grep MasterPublicDnsName Use the IP to ssh into the box: ssh -i CSxxx.pem…

apache-spark ssh amazon-emr

asked Aug 23 '16 at 08:06

xpm

votes

3 answers

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an…

apache-spark hadoop-yarn emr amazon-emr elastic-map-reduce

asked Nov 30 '15 at 16:51

retnuH

1,525
2
11
18

votes

3 answers

ValueError: Invalid endpoint: https://s3..amazonaws.com

When an EMR machine tries to run a step that includes boto3 initialisation, it sometimes gets the following error: ValueError: Invalid endpoint: https://s3..amazonaws.com When I try to set up a new machine it can suddenly work. Attached the…

python amazon-web-services amazon-s3 boto3 amazon-emr

asked Sep 15 '19 at 10:05

Aviv Oron

votes

2 answers

Amazon EC2 On-Demand Workers for Short Tasks

I am looking to build a web application which needs to run resource-intensive MCMC (Markov chain Monte Carlo) calculations on-demand in R to generate some probability graphs for the user. Constraints: Obviously I don't want to run the…

r amazon-ec2 amazon-emr amazon-swf

asked Jun 10 '12 at 13:39

mikegreiling

1,160
12
21

votes

6 answers

Session isn't active Pyspark in an AWS EMR cluster

I have opened an AWS EMR cluster and in pyspark3 jupyter notebook I run this code: ".. textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x) textRdd.collect().show() .." I got this error: An error was encountered: Invalid status code '400'…

pyspark amazon-emr

asked Sep 23 '19 at 12:48

anat

votes

1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes…

concurrency limit amazon-emr amazon-athena aws-glue

asked Jul 22 '19 at 12:22

Ilya Kisil

2,490
2
17
31

votes

2 answers

Saving dataframe to local file system results in empty results

We are running spark 2.3.0 on AWS EMR. The following DataFrame "df" is non empty and of modest size: scala> df.count res0: Long = 4067 The following code works fine for writing df to hdfs: scala> val hdf =…

apache-spark amazon-emr

asked Jul 30 '18 at 23:07

WestCoastProjects

58,982
91
316
560

votes

4 answers

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with an ETL job running for 10 minutes for 30 days. Expected crawler requests is assumed to be 1…
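The job-run part of this estimate can be sketched as simple arithmetic. The snippet below is only an illustration, not the asker's calculation: it assumes one 10-minute run per day for 30 days, and the per-DPU-hour rate is a hypothetical placeholder rather than a figure from the question. It also ignores crawler charges and any minimum billing duration.

```python
def glue_dpu_hours(dpus: int, minutes_per_run: float, runs: int) -> float:
    """Total DPU-hours consumed by `runs` job runs of equal length."""
    return dpus * minutes_per_run * runs / 60


def glue_job_cost(dpus: int, minutes_per_run: float, runs: int,
                  price_per_dpu_hour: float) -> float:
    """Job-run cost only: DPU-hours multiplied by the hourly DPU rate."""
    return glue_dpu_hours(dpus, minutes_per_run, runs) * price_per_dpu_hour


# ASSUMPTION: illustrative rate only; check current AWS pricing for your region.
ASSUMED_PRICE_PER_DPU_HOUR = 0.44

# 6 DPUs, 10 minutes per run, assumed one run per day for 30 days.
hours = glue_dpu_hours(dpus=6, minutes_per_run=10, runs=30)
cost = glue_job_cost(6, 10, 30, ASSUMED_PRICE_PER_DPU_HOUR)
print(f"{hours:.1f} DPU-hours, roughly ${cost:.2f} at the assumed rate")
```

Under these assumptions the job consumes 30 DPU-hours per month. The same number of node-hours on EMR would be priced as EC2 instance cost plus the EMR surcharge, so the comparison depends on the instance type chosen.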

amazon-web-services amazon-emr aws-glue cost-management

asked Feb 07 '18 at 11:32

Yuva

2,831
7
36
60

votes

2 answers

How to tune spark job on EMR to write huge data quickly on S3

I have a Spark job where I am doing an outer join between two data frames. The first data frame is 260 GB, in text format, split into 2200 files, and the second data frame is 2 GB. Then writing data frame output which is…

apache-spark-sql hadoop2 amazon-emr

asked Oct 15 '17 at 11:16

Sudarshan kumar

1,503
4
36
83
