Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
24
votes
3 answers

Amazon EC2 vs. Amazon EMR

I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS, but I don't know anything about AWS. If I deploy it there, which should I choose: Amazon EC2 or Amazon EMR? I want…
Bhavesh Shah
  • 3,299
  • 11
  • 49
  • 73
23
votes
2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly has allocated all the worker nodes to the Spark job…
retnuH
  • 1,525
  • 2
  • 11
  • 18
23
votes
4 answers

How to launch and configure an EMR cluster using boto

I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job_flows, but I can't, for the life of me, find an example that shows: How to define the cluster to be used (by cluster_id) How to configure and launch…
eran
  • 14,496
  • 34
  • 98
  • 144
23
votes
5 answers

Does an EMR master node know its cluster ID?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify…
bstempi
  • 2,023
  • 1
  • 15
  • 27
22
votes
6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question for the experts: AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance and cost savings by avoiding over-provisioning or under-provisioning resources, besides running…
Yuva
  • 2,831
  • 7
  • 36
  • 60
22
votes
2 answers

S3 SlowDown error in Spark on EMR

I am getting this error when writing a Parquet file; it started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503;…
22
votes
2 answers

'Operation timed out' error on trying to ssh in to the Amazon EMR Spark Cluster

I'm trying to SSH into an Amazon EMR Spark cluster. Here's what I did: (1) Get the cluster master's IP: aws emr describe-cluster --cluster-id | grep MasterPublicDnsName (2) Use the IP to SSH into the box: ssh -i CSxxx.pem…
xpm
  • 353
  • 2
  • 10
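
The two steps in the excerpt above can be sketched in Python using only the standard library. This is a minimal sketch, not the asker's method: it assumes the JSON printed by `aws emr describe-cluster` has already been captured as a string, that the login user is `hadoop` (an assumption, not stated in the question), and the function names and the key file name are hypothetical.

```python
import json
import shlex


def master_dns_from_describe(describe_json: str) -> str:
    """Return the MasterPublicDnsName field from describe-cluster JSON output."""
    data = json.loads(describe_json)
    return data["Cluster"]["MasterPublicDnsName"]


def ssh_command(key_path: str, host: str, user: str = "hadoop") -> str:
    """Build the ssh command line; user="hadoop" is an assumed default."""
    return "ssh -i {} {}".format(
        shlex.quote(key_path), shlex.quote("{}@{}".format(user, host))
    )


if __name__ == "__main__":
    # Hypothetical sample output, trimmed to the one field that matters here.
    sample = (
        '{"Cluster": {"Id": "j-EXAMPLE", '
        '"MasterPublicDnsName": "ec2-203-0-113-10.compute-1.amazonaws.com"}}'
    )
    host = master_dns_from_describe(sample)
    print(ssh_command("my-key.pem", host))
```

Parsing the JSON rather than piping through `grep` yields the bare host name without surrounding quotes or commas. It does not by itself address an "Operation timed out" error, which occurs before authentication.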
22
votes
3 answers

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an…
retnuH
  • 1,525
  • 2
  • 11
  • 18
21
votes
3 answers

ValueError: Invalid endpoint: https://s3..amazonaws.com

When an EMR machine tries to run a step that includes boto3 initialisation, it sometimes gets the following error: ValueError: Invalid endpoint: https://s3..amazonaws.com When I try to set up a new machine, it can suddenly work. Attached the…
Aviv Oron
  • 505
  • 1
  • 6
  • 10
21
votes
2 answers

Amazon EC2 On-Demand Workers for Short Tasks

I am looking to build a web application which needs to run resource-intensive MCMC (Markov chain Monte Carlo) calculations on-demand in R to generate some probability graphs for the user. Constraints: Obviously I don't want to run the…
mikegreiling
  • 1,160
  • 12
  • 21
20
votes
6 answers

Session isn't active Pyspark in an AWS EMR cluster

I have started an AWS EMR cluster, and in a pyspark3 Jupyter notebook I run this code: ".. textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x) textRdd.collect().show() .." I got this error: An error was encountered: Invalid status code '400'…
anat
  • 705
  • 2
  • 7
  • 20
20
votes
1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes…
Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
20
votes
2 answers

Saving dataframe to local file system results in empty results

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size: scala> df.count res0: Long = 4067 The following code works fine for writing df to HDFS: scala> val hdf =…
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
20
votes
4 answers

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR in order to choose between them. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with an ETL job running for 10 minutes for 30 days. The expected number of crawler requests is assumed to be 1…
Yuva
  • 2,831
  • 7
  • 36
  • 60
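
For the Glue side of a comparison like the one above, the usage figures in the excerpt reduce to simple arithmetic. This is a minimal sketch, assuming one 10-minute run per day for 30 days and a per-DPU-hour rate that you supply yourself (the rate is not given in the excerpt and varies by region). The function names are hypothetical, and crawler charges and any minimum billing duration are left out.

```python
def glue_dpu_hours(dpus: int, minutes_per_run: float, runs: int) -> float:
    """Total DPU-hours: DPUs x hours per run x number of runs."""
    return dpus * minutes_per_run / 60 * runs


def glue_job_cost(dpu_hours: float, rate_per_dpu_hour: float) -> float:
    """ETL job cost for a caller-supplied rate (an assumption, not a quoted price)."""
    return dpu_hours * rate_per_dpu_hour


if __name__ == "__main__":
    # Figures from the excerpt: 6 DPUs, 10 minutes per run, 30 daily runs (assumed).
    hours = glue_dpu_hours(dpus=6, minutes_per_run=10, runs=30)
    print(hours)  # 30.0
```

Multiplying the resulting 30 DPU-hours by the current regional rate gives the monthly ETL charge, which can then be set against the hourly cost of an equivalent EMR cluster.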
20
votes
2 answers

How to tune spark job on EMR to write huge data quickly on S3

I have a Spark job where I am doing an outer join between two data frames. The first data frame is 260 GB, stored as text files split into 2200 files, and the second data frame is 2 GB. Then writing data frame output which is…
Sudarshan kumar
  • 1,503
  • 4
  • 36
  • 83