Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1 vote, 1 answer

Why can't I change "spark.driver.memory" value in AWS Elastic Map Reduce?

I want to tune my Spark cluster on AWS EMR, but I can't change the default value of spark.driver.memory, which causes every Spark application to crash because my dataset is big. I tried editing the spark-defaults.conf file manually on the master…
1 vote, 0 answers

Unable to load S3 parquet with postgresql driver in spark-shell

I am trying to load a Parquet file from S3 in the EMR spark-shell. Command: // to start spark spark-shell --driver-class-path postgresql-42.2.5.jar --jars postgresql-42.2.5.jar // to read…
bob
  • 4,595
  • 2
  • 25
  • 35
1 vote, 1 answer

Access cross region s3 endpoint through private subnet

I have an EMR cluster that spins up in a private subnet in eu-west-1. I have defined a gateway endpoint for S3 in the route table. I have to access this public bucket/location exposed by AWS:…
ishan3243
  • 1,870
  • 4
  • 30
  • 49
1 vote, 1 answer

Proper way to check if a folder exists in AWS S3 from AWS EMR?

Before calling this a duplicate, please read my question. I have found two methods of checking whether a folder exists in S3 from EMR, but I wonder which one is correct. To get the credentials of the EMR machine (e.g. from a Spark application) to access…
belka
  • 1,480
  • 1
  • 18
  • 31
1 vote, 0 answers

Python modules not on worker nodes for AWS-EMR

I am doing an ML project on AWS EMR clusters and use a bootstrap script to set up my environment. I am running into a very common problem where my modules (in this case a .py file I built) are not installed on my worker nodes. My workflow is to code in a .py…
J Doe
  • 173
  • 5
1 vote, 1 answer

TEZ mapper resource request

We recently migrated from MapReduce to Tez for executing Hive queries on EMR. We are seeing cases where the exact same Hive query launches very different numbers of mappers. See the Map 3 phase below. On the first run it requested 305 resources and on…
kvb
  • 625
  • 3
  • 8
  • 12
1 vote, 0 answers

Where is information from YARN applications stored on AWS EMR (Application history)?

Context: I run Spark applications on an Amazon EMR cluster. These applications are orchestrated by YARN. I didn't define yarn.nodemanager.log-dirs, spark.yarn.historyServer.address or other configurations. In the Application history tab there is…
Tan4ek
  • 13
  • 3
1 vote, 1 answer

AWS EMR dependencies

I am trying to translate the Java code in "End-to-End Amazon EMR Java Source Code Sample" to Scala. I am using SBT for dependency management. Here are my current relevant dependencies in build.sbt: //…
Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
1 vote, 1 answer

Problem in executing a shell script present on host using docker exec

I'm trying to execute a script on the master node of an AWS EMR cluster. The intention is to create a new conda env and link it to Jupyter. I'm following this doc from AWS. The problem is that, whatever the content of the script, I'm getting the same error:…
Bitswazsky
  • 4,242
  • 3
  • 29
  • 58
1 vote, 1 answer

Adding S3 sync step in EMR

After performing all the steps, I want to execute the last step to copy S3 data to another bucket. I didn't find any supported script for running shell commands https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-commandrunner.html s3-dist-cp is…
Dev
  • 13,492
  • 19
  • 81
  • 174
1 vote, 1 answer

With statements inside an Insert statement HIVE EMR AWS

Hive does not recognize my WITH statement inside of an INSERT command. How can I make Hive understand this? I've created the external Hive tables to store all of the data referenced in this query. That all executes fine and the data is available.…
Fish357
  • 87
  • 8
1 vote, 2 answers

get ip of emr master node from yarn cli

In order to get a list of the IP addresses of EMR slave nodes, one must run the following code: yarn node -list 2>/dev/null \ | sed -n "s/^\(ip[^:]*\):.*/\1/p" The yarn node -list command happens to print the IP of the master node to stderr: 19/04/02…
Walrus the Cat
  • 2,314
  • 5
  • 35
  • 64
1 vote, 1 answer

Call multiple spark jobs within single EMR cluster

I want to call multiple Spark jobs using spark-submit within a single EMR cluster. Does EMR support this? How can I achieve it? I currently use AWS Lambda to invoke an EMR job for my Spark job, but we would like to extend to multiple Spark…
1 vote, 3 answers

Create A record in CloudFormation for EMR master node private IP address

I would like to know if there is a way to declare an AWS::Route53::RecordSet in a CloudFormation config that points to the private IP address of the master node on an EMR cluster that is also defined in the same configuration. The CloudFormation…
1 vote, 1 answer

Sqoop Import Error "Could not load db driver class" with Amazon EMR Service

I have created an EMR cluster with Hadoop, Sqoop and Spark configured. I am trying a Sqoop import but am getting the error "Could not load db driver class: com.mysql.jdbc.Driver". My question is: where do we put the MySQL driver? I have…
Rahul Goyal
  • 433
  • 2
  • 8