Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1
vote
1 answer

How to get to this path /etc/hadoop/conf on EMR cluster?

I am new to EMR and Spark. I am going through the steps mentioned here: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/ . Step 5 says to copy all files in /etc/hadoop/conf on the remote Amazon EMR…
Prakash Raj
  • 149
  • 3
  • 13
1
vote
1 answer

Memory allocation in Yarn Amazon EMR

I am getting this error in PySpark (Amazon EMR); my file is about 2 GB. How can I change the allocation? Thanks. I tried to increase the size of the cluster, but at some stages I still have the problem. Py4JJavaError: An error occurred while calling …
user11837980
1
vote
2 answers

Provisioning EMR nodes with custom files

I'm trying to run a jar with an Apache Nutch dependency on an AWS EMR Hadoop cluster. The problem is that Nutch can't find the plugin classes (I'm specifying the plugins location with -Dplugin.folders). I tested this option locally and it works fine: java -cp…
Kirill
  • 7,580
  • 6
  • 44
  • 95
1
vote
2 answers

Optimizing AWS spending without having access to account billing info

I would like to know if it is possible to evaluate and optimize AWS spending (specifically, EC2 spending) without having access to account billing info? Long story short, we do not have the ability to view account billing dashboards / metrics due to…
James Wierzba
  • 16,176
  • 14
  • 79
  • 120
1
vote
2 answers

EMR creation: task and core nodes cannot be specified as "Max on demand" for spot pricing

core_instance_group { instance_type = "c4.large" instance_count = 1 ebs_config { size = "40" type = "gp2" volumes_per_instance = 1 } bid_price = "0.30" I would require the…
1
vote
1 answer

How can I know in advance the EMR resources needed to perform a join with Big Data?

I want to launch an EMR step about which I only know the following: it will have to read several files of size X GB from S3. I also know the step will need to perform a join among subsets of data from those files. Is there a logic/formula for…
Lluc
  • 21
  • 3
1
vote
1 answer

Performance of AWS EMR over S3 compared to Server with harddisk storage

We have around 10 TB of customer data that we have to load and query using Hive, creating aggregation tables that in turn have to be queried multiple times. I am planning to use AWS S3 to store the 10 TB of data in one bucket and query the data…
Srihari Karanth
  • 2,067
  • 2
  • 24
  • 34
1
vote
1 answer

Spark S3Guard - Skip listing S3

I'm using Spark (2.4) to process data stored on S3. I'm trying to understand whether there's a way to avoid listing the objects that I'm reading as my batch job inputs (I'm talking about ~1M). I know about S3Guard, which stores the objects…
Modi
  • 2,200
  • 4
  • 23
  • 37
1
vote
1 answer

What IAM role should be assigned to an AWS Lambda function so that it can get the EMR cluster status?

I've prepared a simple Lambda function in AWS to terminate long-running EMR clusters after a certain threshold is reached. The code snippet was tested locally and works perfectly fine. Now I pushed it into a Lambda, took care of the library…
Bitswazsky
  • 4,242
  • 3
  • 29
  • 58
1
vote
0 answers

How to migrate HUE 4.2 to HUE 4.4 on EMR cluster

I'm currently running an EMR 5.17.0 cluster with HUE 4.2. I'm planning to upgrade my EMR to 5.24 and migrate HUE from 4.2 to 4.4. I've followed the instructions from AWS, "How to migrate a Hue database from an existing Amazon EMR…
SharpLu
  • 1,136
  • 2
  • 12
  • 28
1
vote
2 answers

How to save a file from pyspark dataframe which can be accessible later to upload it to S3?

I want to write a CSV file to S3, built from a DataFrame. I tried saving the DataFrame to CSV using the normal API, but unfortunately the file is not accessible later on when uploading it to S3. I then thought of saving the file…
Aviral Srivastava
  • 4,058
  • 8
  • 29
  • 81
1
vote
1 answer

EMR Spark step to append to parquet files is overwriting parquet files

Spark 2.4.2 on an Amazon EMR cluster (1 master, 2 nodes) using Python 3.6. I am reading objects in Amazon S3, compressing them in Parquet format, and adding them (appending) to an existing store of Parquet data. When I run my code in a pyspark shell…
Eric
  • 145
  • 1
  • 1
  • 9
1
vote
1 answer

Cannot connect/query from Presto on AWS EMR with Java JDBC

If I SSH onto the master node of my Presto EMR cluster, I can run queries. However, I would like to be able to run queries from Java source code on my local machine that connects to the EMR cluster. I set up my Presto EMR cluster with default…
moontartan
  • 31
  • 3
1
vote
2 answers

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default in order to speed up the process?
Kshitij Kohli
  • 4,055
  • 4
  • 19
  • 27
1
vote
2 answers

How do I specify the Spark configuration when running on EMR?

So I'm trying to run a Spark pipeline on EMR, and I'm creating a step like so: // Build the Spark job submission request val runSparkJob = new StepConfig() .withName("Run Pipeline") .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER) …
alexgolec
  • 26,898
  • 33
  • 107
  • 159