Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions

1 answer

How to get to this path /etc/hadoop/conf on EMR cluster?

I am new to EMR and Spark. I am going through the steps described at https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/. Step 5 says to copy all files in /etc/hadoop/conf on the remote Amazon EMR…

asked Jul 27 '19 at 01:27

Prakash Raj

1 answer

Memory allocation in Yarn Amazon EMR

I am getting this error in PySpark (Amazon EMR); my file is about 2 GB. How can I change the memory allocation? I tried increasing the size of the cluster, but at some stages I still have the problem: Py4JJavaError: An error occurred while calling …

amazon-web-services pyspark amazon-emr

asked Jul 25 '19 at 18:51

user11837980

2 answers

Provisioning EMR nodes with custom files

I'm trying to run a JAR with an Apache Nutch dependency on an AWS EMR Hadoop cluster. The problem is that Nutch can't find the plugin classes (I'm specifying the plugins location with -Dplugin.folders). I tested this option locally and it works fine: java -cp…

java hadoop amazon-emr nutch

asked Jul 24 '19 at 16:25

Kirill

2 answers

Optimizing AWS spending without having access to account billing info

I would like to know if it is possible to evaluate and optimize AWS spending (specifically, EC2 spending) without having access to account billing info? Long story short, we do not have the ability to view account billing dashboards / metrics due to…

amazon-web-services amazon-ec2 amazon-emr

asked Jul 24 '19 at 14:43

James Wierzba

2 answers

Unable to specify "Max on demand" spot pricing for task and core nodes when creating an EMR cluster

core_instance_group { instance_type = "c4.large" instance_count = 1 ebs_config { size = "40" type = "gp2" volumes_per_instance = 1 } bid_price = "0.30" I would require the…

amazon-web-services terraform amazon-emr terraform-provider-aws

asked Jul 23 '19 at 15:22

3br10ee032

1 answer

How can I know in advance the EMR resources needed to perform a join with Big Data?

I want to launch an EMR step about which I only know the following: it will have to read several files of X GB each from S3, and it will need to perform a join among subsets of the data from those files. Is there a logic/formula for…

apache-spark bigdata amazon-emr

asked Jul 23 '19 at 13:34

Lluc

1 answer

Performance of AWS EMR over S3 compared to a server with hard disk storage

We have around 10 TB of customer data that we have to load and query using Hive, creating aggregation tables that will in turn be queried multiple times. I am planning to use AWS S3 to store the 10 TB of data in one bucket and query the data…

amazon-web-services amazon-s3 hive amazon-emr

asked Jul 23 '19 at 10:10

Srihari Karanth

1 answer

Spark S3Guard - Skip listing S3

I'm using Spark (2.4) to process data stored on S3. I'm trying to understand whether there's a way to avoid listing the objects that I'm reading as my batch job inputs (I'm talking about ~1M of them). I know about S3Guard, which stores the objects…

apache-spark cloudera amazon-emr

asked Jul 13 '19 at 05:06

Modi

1 answer

What IAM role should be assigned to an AWS Lambda function so that it can get the EMR cluster status?

I've prepared a simple Lambda function in AWS to terminate long-running EMR clusters after a certain threshold is reached. The code snippet was tested locally and works perfectly fine. Now I pushed it into a Lambda, took care of the library…

amazon-web-services aws-lambda amazon-emr

asked Jul 12 '19 at 03:26

Bitswazsky

0 answers

How to migrate HUE 4.2 to HUE 4.4 on EMR cluster

I'm currently running an EMR 5.17.0 cluster with HUE 4.2. Now I'm planning to upgrade my EMR to 5.24 and migrate HUE from 4.2 to 4.4. I've followed the instructions from AWS, "How to migrate a Hue database from an existing Amazon EMR…

django hive amazon-emr hue

asked Jul 11 '19 at 13:03

SharpLu

2 answers

How to save a file from a PySpark DataFrame so that it can be accessed later for upload to S3?

I want to write a CSV file to S3 that is built from a DataFrame. I tried saving the DataFrame to CSV with the normal API, but unfortunately the file is not accessible later when uploading it to S3. I then thought of saving the file…

python python-3.x apache-spark pyspark amazon-emr

asked Jul 11 '19 at 12:41

Aviral Srivastava

1 answer

EMR Spark step to append to parquet files is overwriting parquet files

Spark 2.4.2 on an Amazon EMR cluster (1 master, 2 nodes) using Python 3.6. I am reading objects in Amazon S3, compressing them in Parquet format, and adding them (appending) to an existing store of Parquet data. When I run my code in a pyspark shell…

python apache-spark amazon-emr parquet

asked Jul 10 '19 at 12:31

Eric

1 answer

Cannot connect/query from Presto on AWS EMR with Java JDBC

If I SSH into the master node of my Presto EMR cluster, I can run queries. However, I would like to be able to run queries from Java source code on my local machine that connects to the EMR cluster. I set up my Presto EMR cluster with default…

jdbc amazon-ec2 amazon-emr presto

asked Jul 09 '19 at 20:02

moontartan

2 answers

Is it possible to specify the number of mappers and reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers higher than the default to speed up the process?

amazon-web-services amazon-s3 amazon-emr distcp s3distcp

asked Jul 05 '19 at 10:00

Kshitij Kohli

2 answers

How do I specify the Spark configuration when running on EMR?

So I'm trying to run a Spark pipeline on EMR, and I'm creating a step like so: // Build the Spark job submission request val runSparkJob = new StepConfig() .withName("Run Pipeline") .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER) …

amazon-web-services apache-spark amazon-emr aws-step-config

asked Jul 02 '19 at 19:06

alexgolec
