Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/


452 questions
9
votes
3 answers

Broken Pipe Error causes streaming Elastic MapReduce job on AWS to fail

Everything works fine locally when I do the following: cat input | python mapper.py | sort | python reducer.py. However, when I run the streaming MapReduce job on AWS Elastic MapReduce, the job does not complete successfully. The mapper.py runs part…
Ben G
  • 26,091
  • 34
  • 103
  • 170
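The excerpt doesn't show the actual scripts, but a minimal word-count-style pair that follows the Hadoop Streaming contract (tab-separated key/value pairs on stdout, reducer input pre-sorted by key) might look like this sketch; the local cat input | python mapper.py | sort | python reducer.py pipeline above is exactly how such scripts are usually smoke-tested before being submitted to EMR.

```python
#!/usr/bin/env python
# mapper.py -- hypothetical word-count mapper for Hadoop Streaming
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # emit one tab-separated key/value pair per word
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums counts per key; streaming guarantees the input is sorted by key
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, 0
    current_count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, current_count))
```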
9
votes
2 answers

parallel generation of random forests using scikit-learn

Main question: How do I combine different random forests in Python with scikit-learn? I am currently using the randomForest package in R to generate random forest objects on Elastic MapReduce, to address a classification problem. Since my…
reddy
  • 180
  • 1
  • 8
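The usual workaround (not an official scikit-learn API) is to train the forests independently and then pool their fitted trees; a rough sketch, assuming every forest saw data with the same feature layout and class labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def train_chunk(seed):
    """Stand-in for one forest trained on one EMR worker."""
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(X, y)  # in practice each worker would see its own data chunk
    return rf

forests = [train_chunk(seed) for seed in range(4)]

# Merge: keep the first forest as the container and append the other trees.
combined = forests[0]
for rf in forests[1:]:
    combined.estimators_ += rf.estimators_
combined.n_estimators = len(combined.estimators_)

print(combined.predict(X[:5]))
```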
9
votes
1 answer

Elastic Map Reduce: difference between CANCEL_AND_WAIT and CONTINUE?

I just found that using Amazon's Elastic Map Reduce, I can specify a step to have one of three ActionOnFailure choices: TERMINATE_JOB_FLOW, CANCEL_AND_WAIT, or CONTINUE. TERMINATE_JOB_FLOW is the default and obvious - it shuts down the entire cluster…
Suman
  • 9,221
  • 5
  • 49
  • 62
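For reference, this is roughly how the choice shows up when a step is submitted through boto3 (the cluster id, bucket, and step arguments below are placeholders; in current API versions TERMINATE_CLUSTER replaces the older TERMINATE_JOB_FLOW value):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical step: cancel any remaining steps on failure but keep the cluster up.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "my-step",
        "ActionOnFailure": "CANCEL_AND_WAIT",  # or CONTINUE / TERMINATE_CLUSTER
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/myApplication.py"],
        },
    }],
)
print(response["StepIds"])
```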
8
votes
1 answer

Python client support for running Hive on top of Amazon EMR

I've noticed that neither mrjob nor boto supports a Python interface to submit and run Hive jobs on Amazon Elastic MapReduce (EMR). Are there any other Python client libraries that support running Hive on EMR?
poiuy
  • 500
  • 5
  • 12
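The question predates it, but boto3 can now submit Hive steps directly; a hedged sketch, where the cluster id and script location are placeholders and the hive-script argument form should be checked against the current EMR step documentation:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical Hive step: runs a HiveQL script stored in S3 on an existing cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "run-hive-query",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script",
                "--args", "-f", "s3://my-bucket/queries/report.q",
            ],
        },
    }],
)
```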
8
votes
3 answers

How to write data in Elasticsearch from Pyspark?

I have integrated ELK with PySpark and saved the RDD as ELK data on the local file system: rdd.saveAsTextFile("/tmp/ELKdata"), then logData = sc.textFile('/tmp/ELKdata/*'), errors = logData.filter(lambda line: "raw1-VirtualBox" in line), errors.count(). The value I got…
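For the writing direction, the usual route is the elasticsearch-hadoop connector; a sketch under the assumption that its jar is on the Spark classpath (e.g. via spark-submit --jars) and with host, index, and document shape as placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="es-write-sketch")

# Each element is a (doc id, document dict) pair.
docs = sc.parallelize([
    ("1", {"host": "raw1-VirtualBox", "message": "example log line"}),
    ("2", {"host": "raw2-VirtualBox", "message": "another log line"}),
])

es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "logs/raw",  # index/type, placeholder
}

# Write via the elasticsearch-hadoop output format.
docs.saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)
```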
8
votes
1 answer

AWS EMR and Spark 1.0.0

I've been running into some issues recently while trying to use Spark on an AWS EMR cluster. I am creating the cluster using something like: ./elastic-mapreduce --create --alive \ --name "ll_Spark_Cluster" \ --bootstrap-action…
Eras
  • 428
  • 3
  • 11
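The elastic-mapreduce Ruby CLI and the install-spark bootstrap action have since been retired; on release-label clusters Spark is requested as an application instead. A rough boto3 equivalent, with instance types and roles as placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster: Spark comes from the release label, not a bootstrap action.
cluster = emr.run_job_flow(
    Name="ll_Spark_Cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # the old --alive flag
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])
```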
8
votes
1 answer

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is: min (yarn.nodemanager.resource.memory-mb /…
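A tiny worked example of the quoted formula, with illustrative numbers rather than actual EMR defaults:

```python
# Per-node resources YARN can hand out (illustrative values).
yarn_nodemanager_resource_memory_mb = 12288
yarn_nodemanager_resource_cpu_vcores = 8

# Resources requested per map container (illustrative values).
mapreduce_map_memory_mb = 2048
mapreduce_map_cpu_vcores = 1

# Concurrent map tasks per node = the tighter of the two limits.
concurrent_map_tasks = min(
    yarn_nodemanager_resource_memory_mb // mapreduce_map_memory_mb,
    yarn_nodemanager_resource_cpu_vcores // mapreduce_map_cpu_vcores,
)
print(concurrent_map_tasks)  # -> 6, limited by memory in this example
```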
8
votes
2 answers

Amazon Elastic Map Reduce - Creating a job flow

I'm very new to Amazon services and am facing problems creating job flows. Every time I create a job flow it fails or shuts down. How to upload the input, output, and mapper function is not clear to me. I have followed the developer section, but…
anan_xon
  • 1,102
  • 1
  • 11
  • 21
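A typical streaming step references everything by S3 URI: the mapper and reducer scripts are uploaded first, the input prefix already exists, and the output prefix must not. A hedged sketch of the step definition (bucket names are placeholders; the dict is what gets passed in Steps to run_job_flow or add_job_flow_steps, as in the examples above):

```python
# Hypothetical Hadoop Streaming step for an EMR job flow.
step = {
    "Name": "wordcount-streaming",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-files", "s3://my-bucket/code/mapper.py,s3://my-bucket/code/reducer.py",
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
            "-input", "s3://my-bucket/input/",
            "-output", "s3://my-bucket/output/run-001/",  # must not already exist
        ],
    },
}
```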
8
votes
1 answer

Setting hadoop parameters with boto?

I am trying to enable bad input skipping on my Amazon Elastic MapReduce jobs. I am following the wonderful recipe described here: http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code The link above says that I need to…
slavi
  • 401
  • 3
  • 10
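A hedged sketch against the legacy boto 2 EMR API the question refers to (long since superseded by boto3); the bucket paths and property names are placeholders, and depending on the boto version the -D options may need to be ordered ahead of the streaming flags:

```python
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# Hypothetical streaming step with skip-mode properties passed as extra arguments.
step = StreamingStep(
    name="skip-bad-records",
    mapper="s3://my-bucket/code/mapper.py",
    reducer="s3://my-bucket/code/reducer.py",
    input="s3://my-bucket/input/",
    output="s3://my-bucket/output/skip-run/",
    step_args=[
        "-D", "mapreduce.map.skip.maxrecords=1",
        "-D", "mapreduce.map.maxattempts=8",
    ],
)

jobflow_id = conn.run_jobflow(name="skip-demo", steps=[step])
```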
7
votes
2 answers

How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar. We can specify these configurations in the following way when we run using…
Amar
  • 11,930
  • 5
  • 50
  • 73
7
votes
1 answer

Am I fully utilizing my EMR cluster?

Total instances: I have created an EMR cluster with 11 nodes total (1 master instance, 10 core instances). Job submission: spark-submit myApplication.py. Graph of containers: next, I've got these graphs, which refer to "containers", and I'm not entirely…
Kristian
  • 21,204
  • 19
  • 101
  • 176
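One way to make the container count predictable is to request executor resources explicitly instead of relying on defaults; the sizes below are purely illustrative and not tuned for any particular instance type:

```python
from pyspark import SparkConf, SparkContext

# Hypothetical resource request: each setting maps to a spark-submit flag
# (--num-executors, --executor-cores, --executor-memory).
conf = (
    SparkConf()
    .setAppName("myApplication")
    .set("spark.executor.instances", "10")  # e.g. one executor per core node
    .set("spark.executor.cores", "4")
    .set("spark.executor.memory", "8g")
)
sc = SparkContext(conf=conf)
```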
7
votes
2 answers

Use S3DistCp to copy file from S3 to EMR

I am struggling to find a way to use S3DistCp in my AWS EMR cluster. Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is not used anymore. Other sources suggest using the s3-dist-cp command, which…
V. Samma
  • 2,558
  • 8
  • 30
  • 34
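On release-label clusters s3-dist-cp is typically run through command-runner.jar as a step; a hedged sketch with the cluster id, bucket, and paths as placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical copy step: pull an S3 prefix into HDFS on the cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/input/",
                "--dest", "hdfs:///data/input/",
            ],
        },
    }],
)
```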
7
votes
1 answer

How to find the right portion between hadoop instance types

I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure this out. How do I know if I need more than one core instance? What are the "symptoms" I would see in EMR's…
Gavriel
  • 18,880
  • 12
  • 68
  • 105
7
votes
4 answers

copy files from amazon s3 to hdfs using s3distcp fails

I am trying to copy files from S3 to HDFS using a workflow in EMR, and when I run the command below the job flow starts successfully but gives me an error when it tries to copy the file to HDFS. Do I need to set any input file permissions…
raghuram gururajan
  • 533
  • 2
  • 13
  • 26
7
votes
3 answers

installing GIT on EMR

1) I have been told that git comes pre-installed on EMR. Is this true? I believe not, as I can confirm that "git" is not found in my elastic-mapreduce ssh terminal. See:…
jayunit100
  • 17,388
  • 22
  • 92
  • 167
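If it isn't present, the usual fix is a bootstrap action that installs it on every node at cluster start; a hedged sketch of the boto3 form, where the S3 script path is a placeholder and the script itself would just run sudo yum install -y git:

```python
# Hypothetical bootstrap action definition, passed as
# BootstrapActions=bootstrap_actions to emr.run_job_flow(...).
bootstrap_actions = [{
    "Name": "install-git",
    "ScriptBootstrapAction": {
        "Path": "s3://my-bucket/bootstrap/install-git.sh",
    },
}]
```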