Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/


452 questions
9
votes
3 answers

Broken Pipe Error causes streaming Elastic MapReduce job on AWS to fail

Everything works fine locally when I do the following: cat input | python mapper.py | sort | python reducer.py. However, when I run the streaming MapReduce job on AWS Elastic MapReduce, the job does not complete successfully. The mapper.py runs part…
Ben G
  • 26,091
  • 34
  • 103
  • 170
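The excerpt doesn't show the actual scripts, but a minimal word-count-style pair that follows the Hadoop Streaming contract (tab-separated key/value pairs on stdout, reducer input pre-sorted by key) might look like this sketch; the local cat input | python mapper.py | sort | python reducer.py pipeline above is exactly how such scripts are usually smoke-tested before being submitted to EMR.

```python
#!/usr/bin/env python
# mapper.py -- hypothetical word-count mapper for Hadoop Streaming
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # emit one tab-separated key/value pair per word
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums counts per key; streaming guarantees the input is sorted by key
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, 0
    current_count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, current_count))
```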
9
votes
2 answers

parallel generation of random forests using scikit-learn

Main question: How do I combine different random forests in Python with scikit-learn? I am currently using the randomForest package in R to generate random forest objects on Elastic MapReduce, to address a classification problem. Since my…
reddy
  • 180
  • 1
  • 8
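The usual workaround (not an official scikit-learn API) is to train the forests independently and then pool their fitted trees; a rough sketch, assuming every forest saw data with the same feature layout and class labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def train_chunk(seed):
    """Stand-in for one forest trained on one EMR worker."""
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(X, y)  # in practice each worker would see its own data chunk
    return rf

forests = [train_chunk(seed) for seed in range(4)]

# Merge: keep the first forest as the container and append the other trees.
combined = forests[0]
for rf in forests[1:]:
    combined.estimators_ += rf.estimators_
combined.n_estimators = len(combined.estimators_)

print(combined.predict(X[:5]))
```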
9
votes
1 answer

Elastic Map Reduce: difference between CANCEL_AND_WAIT and CONTINUE?

I just found that using Amazon's Elastic Map Reduce, I can specify a step to have one of three ActionOnFailure choices: TERMINATE_JOB_FLOW, CANCEL_AND_WAIT, or CONTINUE. TERMINATE_JOB_FLOW is the default and obvious - it shuts down the entire cluster…
Suman
  • 9,221
  • 5
  • 49
  • 62
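For reference, this is roughly how the choice shows up when a step is submitted through boto3 (the cluster id, bucket, and step arguments below are placeholders; in current API versions TERMINATE_CLUSTER replaces the older TERMINATE_JOB_FLOW value):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical step: cancel any remaining steps on failure but keep the cluster up.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "my-step",
        "ActionOnFailure": "CANCEL_AND_WAIT",  # or CONTINUE / TERMINATE_CLUSTER
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/myApplication.py"],
        },
    }],
)
print(response["StepIds"])
```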
8
votes
1 answer

Python client support for running Hive on top of Amazon EMR

I've noticed that neither mrjob nor boto supports a Python interface to submit and run Hive jobs on Amazon Elastic MapReduce (EMR). Are there any other Python client libraries that support running Hive on EMR?
poiuy
  • 500
  • 5
  • 12
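The question predates it, but boto3 can now submit Hive steps directly; a hedged sketch, where the cluster id and script location are placeholders and the hive-script argument form should be checked against the current EMR step documentation:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical Hive step: runs a HiveQL script stored in S3 on an existing cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "run-hive-query",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script",
                "--args", "-f", "s3://my-bucket/queries/report.q",
            ],
        },
    }],
)
```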
8
votes
3 answers

How to write data in Elasticsearch from Pyspark?

I have integrated ELK with PySpark and saved the RDD as ELK data on the local file system: rdd.saveAsTextFile("/tmp/ELKdata"), then logData = sc.textFile('/tmp/ELKdata/*'), errors = logData.filter(lambda line: "raw1-VirtualBox" in line), errors.count(). The value I got…
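For the writing direction, the usual route is the elasticsearch-hadoop connector; a sketch under the assumption that its jar is on the Spark classpath (e.g. via spark-submit --jars) and with host, index, and document shape as placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="es-write-sketch")

# Each element is a (doc id, document dict) pair.
docs = sc.parallelize([
    ("1", {"host": "raw1-VirtualBox", "message": "example log line"}),
    ("2", {"host": "raw2-VirtualBox", "message": "another log line"}),
])

es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "logs/raw",  # index/type, placeholder
}

# Write via the elasticsearch-hadoop output format.
docs.saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)
```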
8
votes
1 answer

AWS EMR and Spark 1.0.0

I've been running into some issues recently while trying to use Spark on an AWS EMR cluster. I am creating the cluster using something like: ./elastic-mapreduce --create --alive \ --name "ll_Spark_Cluster" \ --bootstrap-action…
Eras
  • 428
  • 3
  • 11
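The elastic-mapreduce Ruby CLI and the install-spark bootstrap action have since been retired; on release-label clusters Spark is requested as an application instead. A rough boto3 equivalent, with instance types and roles as placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster: Spark comes from the release label, not a bootstrap action.
cluster = emr.run_job_flow(
    Name="ll_Spark_Cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # the old --alive flag
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])
```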
8
votes
1 answer

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is: min (yarn.nodemanager.resource.memory-mb /…
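A tiny worked example of the quoted formula, with illustrative numbers rather than actual EMR defaults:

```python
# Per-node resources YARN can hand out (illustrative values).
yarn_nodemanager_resource_memory_mb = 12288
yarn_nodemanager_resource_cpu_vcores = 8

# Resources requested per map container (illustrative values).
mapreduce_map_memory_mb = 2048
mapreduce_map_cpu_vcores = 1

# Concurrent map tasks per node = the tighter of the two limits.
concurrent_map_tasks = min(
    yarn_nodemanager_resource_memory_mb // mapreduce_map_memory_mb,
    yarn_nodemanager_resource_cpu_vcores // mapreduce_map_cpu_vcores,
)
print(concurrent_map_tasks)  # -> 6, limited by memory in this example
```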
8
votes
2 answers

Amazon Elastic Map Reduce - Creating a job flow

I'm very new to Amazon services and am facing problems creating job flows. Every time I create a job flow it fails or shuts down. How to upload the input, output, and mapper function is not clear to me. I have followed the developer section, but…
anan_xon
  • 1,102
  • 1
  • 11
  • 21
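A typical streaming step references everything by S3 URI: the mapper and reducer scripts are uploaded first, the input prefix already exists, and the output prefix must not. A hedged sketch of the step definition (bucket names are placeholders; the dict is what gets passed in Steps to run_job_flow or add_job_flow_steps, as in the examples above):

```python
# Hypothetical Hadoop Streaming step for an EMR job flow.
step = {
    "Name": "wordcount-streaming",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-files", "s3://my-bucket/code/mapper.py,s3://my-bucket/code/reducer.py",
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
            "-input", "s3://my-bucket/input/",
            "-output", "s3://my-bucket/output/run-001/",  # must not already exist
        ],
    },
}
```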
8
votes
1 answer

Setting hadoop parameters with boto?

I am trying to enable bad input skipping on my Amazon Elastic MapReduce jobs. I am following the wonderful recipe described here: http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code The link above says that I need to…
slavi
  • 401
  • 3
  • 10
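A hedged sketch against the legacy boto 2 EMR API the question refers to (long since superseded by boto3); the bucket paths and property names are placeholders, and depending on the boto version the -D options may need to be ordered ahead of the streaming flags:

```python
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# Hypothetical streaming step with skip-mode properties passed as extra arguments.
step = StreamingStep(
    name="skip-bad-records",
    mapper="s3://my-bucket/code/mapper.py",
    reducer="s3://my-bucket/code/reducer.py",
    input="s3://my-bucket/input/",
    output="s3://my-bucket/output/skip-run/",
    step_args=[
        "-D", "mapreduce.map.skip.maxrecords=1",
        "-D", "mapreduce.map.maxattempts=8",
    ],
)

jobflow_id = conn.run_jobflow(name="skip-demo", steps=[step])
```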
7
votes
2 answers

How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar. We can specify these configurations in the following way when we run using…
Amar
  • 11,930
  • 5
  • 50
  • 73
7
votes
1 answer

Am I fully utilizing my EMR cluster?

Total instances: I have created an EMR cluster with 11 nodes total (1 master instance, 10 core instances). Job submission: spark-submit myApplication.py. Graph of containers: next, I've got these graphs, which refer to "containers", and I'm not entirely…
Kristian
  • 21,204
  • 19
  • 101
  • 176
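One way to make the container count predictable is to request executor resources explicitly instead of relying on defaults; the sizes below are purely illustrative and not tuned for any particular instance type:

```python
from pyspark import SparkConf, SparkContext

# Hypothetical resource request: each setting maps to a spark-submit flag
# (--num-executors, --executor-cores, --executor-memory).
conf = (
    SparkConf()
    .setAppName("myApplication")
    .set("spark.executor.instances", "10")  # e.g. one executor per core node
    .set("spark.executor.cores", "4")
    .set("spark.executor.memory", "8g")
)
sc = SparkContext(conf=conf)
```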
7
votes
2 answers

Use S3DistCp to copy file from S3 to EMR

I am struggling to find a way to use S3DistCp in my AWS EMR cluster. Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is not used anymore. Other sources suggest using the s3-dist-cp command, which…
V. Samma
  • 2,558
  • 8
  • 30
  • 34
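On release-label clusters s3-dist-cp is typically run through command-runner.jar as a step; a hedged sketch with the cluster id, bucket, and paths as placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical copy step: pull an S3 prefix into HDFS on the cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/input/",
                "--dest", "hdfs:///data/input/",
            ],
        },
    }],
)
```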
7
votes
1 answer

How to find the right portion between hadoop instance types

I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure this out. How do I know if I need more than one core instance? What are the "symptoms" I would see in EMR's…
Gavriel
  • 18,880
  • 12
  • 68
  • 105
7
votes
4 answers

copy files from amazon s3 to hdfs using s3distcp fails

I am trying to copy files from S3 to HDFS using a workflow in EMR, and when I run the command below the job flow starts successfully but gives me an error when it tries to copy the file to HDFS. Do I need to set any input file permissions…
raghuram gururajan
  • 533
  • 2
  • 13
  • 26
7
votes
3 answers

installing GIT on EMR

1) I have been told that git comes pre-installed on EMR. Is this true? I believe not, as I can confirm that "git" is not found in my elastic-mapreduce ssh terminal. See:…
jayunit100
  • 17,388
  • 22
  • 92
  • 167
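If it isn't present, the usual fix is a bootstrap action that installs it on every node at cluster start; a hedged sketch of the boto3 form, where the S3 script path is a placeholder and the script itself would just run sudo yum install -y git:

```python
# Hypothetical bootstrap action definition, passed as
# BootstrapActions=bootstrap_actions to emr.run_job_flow(...).
bootstrap_actions = [{
    "Name": "install-git",
    "ScriptBootstrapAction": {
        "Path": "s3://my-bucket/bootstrap/install-git.sh",
    },
}]
```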