Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
2
votes
0 answers

Reduce Elastic Map Reduce runtime

I use Elastic Map Reduce to analyze a large amount of data (stored on S3). What is the most cost-efficient way to reduce the runtime of the job, other than increasing the size of the instance? If I create more, smaller files on S3, will it reduce the…
ljaerj
  • 157
  • 1
  • 11
2
votes
2 answers

Use gzip input codec on files without .gz extension in hadoop

I'm running a Hadoop job on a bunch of gzipped input files. Hadoop should handle this easily (see "mapreduce in java - gzip input files"). Unfortunately, in my case, the input files don't have a .gz extension. I'm using CombineTextInputFormatClass, which…
John Chrysostom
  • 3,973
  • 1
  • 34
  • 50
2
votes
1 answer

How to force Hadoop to unzip inputs regardless of their extension?

I'm running map-reduce and my inputs are gzipped, but do not have a .gz (file name) extension. Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper. However, without the…
GilLevi
  • 2,117
  • 5
  • 22
  • 38
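As the excerpt above notes, Hadoop decides whether to decompress an input from its file name. One alternative idea is to detect gzip from the content instead, since every gzip stream starts with the two magic bytes 0x1f 0x8b. The following is only a minimal Python sketch of that idea, not a Hadoop API and not necessarily the accepted answer; the names `open_maybe_gzip` and `read_lines` are hypothetical.

```python
import gzip
import io

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(raw):
    """Return a readable binary stream, decompressing if `raw` looks like gzip.

    The decision is based on the leading bytes only, so the file name
    (and its extension) plays no role.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(raw))
    return io.BytesIO(raw)


def read_lines(raw):
    """Decode the (possibly gzipped) payload as UTF-8 and split it into lines."""
    with open_maybe_gzip(raw) as stream:
        return stream.read().decode("utf-8").splitlines()
```

A script used as a streaming mapper could apply the same check to each input it reads. Within a Java job, the equivalent would have to be done in a custom input format or record reader.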
2
votes
1 answer

Reading many small files from S3 very slow

Loading many small files (>200,000, 4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot figure out exactly where the bottleneck is. Pig Code…
2
votes
0 answers

Elastic MapReduce with boto - InstanceProfile is required for creating cluster

I'm trying to run an Elastic MapReduce job with the code below, but when I try this I get an error: InstanceProfile is required for creating cluster. Does anyone know why I'm getting this error? def createmrjob(dict): emr =…
techman
  • 423
  • 1
  • 7
  • 17
2
votes
2 answers

AWS EMR validation error

I have a problem running a map-reduce Java application. I simplified my problem using the tutorial code from AWS, which runs a pre-defined step: public class Main { public static void main(String[] args) { AWSCredentials credentials =…
2
votes
2 answers

Mapreduce job to HBase throws IOException: Pass a Delete or a Put

I am trying to output to an HBase table directly from my Mapper while using Hadoop 2.4.0 with HBase 0.94.18 on EMR. I am getting a nasty IOException: Pass a Delete or a Put when executing the code below. public class TestHBase { static class…
Marsellus Wallace
  • 17,991
  • 25
  • 90
  • 154
2
votes
1 answer

Elasticsearch _cat/indices is giving error?

Currently I am using the Elasticsearch helpers scan API, but it is not able to fetch data. Command: helpers.scan( client=client, query={"query":{"match_all":{}}}, scroll='10m', index="debug", doc_type = "tool",…
Birendra Kumar
  • 431
  • 1
  • 7
  • 18
2
votes
1 answer

Amazon EMR sorting

I am new to Amazon EMR, and I am trying to understand how the sorting phase after the map (before the reduce phase) works, and whether I can manipulate it by somehow supplying my own compare function. If you know how the output from the map…
ohad edelstain
  • 1,425
  • 2
  • 14
  • 22
2
votes
1 answer

How to use Python streaming UDFs in pig on Amazon EMR

Pig 0.12 introduced streaming Python UDFs, but they're experimental, so they need Hadoop 1. http://pig.apache.org/docs/r0.12.1/udf.html#python-udfs However, the only Amazon-provided AMI that can use Pig 0.12 is AMI 3.1.0, which uses Hadoop 2.4, not…
warbaker
  • 307
  • 3
  • 9
2
votes
1 answer

Does Hadoop Streaming's performance decrease if I use -mapper cat rather than -mapper org.apache.hadoop.mapred.lib.IdentityMapper?

I have had problems trying to use org.apache.hadoop.mapred.lib.IdentityMapper as the argument of -mapper in Hadoop Streaming 1.0.3. "cat" works though; does using cat affect performance -- especially on Elastic MapReduce?
verve
  • 775
  • 1
  • 9
  • 21
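For context on the question above: a Hadoop Streaming mapper is just a program that reads records from standard input and writes them to standard output, which is why `cat` can act as an identity mapper. A minimal Python sketch of the same behaviour follows. The function name `identity_map` is hypothetical, and the sketch says nothing about how `cat` performs relative to the Java `IdentityMapper`.

```python
import sys


def identity_map(lines=None, out=None):
    """Copy every input record to the output unchanged, like `cat`.

    Defaults to stdin/stdout, which is how a streaming mapper
    exchanges records with the framework.
    """
    lines = sys.stdin if lines is None else lines
    out = sys.stdout if out is None else out
    for line in lines:
        out.write(line)
```

A script passed to `-mapper` would end with a call to `identity_map()`. The explicit `lines` and `out` parameters exist only so the function can be exercised without a running job.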
2
votes
1 answer

How to debug Pig being stuck after job submission

I have a map-reduce job written in Pig that does the following. Given a set of Apache log files representing visits to a certain resource on a website: clean the logs of robots and unwanted log lines; produce the tuples (ip,…
mottalrd
  • 4,390
  • 5
  • 25
  • 31
2
votes
1 answer

"Unable to verify integrity of data" while running MR job

I'm running a relatively big MR job using Amazon Elastic Map Reduce. I ran the job plenty of times on small data sets with no problem. But when trying to run it on a large dataset I'm getting the following exception: Error:…
2
votes
0 answers

EMR hadoop tasks agonize for hours when losing task nodes

I've set up an Amazon EMR jobflow with 1 on-demand core node and 4 task nodes with bidding. When I run my task on only the core node each step finishes within 1 hour. When I'm lucky and have 1 core + 4 task nodes then steps usually finish within 10…
Gavriel
  • 18,880
  • 12
  • 68
  • 105
2
votes
5 answers

How to set up an Elasticsearch cluster

I am trying to set up a multi-node Elasticsearch cluster. Is there any useful link I can follow to set up the cluster? I am trying to run a map-reduce program in the cluster to find exact matches.
Amaresh
  • 3,231
  • 7
  • 37
  • 60