Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
0 answers

MapReduce job skipping / resizing?

I'm running a job on the whole commoncrawl corpus, and ran into this hiccup in the remaining map tasks. Anyone know what would cause that? I suspect it has something to do with this "resizing complete" message, but am not sure why the job would stop…
kelorek
  • 6,042
  • 6
  • 29
  • 32
1
vote
1 answer

Hive / Map-Reduce Job on a Hadoop cluster: How to (roughly) calculate the disk space needed?

Following use case: I run a Hive query on data that is about 500 GB in size in .gz compression: select count(distinct c1), c2 from t1 group by c2; This query results in ~2800 map jobs and ~400 reduce jobs. When setting up a Hadoop cluster with 20…
saschor
  • 319
  • 4
  • 12
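A rough back-of-the-envelope estimate for the disk-space question above can be sketched in Python. The expansion factor for .gz text, the intermediate-data ratio, and the HDFS replication factor below are illustrative assumptions, not measured values:

```python
# Rough disk-space estimate for a Hive job over gzip-compressed input.
# All factors below are assumptions for illustration, not measured values.

def estimate_disk_gb(compressed_gb, expansion=7, hdfs_replication=3,
                     intermediate_ratio=0.3):
    """Return a rough (input, intermediate, total) disk estimate in GB.

    expansion          -- assumed decompression ratio for .gz text (~5-10x)
    hdfs_replication   -- HDFS block replication factor (default 3)
    intermediate_ratio -- assumed size of map spill/shuffle output
                          relative to the raw (decompressed) input
    """
    stored_input = compressed_gb * hdfs_replication  # input stays compressed on HDFS
    raw = compressed_gb * expansion                  # size once mappers decompress it
    intermediate = raw * intermediate_ratio          # shuffle data is not replicated
    return stored_input, intermediate, stored_input + intermediate

inp, mid, total = estimate_disk_gb(500)
print(f"input on HDFS: {inp} GB, intermediate: {mid} GB, total: {total} GB")
```

For the 500 GB .gz case in the question, this kind of estimate suggests planning for a few TB of cluster disk; the real numbers depend heavily on the data and the query.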
1
vote
2 answers

Reading large files using mapreduce in hadoop

I have code that reads files from an FTP server and writes them into HDFS. I have implemented a customised InputFormatReader that sets the isSplitable property of the input to false. However, this gives me the following error. INFO mapred.MapTask:…
RadAl
  • 404
  • 5
  • 23
1
vote
2 answers

Producing ngram frequencies for a large dataset

I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX is suitable for my purposes. What's the best and most cost efficient way to do it in the next day or so? My thoughts are: PostgreSQL using…
Max
  • 2,760
  • 1
  • 28
  • 47
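For the ngram question above, the core map/reduce logic is small; this pure-Python sketch counts bigrams in memory and is only a model of what a Hadoop streaming or EMR job would shard across machines (the function names are illustrative, not from any library):

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def ngram_frequencies(lines, n=2):
    """Counter of n-gram -> frequency.

    In a MapReduce job the 'map' step would emit (ngram, 1) pairs per
    line and the 'reduce' step would sum them; Counter.update does both
    at once for this single-machine model.
    """
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts

freqs = ngram_frequencies(["the quick brown fox", "the quick red fox"], n=2)
print(freqs.most_common(3))
```

At WEX scale the same mapper/reducer split runs unchanged under Hadoop streaming, which is what makes EMR a plausible fit for the one-day budget in the question.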
1
vote
1 answer

EC2 parallel processing custom packages using segue

I am using the R segue package (downloadable from here) to carry out parallel processing. I would like to source a package to be installed when setting up clusters. The package is my own that I have made, and I have converted it into a tar.gz file…
h.l.m
  • 13,015
  • 22
  • 82
  • 169
1
vote
0 answers

Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143"

My hadoop streaming map-reduce jobs on Amazon EMR keep failing with the following error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143 at…
Michael Barton
  • 8,868
  • 9
  • 35
  • 43
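Exit code 143 is 128 + 15, i.e. the subprocess was killed with SIGTERM; on EMR this commonly means the framework killed a streaming task, often for running past the task timeout without reporting progress. Hadoop streaming treats specially formatted stderr lines as progress updates, and a minimal sketch of a mapper using that convention (the per-record work here is a placeholder) looks like:

```python
import sys

def process(record):
    """Placeholder for the real per-record work (illustrative only)."""
    return record.upper()

def run(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr,
        heartbeat_every=1000):
    """Streaming mapper that periodically reports progress so the
    framework's task timeout (mapred.task.timeout) does not SIGTERM
    the subprocess (which surfaces as exit code 143)."""
    for i, line in enumerate(stdin):
        stdout.write(process(line.rstrip("\n")) + "\n")
        if i % heartbeat_every == 0:
            # Hadoop streaming parses "reporter:status:..." lines on
            # stderr as status updates rather than plain logging.
            stderr.write("reporter:status:processed %d records\n" % i)
            stderr.flush()
```

In a real job you would call `run()` at the bottom of the script passed as `-mapper`. Memory limits can also produce kills that surface this way, so this is one likely cause, not the only one.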
1
vote
2 answers

What ports does Apache Hadoop version 1.0.3 use for intracluster communication of the daemons?

I know port 22 is only used for control scripts. But I need to know what ports I should open for my 3 node cluster. 2 slaves, 1 namenode/jobtracker. On what port do the daemons run? On what ports are the URLs displayed? The hadoop distro is: Apache…
user836087
  • 2,271
  • 8
  • 23
  • 33
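For reference, the commonly cited default ports for Apache Hadoop 1.x are collected below. The RPC ports in particular depend on what fs.default.name and mapred.job.tracker are set to, so treat these as typical values to verify against your *-site.xml files, not guarantees:

```python
# Typical default ports for Apache Hadoop 1.x daemons (verify against
# your core-site.xml / mapred-site.xml; the RPC ports are configurable).
HADOOP_1X_PORTS = {
    "namenode.rpc":      8020,   # fs.default.name (often also seen as 9000)
    "namenode.http":     50070,  # NameNode web UI
    "secondarynamenode": 50090,  # checkpoint HTTP
    "datanode.data":     50010,  # block data transfer
    "datanode.ipc":      50020,  # DataNode RPC
    "datanode.http":     50075,  # DataNode web UI
    "jobtracker.rpc":    9001,   # mapred.job.tracker (configurable)
    "jobtracker.http":   50030,  # JobTracker web UI
    "tasktracker.http":  50060,  # TaskTracker web UI
}

# Ports the slaves must be able to reach on the namenode/jobtracker box:
master_ports = [v for k, v in HADOOP_1X_PORTS.items()
                if k.startswith(("namenode", "jobtracker"))]
print(sorted(master_ports))
```

The datanode/tasktracker ports need to be open between all nodes, since DataNodes exchange blocks with each other during replication.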
1
vote
0 answers

MapReduce and similar methods to accelerate indexing of Apache Solr

I am building a web application in which users' content is indexed by Solr and then presented in a dashboard-type interface. Eventually, the requirements should allow users to upload a file of say 50 megs (some day 1 gig) and have their content…
ted.strauss
  • 4,119
  • 4
  • 34
  • 57
1
vote
0 answers

Why do Map Jobs slow down after the first set of Mappers is completed?

Say I have 100 mappers running in parallel and there are 500 mappers in total. The input size received by each mapper is almost the same, and the processing time each mapper takes should be more or less identical. But say the first 100 mappers…
Amar
  • 11,930
  • 5
  • 50
  • 73
1
vote
1 answer

Easiest way to get started with Hadoop

I was looking for the simplest way(s) to submit a MapReduce job. I am looking for a platform similar in complexity (or simplicity) to what Heroku is to Ruby or picloud.com is to map. The idea is that a beginner can submit a MapReduce job without…
user1172468
  • 5,306
  • 6
  • 35
  • 62
1
vote
2 answers

Pig on EMR trouble with piggybank and AvroStorage

I'm running a pig script on EMR that reads data stored in Avro format. It had been working locally, but to get other parts of the script to work on EMR, I had to revert the piggybank.jar I was using to 0.9.2 instead of 0.10.0. After making that…
Joe K
  • 18,204
  • 2
  • 36
  • 58
1
vote
1 answer

How can I share jar libraries with amazon elastic mapreduce?

To speed up jar-to-S3 uploading I want to copy all my common jars to something like "$HADOOP_HOME/lib" as in normal Hadoop. Is it possible for me to create a custom EMR Hadoop instance with these libraries preinstalled? Or is there an easier way?
yura
  • 14,489
  • 21
  • 77
  • 126
1
vote
0 answers

Why are half of my "word count" Hadoop Reducer output files 0 bytes when run on AWS/EMR?

I have a set of data that is basically the Mapping results of a simple Word Count (text files w/ word & count pairs, tab delimited), and I need to reduce it. There's about 160 GB of data, compressed into bz2 files. When I run my job on Amazon Web…
Dolan Antenucci
  • 15,432
  • 17
  • 74
  • 100
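One plausible cause of 0-byte reducer outputs like those in the question above (a guess, since the full question is truncated) is that the default hash partitioner distributes keys very unevenly when there are few distinct keys relative to the number of reducers: any reducer that receives no keys still writes an empty part-* file. A toy Python model of hash partitioning (using crc32 as a deterministic stand-in for Java's key.hashCode()):

```python
from zlib import crc32
from collections import defaultdict

def partition(key, num_reducers):
    """Deterministic stand-in for Hadoop's HashPartitioner
    (key.hashCode() mod numReduceTasks)."""
    return crc32(key.encode()) % num_reducers

def partition_counts(keys, num_reducers):
    """How many keys land on each reducer; reducers missing from the
    result receive nothing and emit a 0-byte output file."""
    counts = defaultdict(int)
    for k in keys:
        counts[partition(k, num_reducers)] += 1
    return counts

# With few distinct keys and many reducers, most reducers get no keys
# at all and produce empty part-* files.
counts = partition_counts(["apple", "banana", "cherry"], 8)
empty = 8 - len(counts)
print(f"{empty} of 8 reducers receive no keys")
```

If the real keys carry artifacts like trailing tabs or mixed case, distinct-looking keys can also cluster onto a few partitions, producing the same symptom with a different mechanism.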
1
vote
0 answers

ElasticMapReduce: Is it possible to reuse already allocated EMR cluster?

I specified the --alive option in the EMR CLI when I created a new cluster, and I am wondering if it is possible to reuse the cluster to launch another job? I can't find any relevant option to get some kind of ID for the cluster. So does that mean that it…
kee
  • 10,969
  • 24
  • 107
  • 168
1
vote
1 answer

Are Apache HBase and Cloudera HBase compatible?

At work we are attempting to do the following: run Elastic MapReduce jobs via Amazon, which freezes Hadoop at version 0.20.205, and write output to HBase running on EC2, specifically 0.92.1-cdh4.0.1 from Cloudera. What I've discovered so far is my…
brianz
  • 7,268
  • 4
  • 37
  • 44