Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
0 answers

MapReduce job skipping / resizing?

I'm running a job on the whole commoncrawl corpus, and ran into this hiccup in the remaining map tasks. Anyone know what would cause that? I suspect it has something to do with this "resizing complete" message, but am not sure why the job would stop…
kelorek
  • 6,042
  • 6
  • 29
  • 32
1
vote
1 answer

Hive / Map-Reduce Job on a Hadoop cluster: How to (roughly) calculate the disk space needed?

Following use case: I run a Hive query on data that is about 500 GB in size in .gz compression: select count(distinct c1), c2 from t1 group by c2; This query results in ~2800 map jobs and ~400 reduce jobs. When setting up a Hadoop cluster with 20…
saschor
  • 319
  • 4
  • 12
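A rough back-of-the-envelope estimate for the disk-space question above can be sketched in Python. The expansion factor for .gz text, the intermediate-data ratio, and the HDFS replication factor below are illustrative assumptions, not measured values:

```python
# Rough disk-space estimate for a Hive job over gzip-compressed input.
# All factors below are assumptions for illustration, not measured values.

def estimate_disk_gb(compressed_gb, expansion=7, hdfs_replication=3,
                     intermediate_ratio=0.3):
    """Return a rough (input, intermediate, total) disk estimate in GB.

    expansion          -- assumed decompression ratio for .gz text (~5-10x)
    hdfs_replication   -- HDFS block replication factor (default 3)
    intermediate_ratio -- assumed size of map spill/shuffle output
                          relative to the raw (decompressed) input
    """
    stored_input = compressed_gb * hdfs_replication  # input stays compressed on HDFS
    raw = compressed_gb * expansion                  # size once mappers decompress it
    intermediate = raw * intermediate_ratio          # shuffle data is not replicated
    return stored_input, intermediate, stored_input + intermediate

inp, mid, total = estimate_disk_gb(500)
print(f"input on HDFS: {inp} GB, intermediate: {mid} GB, total: {total} GB")
```

For the 500 GB .gz case in the question, this kind of estimate suggests planning for a few TB of cluster disk; the real numbers depend heavily on the data and the query.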
1
vote
2 answers

Reading large files using mapreduce in hadoop

I have code that reads files from an FTP server and writes them into HDFS. I have implemented a customised InputFormatReader that sets the isSplitable property of the input to false. However, this gives me the following error. INFO mapred.MapTask:…
RadAl
  • 404
  • 5
  • 23
1
vote
2 answers

Producing ngram frequencies for a large dataset

I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX is suitable for my purposes. What's the best and most cost efficient way to do it in the next day or so? My thoughts are: PostgreSQL using…
Max
  • 2,760
  • 1
  • 28
  • 47
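For the ngram question above, the core map/reduce logic is small; this pure-Python sketch counts bigrams in memory and is only a model of what a Hadoop streaming or EMR job would shard across machines (the function names are illustrative, not from any library):

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def ngram_frequencies(lines, n=2):
    """Counter of n-gram -> frequency.

    In a MapReduce job the 'map' step would emit (ngram, 1) pairs per
    line and the 'reduce' step would sum them; Counter.update does both
    at once for this single-machine model.
    """
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts

freqs = ngram_frequencies(["the quick brown fox", "the quick red fox"], n=2)
print(freqs.most_common(3))
```

At WEX scale the same mapper/reducer split runs unchanged under Hadoop streaming, which is what makes EMR a plausible fit for the one-day budget in the question.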
1
vote
1 answer

EC2 parallel processing custom packages using segue

I am using the R segue package (downloadable from here) to carry out parallel processing. I would like to source a package to be installed when setting up clusters. The package is my own that I have made, and I have converted it into a tar.gz file…
h.l.m
  • 13,015
  • 22
  • 82
  • 169
1
vote
0 answers

Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143"

My hadoop streaming map-reduce jobs on Amazon EMR keep failing with the following error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143 at…
Michael Barton
  • 8,868
  • 9
  • 35
  • 43
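Exit code 143 is 128 + 15, i.e. the subprocess was killed with SIGTERM; on EMR this commonly means the framework killed a streaming task, often for running past the task timeout without reporting progress. Hadoop streaming treats specially formatted stderr lines as progress updates, and a minimal sketch of a mapper using that convention (the per-record work here is a placeholder) looks like:

```python
import sys

def process(record):
    """Placeholder for the real per-record work (illustrative only)."""
    return record.upper()

def run(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr,
        heartbeat_every=1000):
    """Streaming mapper that periodically reports progress so the
    framework's task timeout (mapred.task.timeout) does not SIGTERM
    the subprocess (which surfaces as exit code 143)."""
    for i, line in enumerate(stdin):
        stdout.write(process(line.rstrip("\n")) + "\n")
        if i % heartbeat_every == 0:
            # Hadoop streaming parses "reporter:status:..." lines on
            # stderr as status updates rather than plain logging.
            stderr.write("reporter:status:processed %d records\n" % i)
            stderr.flush()
```

In a real job you would call `run()` at the bottom of the script passed as `-mapper`. Memory limits can also produce kills that surface this way, so this is one likely cause, not the only one.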
1
vote
2 answers

What ports does Apache Hadoop version 1.0.3 use for intracluster communication of the daemons?

I know port 22 is only used for control scripts. But I need to know what ports I should open for my 3 node cluster. 2 slaves, 1 namenode/jobtracker. On what port do the daemons run? On what ports are the URLs displayed? The hadoop distro is: Apache…
user836087
  • 2,271
  • 8
  • 23
  • 33
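For reference, the commonly cited default ports for Apache Hadoop 1.x are collected below. The RPC ports in particular depend on what fs.default.name and mapred.job.tracker are set to, so treat these as typical values to verify against your *-site.xml files, not guarantees:

```python
# Typical default ports for Apache Hadoop 1.x daemons (verify against
# your core-site.xml / mapred-site.xml; the RPC ports are configurable).
HADOOP_1X_PORTS = {
    "namenode.rpc":      8020,   # fs.default.name (often also seen as 9000)
    "namenode.http":     50070,  # NameNode web UI
    "secondarynamenode": 50090,  # checkpoint HTTP
    "datanode.data":     50010,  # block data transfer
    "datanode.ipc":      50020,  # DataNode RPC
    "datanode.http":     50075,  # DataNode web UI
    "jobtracker.rpc":    9001,   # mapred.job.tracker (configurable)
    "jobtracker.http":   50030,  # JobTracker web UI
    "tasktracker.http":  50060,  # TaskTracker web UI
}

# Ports the slaves must be able to reach on the namenode/jobtracker box:
master_ports = [v for k, v in HADOOP_1X_PORTS.items()
                if k.startswith(("namenode", "jobtracker"))]
print(sorted(master_ports))
```

The datanode/tasktracker ports need to be open between all nodes, since DataNodes exchange blocks with each other during replication.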
1
vote
0 answers

MapReduce and similar methods to accelerate indexing of Apache Solr

I am building a web application in which users' content is indexed by Solr and then presented in a dashboard-type interface. Eventually, the requirements should allow users to upload a file of say 50 megs (some day 1 gig) and have their content…
ted.strauss
  • 4,119
  • 4
  • 34
  • 57
1
vote
0 answers

Why do Map Jobs slow down after the first set of Mappers is completed?

Say I have 100 mappers running in parallel and there are 500 mappers in total. The input size received by each mapper is almost the same, and the processing time each mapper takes should be more or less identical. But say the first 100 mappers…
Amar
  • 11,930
  • 5
  • 50
  • 73
1
vote
1 answer

Easiest way to get started with Hadoop

I was looking for the simplest way(s) to submit a MapReduce job. I am looking for a platform similar in complexity (or simplicity) to what Heroku is to Ruby or picloud.com is to map. The idea is that a beginner can submit a MapReduce job without…
user1172468
  • 5,306
  • 6
  • 35
  • 62
1
vote
2 answers

Pig on EMR trouble with piggybank and AvroStorage

I'm running a pig script on EMR that reads data stored in Avro format. It had been working locally, but to get other parts of the script to work on EMR, I had to revert the piggybank.jar I was using to 0.9.2 instead of 0.10.0. After making that…
Joe K
  • 18,204
  • 2
  • 36
  • 58
1
vote
1 answer

How can I share jar libraries with amazon elastic mapreduce?

To speed up jar-to-S3 uploading I want to copy all my common jars to something like "$HADOOP_HOME/lib" as in normal Hadoop. Is it possible for me to create a custom EMR Hadoop instance with these libraries preinstalled? Or is there an easier way?
yura
  • 14,489
  • 21
  • 77
  • 126
1
vote
0 answers

Why are half of my "word count" Hadoop Reducer output files 0 bytes when run on AWS/EMR?

I have a set of data that is basically the Mapping results of a simple Word Count (text files w/ word & count pairs, tab delimited), and I need to reduce it. There's about 160 GB of data, compressed into bz2 files. When I run my job on Amazon Web…
Dolan Antenucci
  • 15,432
  • 17
  • 74
  • 100
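One plausible cause of 0-byte reducer outputs like those in the question above (a guess, since the full question is truncated) is that the default hash partitioner distributes keys very unevenly when there are few distinct keys relative to the number of reducers: any reducer that receives no keys still writes an empty part-* file. A toy Python model of hash partitioning (using crc32 as a deterministic stand-in for Java's key.hashCode()):

```python
from zlib import crc32
from collections import defaultdict

def partition(key, num_reducers):
    """Deterministic stand-in for Hadoop's HashPartitioner
    (key.hashCode() mod numReduceTasks)."""
    return crc32(key.encode()) % num_reducers

def partition_counts(keys, num_reducers):
    """How many keys land on each reducer; reducers missing from the
    result receive nothing and emit a 0-byte output file."""
    counts = defaultdict(int)
    for k in keys:
        counts[partition(k, num_reducers)] += 1
    return counts

# With few distinct keys and many reducers, most reducers get no keys
# at all and produce empty part-* files.
counts = partition_counts(["apple", "banana", "cherry"], 8)
empty = 8 - len(counts)
print(f"{empty} of 8 reducers receive no keys")
```

If the real keys carry artifacts like trailing tabs or mixed case, distinct-looking keys can also cluster onto a few partitions, producing the same symptom with a different mechanism.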
1
vote
0 answers

ElasticMapReduce: Is it possible to reuse already allocated EMR cluster?

I specified the --alive option in the EMR CLI when I created a new cluster, and I am wondering if it is possible to reuse the cluster to launch another job? I can't find any relevant option to get some kind of ID for the cluster. So does that mean that it…
kee
  • 10,969
  • 24
  • 107
  • 168
1
vote
1 answer

Are Apache HBase and Cloudera HBase compatible?

At work we are attempting to do the following: run Elastic MapReduce jobs via Amazon, which freezes Hadoop at version 0.20.205, and write output to HBase running on EC2, specifically 0.92.1-cdh4.0.1 from Cloudera. What I've discovered so far is my…
brianz
  • 7,268
  • 4
  • 37
  • 44