Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
0 answers

Indexing a file's contents in ElasticSearch

I have a text file which contains some names like below: Tom, Harry Robert Harry Matt Tremp. I want to index those names in ElasticSearch using the Java APIs, which should index all the names automatically. Can anybody suggest a solution, as I am new…
Amaresh
  • 3,231
  • 7
  • 37
  • 60
1
vote
2 answers

How to implement the combiner in Hadoop MapReduce?

I understand that to include a combiner in Hadoop MapReduce, the following line is added (which I have already done): conf.setCombinerClass(MyReducer.class); What I don't understand is where I actually implement the functionality of…
ali
  • 23
  • 5
1
vote
1 answer

Writing to DSE from an external Pig Job (Pig -> DSE connector)

I'm trying to write an EMR job running Pig that writes to DSE, which we'll be using for serving. Unfortunately, I can't get Pig to write to DSE, so I've broken the problem down to just connecting to the DSE node and trying to write to it. Here's what I'm…
1
vote
1 answer

How to make AWS Elastic MapReduce Hive commands run in parallel

I reviewed "How to make hive run mapreduce jobs concurrently?" My question is how to set the "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup. Also, is setting this option equivalent to doing something like the…
Patrick McCann
  • 484
  • 4
  • 11
1
vote
2 answers

Hive Query Number of mappers always 1

I'm trying to run a simple query on a table with one partition, which has around 200-300k records, all of them small files of 120 bytes. I'm using a custom INPUTFORMAT which reads the file contents and then queries another S3 file to fetch the actual…
Ravi
  • 41
  • 1
  • 4
1
vote
4 answers

Combine output files of MapReduce job

I have written a Mapper and Reducer in Python and have executed them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming. The final result folder contains the output in three different files: part-00000, part-00001 and part-00002.…
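One way to picture combining the part files is the sketch below. It assumes the job's output folder has already been copied from S3 to a local directory; the function name `merge_part_files` and both paths are hypothetical, not something from the question. It simply concatenates the `part-*` files in name order.

```python
import glob
import os
import shutil


def merge_part_files(output_dir, merged_path):
    """Concatenate every part-* file in output_dir, in name order, into merged_path.

    Assumption: output_dir is a local copy of the job's output folder and
    merged_path lies outside the part-* pattern, so it is not re-read as input.
    """
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for path in part_paths:
            with open(path, "rb") as part:
                # Stream the bytes across without loading a whole part into memory.
                shutil.copyfileobj(part, merged)
    return part_paths
```

Each reducer writes its own part file, so the merged file is sorted only within each part, not globally. When the output is on HDFS, `hadoop fs -getmerge` does a similar concatenation without custom code.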
1
vote
2 answers

Launching a map reduce job in Amazon Elastic MapReduce

I am trying to launch a map reduce job in an Amazon Elastic MapReduce cluster. My map reduce job does some pre-processing before generating map/reduce tasks. This pre-processing requires third-party libraries such as javacv and opencv. Following Amazon's…
Bala
  • 675
  • 2
  • 7
  • 23
1
vote
1 answer

How to get data from S3 and use it for Elastic MapReduce, and where to write the code?

I have two big files and have uploaded them into an Amazon S3 bucket named "ccssdd", in a folder named data: data/friendships.xml and data/users.xml. The structure of users is 1 24 4 7
Shane
  • 128
  • 1
  • 3
  • 15
1
vote
2 answers

Which node sorts/shuffles the keys in Hadoop?

In a Hadoop job, which node does the sorting/shuffling phase? Does increasing the memory of that node improve the performance of sorting/shuffling?
HHH
  • 6,085
  • 20
  • 92
  • 164
1
vote
1 answer

MapReduce Amazon Python: get the line number of the input file

I have several texts and I want to know the line number and the file where a word appears. I can get the file, but not the line number. This is the map #!/usr/bin/env python import sys import os find = 'but' #word to find linesCont = 0 file =…
Carlos S
  • 13
  • 3
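A minimal sketch of a Hadoop Streaming mapper that reports both the file and a line number might look like the following. It is not the asker's script: the helper names `input_file_name` and `find_word` are hypothetical, and the output format is an assumption. Hadoop Streaming exposes job settings as environment variables with dots replaced by underscores, so the input file is read from `mapreduce_map_input_file`, falling back to the older `map_input_file`.

```python
#!/usr/bin/env python
import os
import sys

FIND = "but"  # word to find, as in the question's excerpt


def input_file_name(environ):
    """Return the input file of the current map task, or 'unknown' if not set."""
    return environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "unknown")


def find_word(lines, word, file_name):
    """Yield 'word<TAB>file:line' for each line containing word as a whitespace-separated token."""
    for line_number, line in enumerate(lines, start=1):
        # Exact token match only: "but," with punctuation attached is not counted.
        if word in line.split():
            yield "%s\t%s:%d" % (word, file_name, line_number)


def main():
    name = input_file_name(os.environ)
    for record in find_word(sys.stdin, FIND, name):
        print(record)

# When deployed as the streaming mapper, the script would end with a call to main().
```

The counter starts at 1 for each map task, so the numbers are relative to the input split that task receives. They match the real line numbers of the file only when the file is not split, for example when it is smaller than one block.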
1
vote
1 answer

Allow more than one hadoop/EMR tasks to fail before shutting down

I'm trying to use Hadoop on Amazon Elastic MapReduce, where I have thousands of map tasks to perform. I'm OK if a small percentage of the tasks fail; however, Amazon shuts down the job and I lose all of the results when the first mapper fails. Is…
1
vote
3 answers

EMR - create user log from log

EMR Newbie Alert: We have large logs containing the usage data of our web site. Customers are authenticated and identified by their customer id. Whenever we try to troubleshoot a customer issue we grep through all the logs (using the customer_id as…
1
vote
1 answer

Number of region servers on Amazon AWS

Say I start a cluster on Amazon Elastic MapReduce and have one master node instance, 2 core node instances and 15 task node instances. I think I uploaded around 1 TB of data into HBase using mapreduce jobs and incremental uploads. Now - How do I…
Run2
  • 1,839
  • 22
  • 32
1
vote
1 answer

Can I write mapper and reducer programs in different languages?

I felt like doing my Mapper operation in a Perl script, but then I realized it would be easier to write the Reducer in Python. Can the Mapper and Reducer work in different programming languages?
CtrlV
  • 115
  • 11
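With Hadoop Streaming, the mapper and reducer are separate executables that exchange lines over stdin and stdout, by default as `key<TAB>value`, so each side can be in any language that follows that contract. The sketch below is a Python reducer that would work behind a mapper written in Perl or anything else. It assumes a count-summing workload, which the question does not state, and the names `parse` and `reduce_counts` are hypothetical.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_counts(lines):
    """Sum the integer values of consecutive lines that share a key.

    Relies on the framework delivering reducer input sorted by key,
    so equal keys arrive next to each other.
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield "%s\t%d" % (key, total)


def main():
    for record in reduce_counts(sys.stdin):
        print(record)

# When deployed as the streaming reducer, the script would end with a call to main().
```

The two scripts are then passed separately through the streaming jar's `-mapper` and `-reducer` options, which is why they do not need to share a language.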
1
vote
1 answer

Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

I've figured out how to install python packages (numpy and such) at the bootstrapping step using boto, as well as copying files from S3 to my EC2 instances, still with boto. What I haven't figured out is how to distribute python scripts (or any…