Questions tagged [mapreduce]

MapReduce is an algorithm for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
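As a minimal sketch of the model described above, the classic word count can be written as one map function and one reduce function. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and chosen only for illustration; the driver runs in a single process and is not how any particular framework is implemented.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (document name, text) -> intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one key."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver (an assumption, not a real framework)."""
    groups = defaultdict(list)
    # Map phase: collect intermediate values under their intermediate key.
    for key, value in records:
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)
    # Reduce phase: one reducer call per distinct intermediate key.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog and the fox"),
]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

Only `map_fn` and `reduce_fn` are specific to the problem; the grouping in between is generic, which is the part a framework would distribute.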

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase: all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
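The requirement that all map outputs sharing a key reach the same reducer is commonly met by partitioning on a hash of the key. The sketch below assumes that approach; `partition` and `shuffle` are hypothetical helper names, and `zlib.crc32` is used only because it gives the same result in every process, unlike Python's built-in `hash()` for strings.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair to the reducer that owns its key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# Outputs of two independent mappers (made-up sample data).
mapper_outputs = [
    [("fox", 1), ("dog", 1)],
    [("fox", 1), ("cat", 1)],
]
buckets = shuffle(mapper_outputs, num_reducers=3)

# Every "fox" pair lands in exactly one bucket, whichever mapper emitted it.
owners = [i for i, bucket in enumerate(buckets) if "fox" in bucket]
print(owners, buckets[owners[0]]["fox"])
```

Because the partition depends only on the key, mappers never need to coordinate with each other to agree on where a key goes.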

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
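The two steps above can be sketched on one machine with a thread pool standing in for the worker nodes. This is an illustration of the split, distribute, and combine structure only: `split`, `worker`, and `master` are hypothetical names, the sum-of-squares problem is an arbitrary example, and threads in CPython will not give real CPU parallelism for this workload.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, num_chunks):
    """Chop the input into at most num_chunks contiguous sub-problems."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker(chunk):
    """Worker: solve one sub-problem (here, a partial sum of squares)."""
    return sum(x * x for x in chunk)

def master(data, num_workers=4):
    """Master: distribute the chunks, then combine the partial answers."""
    chunks = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, chunks))  # "Map" step
    return sum(partials)                           # "Reduce" step

total = master(list(range(10)))
print(total)  # 285
```

A worker could itself call `master` on its chunk, which would give the multi-level tree structure mentioned in the "Map" step.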

12151 questions
15
votes
2 answers

Difference in calling the job

What is the difference between calling a MapReduce job from main() and from ToolRunner.run()? When we say that the main class, say MapReduce, extends Configured implements Tool, what are the additional privileges we get which we do not have if we…
Ravi Trivedi
  • 527
  • 1
  • 5
  • 12
15
votes
4 answers

Getting Started with Avro

I want to get started with using Avro with MapReduce. Can someone suggest a good tutorial or example to get started with? I couldn't find much through an internet search.
Sri
  • 201
  • 2
  • 4
  • 6
15
votes
3 answers

Hadoop on windows server

I'm thinking about using Hadoop to process large text files on my existing Windows 2003 servers (about 10 quad-core machines with 16 GB of RAM). The questions are: Is there any good tutorial on how to configure a Hadoop cluster on Windows? What are…
Luca Martinetti
  • 3,396
  • 6
  • 34
  • 49
15
votes
7 answers

/bin/bash: /bin/java: No such file or directory error in Yarn apps in MacOS

I was trying to run a simple wordcount MapReduce program using Java 1.7 SDK and Hadoop 2.7.1 on Mac OS X El Capitan 10.11, and I am getting the following error message in my container log "stderr": /bin/bash: /bin/java: No such file or…
Gangadhar Kadam
  • 536
  • 1
  • 4
  • 15
15
votes
4 answers

creating partition in external table in hive

I have successfully created and added dynamic partitions in an internal table in Hive, i.e. by using the following steps: 1- created a source table 2- loaded data from local into the source table 3- created another table with partitions - partition_table 4-…
Anoop Mamgain
  • 187
  • 2
  • 3
  • 13
15
votes
2 answers

Why is Spark faster than Hadoop Map Reduce

Can someone explain, using the word count example, why Spark would be faster than MapReduce?
Victor
  • 16,609
  • 71
  • 229
  • 409
15
votes
2 answers

Hadoop Mapper is failing because of "Container killed by the ApplicationMaster"

I am trying to execute a MapReduce program on Hadoop. When I submit my job to the Hadoop single-node cluster, the job is created but fails with the message "Container killed by the ApplicationMaster". The input used is of the size 10…
Harry
  • 253
  • 1
  • 6
  • 19
15
votes
3 answers

How to use Cassandra's Map Reduce with or w/o Pig?

Can someone explain how MapReduce works with Cassandra .6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client"…
Brent
  • 23,354
  • 10
  • 44
  • 49
15
votes
3 answers

MongoDB Aggregation Framework performance slow over millions of documents

Background: Our system is carrier grade and extremely robust. It has been load tested to handle 5000 transactions per second, and for each transaction a document is inserted into a single MongoDB collection (no updates or queries in this application,…
Ashley Brener
  • 268
  • 3
  • 12
15
votes
4 answers

Hive ParseException - cannot recognize input near 'end' 'string'

I am getting the following error when trying to create a Hive table from an existing DynamoDB table: NoViableAltException(88@[]) at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:9123) at…
Jens Roland
  • 27,450
  • 14
  • 82
  • 104
15
votes
5 answers

How to build OpenCV with Java under Linux using command line?(Gonna use it in MapReduce)

Recently I've been trying out OpenCV for my graduation project. I've had some success under the Windows environment. Because the Windows package of OpenCV comes with pre-built libraries, I don't have to worry about how to build them. But since the…
user2535650
  • 265
  • 1
  • 2
  • 8
14
votes
7 answers

How to specify KeyValueTextInputFormat Separator in Hadoop-.20 api?

In the new API (apache.hadoop.mapreduce.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (which is the default) to separate key and value? Sample input: one,first line two,second line Output required: Key : one Value : first…
Pradeep Bhadani
  • 4,435
  • 6
  • 29
  • 48
14
votes
2 answers

State of Map-Reduce on Appengine?

There is appengine-mapreduce, which seems to be the official way to do things on AppEngine. But there seems to be no documentation besides some hacked-together wiki pages and lengthy videos. There are statements that the lib only supports the map step. But the…
max
  • 29,122
  • 12
  • 52
  • 79
14
votes
2 answers

How to customize Writable class in Hadoop?

I'm trying to implement a Writable class, but I have no idea how to implement one if my class contains a nested object, such as a list. Could anybody help me? Thanks. public class StorageClass implements Writable{ public String…
afancy
  • 673
  • 4
  • 10
  • 18
14
votes
1 answer

How to translate from SQL to NoSQL/MapReduce?

I have a background working with relational databases but recently started to dabble in CouchDB and was surprised by how some non-relational operations, which would be simple in SQL, were not first-class functions in CouchDB. I would appreciate you…
sferik
  • 1,795
  • 2
  • 15
  • 22