Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase in parallel; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
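The two steps above can be sketched with a minimal, hypothetical in-memory word count in plain Java (no Hadoop dependencies; class and method names are illustrative only): the map step emits (word, 1) pairs, a shuffle groups values by key, and the reduce step merges each group by summing.

```java
import java.util.*;

public class WordCountSketch {
    // "Map" step: split one input line into (word, 1) intermediate pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle: group all intermediate values that share a key, so each
    // key's values can be handed to a single reducer.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // "Reduce" step: merge all values for one key (here: sum the counts).
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) intermediate.addAll(map(line)); // each map call is independent
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In a real framework such as Hadoop, the map calls run on many worker nodes in parallel and the shuffle moves data across the network, but the contract is the same: maps are independent, and all values for one key reach one reducer.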

12151 questions
17
votes
1 answer

What do job.setOutputKeyClass and job.setOutputReduceClass refer to?

I thought that they refer to the Reducer but in my program I have public static class MyMapper extends Mapper< LongWritable, Text, Text, Text > and public static class MyReducer extends Reducer< Text, Text, NullWritable,…
nik686
  • 705
  • 3
  • 9
  • 17
17
votes
2 answers

Why do we need Hadoop passwordless ssh?

AFAIK, passwordless ssh is needed so that the master node can start the daemon processes on each slave node. Apart from that, is there any use of having passwordless ssh for Hadoop's operation? How are the user code jars and data chunks…
Tejas Patil
  • 6,149
  • 1
  • 23
  • 38
17
votes
5 answers

What additional benefit does YARN bring to the existing MapReduce?

Yarn differs in its infrastructure layer from the original map reduce architecture in the following way: In YARN, the job tracker is split into two different daemons called Resource Manager and Node Manager (node specific). The resource manager…
Abhishek Jain
  • 4,478
  • 8
  • 34
  • 51
17
votes
2 answers

CouchDB: map-reduce in Erlang

How can I write map-reduce functions in Erlang for CouchDB? I am sure Erlang is faster than JavaScript.
edbond
  • 3,921
  • 19
  • 26
17
votes
2 answers

"Combiner" class in a MapReduce job

A Combiner runs after the Mapper and before the Reducer; it receives as input all data emitted by the Mapper instances on a given node, then emits its output to the Reducers. Also, if a reduce function is both commutative and associative, then it…
wayen wan
  • 207
  • 1
  • 2
  • 7
16
votes
1 answer

All three constructors of org.apache.hadoop.mapreduce.Job are deprecated, what is the best way to construct a Job class?

All three constructors of org.apache.hadoop.mapreduce.Job are deprecated, is there a way to construct a Job class the non-deprecated way? Thanks.
icycandy
  • 1,193
  • 2
  • 12
  • 20
16
votes
2 answers

MongoDB map/reduce over multiple collections?

First, the background. I used to have a collection logs and used map/reduce to generate various reports. Most of these reports were based on data from within a single day, so I always had a condition d: SOME_DATE. When the logs collection grew…
ibz
  • 44,461
  • 24
  • 70
  • 86
16
votes
4 answers

MultipleOutputFormat in hadoop

I'm a newbie in Hadoop. I'm trying out the Wordcount program. Now to try out multiple output files, I use MultipleOutputFormat. This link helped me in doing it.…
raj
  • 3,769
  • 4
  • 25
  • 43
16
votes
3 answers

Split size vs Block size in Hadoop

What is the relationship between split size and block size in Hadoop? As I read in this, split size must be n times the block size (n is an integer and n > 0); is this correct? Is there any required relationship between split size and block size?
duong_dajgja
  • 4,196
  • 1
  • 38
  • 65
16
votes
2 answers

How to define avro schema for complex json document?

I have a JSON document that I would like to convert to Avro and need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the avro schema: { "uid": 29153333, "somefield": "somevalue", "options": [ …
user2727704
  • 625
  • 1
  • 10
  • 21
16
votes
6 answers

YARN Resourcemanager not connecting to nodemanager

Thanks in advance for any help. I am running the following versions: Hadoop 2.2, ZooKeeper 3.4.5, HBase 0.96, Hive 0.12. When I go to http://:50070 I am able to correctly see that 2 nodes are running. The problem is when I go to http://:8088 it shows 0…
Aman Chawla
  • 704
  • 2
  • 8
  • 25
16
votes
4 answers

What are the disadvantages of MapReduce?

What are the disadvantages of MapReduce? There are lots of advantages of MapReduce, but I would like to know its disadvantages too.
DilanG
  • 1,197
  • 1
  • 26
  • 42
16
votes
1 answer

Type mismatch in value from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.Text

I am trying to tweak an existing problem to suit my needs. Basically, the input is simple text; I process it and pass a key/value pair to the reducer, where I create a JSON, so there is a key but no value. So mapper: Input: Text/Text, Output: Text/Text. Reducer:…
frazman
  • 32,081
  • 75
  • 184
  • 269
16
votes
2 answers

Hadoop: How can I merge reducer outputs into a single file?

I know that the "getmerge" command in the shell can do this work. But what should I do if I want to merge these outputs after the job via the HDFS API for Java? What I actually want is a single merged file on HDFS. The only thing I can think of is to start an…
thomaslee
  • 407
  • 1
  • 7
  • 21
16
votes
1 answer

Hive enforces schema during read time?

What is the difference between, and the meaning of, these two statements that I encountered during a lecture here: 1. Traditional databases enforce schema during load time. 2. Hive enforces schema during read time.
London guy
  • 27,522
  • 44
  • 121
  • 179