Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase in parallel; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
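The two steps above can be sketched with a minimal, hypothetical in-memory word count in plain Java (no Hadoop dependencies; class and method names are illustrative only): the map step emits (word, 1) pairs, a shuffle groups values by key, and the reduce step merges each group by summing.

```java
import java.util.*;

public class WordCountSketch {
    // "Map" step: split one input line into (word, 1) intermediate pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle: group all intermediate values that share a key, so each
    // key's values can be handed to a single reducer.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // "Reduce" step: merge all values for one key (here: sum the counts).
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) intermediate.addAll(map(line)); // each map call is independent
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In a real framework such as Hadoop, the map calls run on many worker nodes in parallel and the shuffle moves data across the network, but the contract is the same: maps are independent, and all values for one key reach one reducer.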

12151 questions
17
votes
1 answer

What do job.setOutputKeyClass and job.setOutputReduceClass refer to?

I thought that they refer to the Reducer but in my program I have public static class MyMapper extends Mapper< LongWritable, Text, Text, Text > and public static class MyReducer extends Reducer< Text, Text, NullWritable,…
nik686
  • 705
  • 3
  • 9
  • 17
17
votes
2 answers

Why do we need Hadoop passwordless ssh?

AFAIK, passwordless ssh is needed so that the master node can start the daemon processes on each slave node. Apart from that, is there any use of having passwordless ssh for Hadoop's operation? How are the user code jars and data chunks…
Tejas Patil
  • 6,149
  • 1
  • 23
  • 38
17
votes
5 answers

What additional benefit does YARN bring to the existing MapReduce?

Yarn differs in its infrastructure layer from the original map reduce architecture in the following way: In YARN, the job tracker is split into two different daemons called Resource Manager and Node Manager (node specific). The resource manager…
Abhishek Jain
  • 4,478
  • 8
  • 34
  • 51
17
votes
2 answers

CouchDB: map-reduce in Erlang

How can I write map-reduce functions in Erlang for CouchDB? I am sure Erlang is faster than JavaScript.
edbond
  • 3,921
  • 19
  • 26
17
votes
2 answers

"Combiner" class in a MapReduce job

A Combiner runs after the Mapper and before the Reducer; it receives as input all data emitted by the Mapper instances on a given node, then emits its output to the Reducers. Also, if a reduce function is both commutative and associative, then it…
wayen wan
  • 207
  • 1
  • 2
  • 7
16
votes
1 answer

All three constructors of org.apache.hadoop.mapreduce.Job are deprecated, what is the best way to construct a Job class?

All three constructors of org.apache.hadoop.mapreduce.Job are deprecated, is there a way to construct a Job class the non-deprecated way? Thanks.
icycandy
  • 1,193
  • 2
  • 12
  • 20
16
votes
2 answers

MongoDB map/reduce over multiple collections?

First, the background. I used to have a collection logs and used map/reduce to generate various reports. Most of these reports were based on data from within a single day, so I always had a condition d: SOME_DATE. When the logs collection grew…
ibz
  • 44,461
  • 24
  • 70
  • 86
16
votes
4 answers

MultipleOutputFormat in hadoop

I'm a newbie in Hadoop. I'm trying out the Wordcount program. Now to try out multiple output files, I use MultipleOutputFormat. This link helped me in doing it.…
raj
  • 3,769
  • 4
  • 25
  • 43
16
votes
3 answers

Split size vs Block size in Hadoop

What is the relationship between split size and block size in Hadoop? As I read in this, split size must be n times the block size (n is an integer and n > 0); is this correct? Is there any required relationship between split size and block size?
duong_dajgja
  • 4,196
  • 1
  • 38
  • 65
16
votes
2 answers

How to define avro schema for complex json document?

I have a JSON document that I would like to convert to Avro and need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the avro schema: { "uid": 29153333, "somefield": "somevalue", "options": [ …
user2727704
  • 625
  • 1
  • 10
  • 21
16
votes
6 answers

YARN Resourcemanager not connecting to nodemanager

Thanks in advance for any help. I am running the following versions: Hadoop 2.2, ZooKeeper 3.4.5, HBase 0.96, Hive 0.12. When I go to http://:50070 I am able to correctly see that 2 nodes are running. The problem is when I go to http://:8088 it shows 0…
Aman Chawla
  • 704
  • 2
  • 8
  • 25
16
votes
4 answers

What are the disadvantages of MapReduce?

What are the disadvantages of MapReduce? There are lots of advantages of MapReduce, but I would like to know its disadvantages too.
DilanG
  • 1,197
  • 1
  • 26
  • 42
16
votes
1 answer

Type mismatch in value from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.Text

I am trying to tweak an existing problem to suit my needs. Basically, the input is simple text; I process it and pass a key/value pair to the reducer, where I create a JSON, so there is a key but no value. So mapper: Input: Text/Text, Output: Text/Text. Reducer:…
frazman
  • 32,081
  • 75
  • 184
  • 269
16
votes
2 answers

Hadoop: How can I merge reducer outputs into a single file?

I know that the "getmerge" command in the shell can do this work. But what should I do if I want to merge these outputs after the job via the HDFS API for Java? What I actually want is a single merged file on HDFS. The only thing I can think of is to start an…
thomaslee
  • 407
  • 1
  • 7
  • 21
16
votes
1 answer

Hive enforces schema during read time?

What is the difference between, and the meaning of, these two statements that I encountered during a lecture here: 1. Traditional databases enforce schema during load time. 2. Hive enforces schema during read time.
London guy
  • 27,522
  • 44
  • 121
  • 179