Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase: all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
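The map and reduce steps above can be sketched as a minimal in-memory word count (the canonical MapReduce example). This is a language-agnostic illustration of the model only, not the Hadoop API; the function names and the tiny driver are hypothetical:

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share the same key.
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Shuffle: group every intermediate value by its key, so each reducer
    # sees all values for one key at the same time.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reducer(k, vs))

lines = enumerate(["the quick brown fox", "the lazy dog"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job the shuffle is performed by the framework across machines, and each mapper and reducer runs as a separate task on a worker node.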

12151 questions
47 votes · 3 answers

Is it better to use the mapred or the mapreduce package to create a Hadoop Job?

To create MapReduce jobs you can use either the old org.apache.hadoop.mapred package or the newer org.apache.hadoop.mapreduce package for Mappers and Reducers, Jobs ... The former was marked as deprecated, but the deprecation has since been reverted. Now…
momo13 • 473 • 1 • 4 • 6
46 votes · 1 answer

hadoop.mapred vs hadoop.mapreduce?

Why are there two separate map-reduce packages in Apache's hadoop package tree: org.apache.hadoop.mapred…
bartonm • 1,600 • 3 • 18 • 30
45 votes · 3 answers

Explode the Array of Struct in Hive

This is the below Hive Table CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable ( USER_ID BIGINT, NEW_ITEM ARRAY<…> ) And this is the data in the above table- 1015826235 …
arsenal • 23,366 • 85 • 225 • 331
44 votes · 11 answers

How to get the input file name in the mapper in a Hadoop program?

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to know which file the mapper has read.
HHH • 6,085 • 20 • 92 • 164
44 votes · 1 answer

What are SUCCESS and part-r-00000 files in hadoop

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the SUCCESS file? Why does the output file have the name part-r-00000?…
ravi • 6,140 • 18 • 77 • 154
43 votes · 4 answers

How to write 'map only' hadoop jobs?

I'm a novice at hadoop and getting familiar with the style of map-reduce programming, but now I've faced a problem: sometimes I need only map for a job and need the map result directly as output, which means the reduce phase is not needed here, how…
Breakinen • 619 • 2 • 7 • 13
42 votes · 4 answers

MongoDB: Terrible MapReduce Performance

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long. I have a database table in MySQL that tracks the…
mellowsoon • 22,273 • 19 • 57 • 75
41 votes · 15 answers

Setting the number of map tasks and reduce tasks

I am currently running a job. I fixed the number of map tasks to 20 but am getting a higher number. I also set the number of reduce tasks to zero but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not…
asembereng • 675 • 2 • 8 • 18
41 votes · 10 answers

How does Hadoop perform input splits?

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of the form <k, v>, where k is the offset of the line from the beginning and…
Deepak • 2,003 • 6 • 30 • 32
39 votes · 5 answers

List the namenode and datanodes of a cluster from any node?

From any node in a Hadoop cluster, what is the command to identify the running namenode, and to identify all running datanodes? I have looked through the commands manual and have not found this.
T. Webster • 9,605 • 6 • 67 • 94
39 votes · 1 answer

MongoDB aggregation comparison: group(), $group and MapReduce

I am somewhat confused about when to use group(), aggregate with $group or mapreduce. I read the documentation at http://www.mongodb.org/display/DOCS/Aggregation for group(), http://docs.mongodb.org/manual/reference/aggregation/group/#_S_group for…
Aafreen Sheikh • 4,949 • 6 • 33 • 43
37 votes · 3 answers

What is Google's Dremel? How is it different from Mapreduce?

Google's Dremel is described here. What's the difference between Dremel and Mapreduce?
Yktula • 14,179 • 14 • 48 • 71
37 votes · 8 answers

Hadoop DistributedCache is deprecated - what is the preferred API?

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows: // In the driver JobConf conf = new…
DNA • 42,007 • 12 • 107 • 146
36 votes · 6 answers

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions. All reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once…
KARASZI István • 30,900 • 8 • 101 • 128
35 votes · 1 answer

Best way to do one-to-many "JOIN" in CouchDB

I am looking for a CouchDB equivalent to "SQL joins". In my example there are CouchDB documents that are list elements: { "type" : "el", "id" : "1", "content" : "first" } { "type" : "el", "id" : "2", "content" : "second" } { "type" : "el", "id" :…
mit • 11,083 • 11 • 50 • 74