Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase: all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
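The map and reduce steps above can be sketched as a minimal in-memory word count (the canonical MapReduce example). This is a language-agnostic illustration of the model only, not the Hadoop API; the function names and the tiny driver are hypothetical:

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share the same key.
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Shuffle: group every intermediate value by its key, so each reducer
    # sees all values for one key at the same time.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reducer(k, vs))

lines = enumerate(["the quick brown fox", "the lazy dog"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job the shuffle is performed by the framework across machines, and each mapper and reducer runs as a separate task on a worker node.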

12151 questions
47 votes · 3 answers

Is it better to use the mapred or the mapreduce package to create a Hadoop Job?

To create MapReduce jobs you can use either the old org.apache.hadoop.mapred package or the newer org.apache.hadoop.mapreduce package for Mappers and Reducers, Jobs ... The former was marked as deprecated, but the deprecation has since been reverted. Now…
momo13 • 473 • 1 • 4 • 6
46 votes · 1 answer

hadoop.mapred vs hadoop.mapreduce?

Why are there two separate map-reduce packages in Apache's hadoop package tree: org.apache.hadoop.mapred…
bartonm • 1,600 • 3 • 18 • 30
45 votes · 3 answers

Explode the Array of Struct in Hive

This is the below Hive Table CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable ( USER_ID BIGINT, NEW_ITEM ARRAY<…> ) And this is the data in the above table- 1015826235 …
arsenal • 23,366 • 85 • 225 • 331
44 votes · 11 answers

How to get the input file name in the mapper in a Hadoop program?

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to know which file the mapper has read.
HHH • 6,085 • 20 • 92 • 164
44 votes · 1 answer

What are SUCCESS and part-r-00000 files in hadoop

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the SUCCESS file? Why does the output file have the name part-r-00000?…
ravi • 6,140 • 18 • 77 • 154
43 votes · 4 answers

How to write 'map only' hadoop jobs?

I'm a novice at hadoop and getting familiar with the style of map-reduce programming, but now I've faced a problem: sometimes I need only map for a job and need the map result directly as output, which means the reduce phase is not needed here, how…
Breakinen • 619 • 2 • 7 • 13
42 votes · 4 answers

MongoDB: Terrible MapReduce Performance

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long. I have a database table in MySQL that tracks the…
mellowsoon • 22,273 • 19 • 57 • 75
41 votes · 15 answers

Setting the number of map tasks and reduce tasks

I am currently running a job. I fixed the number of map tasks to 20 but am getting a higher number. I also set the number of reduce tasks to zero but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not…
asembereng • 675 • 2 • 8 • 18
41 votes · 10 answers

How does Hadoop perform input splits?

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of the form <k, v>, where k is the offset of the line from the beginning and…
Deepak • 2,003 • 6 • 30 • 32
39 votes · 5 answers

List the namenode and datanodes of a cluster from any node?

From any node in a Hadoop cluster, what is the command to identify the running namenode, and to identify all running datanodes? I have looked through the commands manual and have not found this.
T. Webster • 9,605 • 6 • 67 • 94
39 votes · 1 answer

MongoDB aggregation comparison: group(), $group and MapReduce

I am somewhat confused about when to use group(), aggregate with $group or mapreduce. I read the documentation at http://www.mongodb.org/display/DOCS/Aggregation for group(), http://docs.mongodb.org/manual/reference/aggregation/group/#_S_group for…
Aafreen Sheikh • 4,949 • 6 • 33 • 43
37 votes · 3 answers

What is Google's Dremel? How is it different from Mapreduce?

Google's Dremel is described here. What's the difference between Dremel and Mapreduce?
Yktula • 14,179 • 14 • 48 • 71
37 votes · 8 answers

Hadoop DistributedCache is deprecated - what is the preferred API?

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows: // In the driver JobConf conf = new…
DNA • 42,007 • 12 • 107 • 146
36 votes · 6 answers

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions. All reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once…
KARASZI István • 30,900 • 8 • 101 • 128
35 votes · 1 answer

Best way to do one-to-many "JOIN" in CouchDB

I am looking for a CouchDB equivalent to "SQL joins". In my example there are CouchDB documents that are list elements: { "type" : "el", "id" : "1", "content" : "first" } { "type" : "el", "id" : "2", "content" : "second" } { "type" : "el", "id" :…
mit • 11,083 • 11 • 50 • 74