Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.
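The key-grouping requirement above is typically met with a hash partitioner. The following is an illustrative Python sketch of the idea (not Hadoop's actual Partitioner API; all names here are made up):

```python
def partition(key, num_reducers):
    # Deterministic: the same key always maps to the same reducer index,
    # so every map output sharing that key reaches the same reducer.
    return hash(key) % num_reducers

# Route intermediate pairs from (possibly many) mappers into reducer buckets.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition(key, 4), []).append((key, value))

# Both ("apple", 1) pairs land in the same bucket, regardless of which
# mapper emitted them, so a single reducer sees all values for "apple".
```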

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
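The two steps can be simulated in a few lines of Python. This is a single-process sketch of the model applied to word count, purely for illustration; a real framework distributes the map calls, the group-by-key "shuffle", and the reduce calls across nodes:

```python
from collections import defaultdict

def map_fn(document):
    # "Map" step: emit an intermediate (key, value) pair per word.
    for word in document.split():
        yield word, 1

def reduce_fn(key, values):
    # "Reduce" step: merge all intermediate values that share a key.
    return key, sum(values)

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

# mapreduce(["the cat sat", "the dog"]) counts each word across all documents.
```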

12151 questions
20
votes
4 answers

Hadoop Distribution Differences

Can somebody outline the differences between the various Hadoop distributions available: Cloudera - http://www.cloudera.com/hadoop Yahoo - http://developer.yahoo.net/blogs/hadoop/ using the Apache Hadoop distro as a baseline. Is there a…
Jonathan Holloway
20
votes
5 answers

MongoDB MapReduce - Emit one key/one value doesn't call reduce

So I'm new to MongoDB and MapReduce in general and came across this "quirk" (or at least in my mind a quirk). Say I have objects in my collection like so: {'key':5, 'value':5} {'key':5, 'value':4} {'key':5, 'value':1} {'key':4, 'value':6} {'key':4,…
IamAlexAlright
19
votes
3 answers

IdentityReducer in the new Hadoop API

I spent almost a day but couldn't figure out how to use IdentityReducer in the new Hadoop API. All the references or classes I can find are for the old API. And obviously mixing the old API's IdentityReducer class into new-API code doesn't go well.…
kee
19
votes
2 answers

Merging two collections in MongoDB

I've been trying to use MapReduce in MongoDB to do what I think is a simple procedure. I don't know if this is the right approach, or if I should even be using MapReduce. I googled what keywords I thought of and tried to hit the docs where I thought…
TFX
19
votes
4 answers

Hadoop gzip compressed files

I am new to Hadoop and trying to process a Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but such a file can only be handled by a single mapper, as only one mapper can decompress it. This seems…
Boolean
19
votes
11 answers

Yarn MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0

I have set up a 2-node cluster of Hadoop 2.3.0. It's working fine and I can successfully run the distributedshell-2.2.0.jar example. But when I try to run any MapReduce job I get an error. I have set up MapRed.xml and other configs for running MapReduce jobs…
TonyMull
19
votes
2 answers

What is the basic difference between JobConf and Job?

Hi, I wanted to know the basic difference between JobConf and Job objects. Currently I am submitting my job like this: JobClient.runJob(jobconf); I saw another way of submitting jobs, like this: Configuration conf = getConf(); Job job = new Job(conf,…
user1585111
19
votes
2 answers

How does partitioning in MapReduce exactly work?

I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me, especially regarding the partitioning of the intermediate results. I…
user1494080
19
votes
4 answers

Difference between Hadoop Map Reduce and Google Map Reduce

What is the difference between Hadoop MapReduce and Google MapReduce? Is it just that Hadoop provides a standardized implementation of MapReduce? What else differs?
Monica Shiralkar
19
votes
1 answer

Distributed local clustering coefficient algorithm (MapReduce/Hadoop)

I have implemented a MapReduce-based local clustering coefficient algorithm. However, I have run into serious trouble with bigger datasets or specific datasets (with a high average node degree). I tried to tune my Hadoop platform and the code…
alien01
18
votes
2 answers

What is the fastest way to bulk load data into HBase programmatically?

I have a plain-text file with possibly millions of lines which needs custom parsing, and I want to load it into an HBase table as fast as possible (using the Hadoop or HBase Java client). My current solution is based on a MapReduce job without the Reduce…
Cihan Keser
18
votes
1 answer

Passing arguments to Hadoop mappers

I'm using the new Hadoop API and looking for a way to pass some parameters (a few strings) to mappers. How can I do that? This solution works for the old API: JobConf job = (JobConf)getConf(); job.set("NumberOfDocuments", args[0]); Here,…
wlk
18
votes
1 answer

compute bootstrapping algorithm using Map/Reduce

This question was originally a homework assignment I had, but my answer was wrong, and I'm curious what the best solution to this problem is. The goal is to compute key aspects of the "Recommender System bootstrapping algorithm" using 4 MapReduce…
amit
18
votes
3 answers

Container killed by the ApplicationMaster Exit code is 143

I've been getting the following error in several cases: 2017-03-23 11:55:10,794 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1490079327128_0048_r_000003_0:…
Yuval
18
votes
2 answers

Got InterruptedException while executing word count mapreduce job

I have installed Cloudera VM version 5.8 on my machine. When I execute a word count MapReduce job, it throws the exception below. `16/09/06 06:55:49 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native…