Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
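The two steps above can be sketched as a single-process word count (a minimal in-memory sketch, not a distributed implementation; the `mapper`, `shuffle`, and `reducer` names are illustrative, not part of any framework's API):

```python
from collections import defaultdict
from itertools import chain

def mapper(document):
    # "Map" step: emit an intermediate (key, value) pair per word.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key, so that all values sharing
    # a key are presented to the same reducer call.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # "Reduce" step: merge all intermediate values for one key.
    return (key, sum(values))

documents = ["the cat sat", "the dog sat"]
intermediate = chain.from_iterable(mapper(d) for d in documents)
result = dict(reducer(k, vs) for k, vs in shuffle(intermediate).items())
print(result)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real framework such as Hadoop, each `mapper(d)` call could run on a different node, and the shuffle happens over the network; the per-key grouping is what makes the reduce phase parallelizable.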

12151 questions
14
votes
3 answers

How to sum fields of collection elements without mapping them first (like foldLeft/reduceLeft)?

Consider this class: case class Person(val firstName: String, val lastName: String, age: Int) val persons = Person("Jane", "Doe", 42) :: Person("John", "Doe", 45) :: Person("Joe", "Doe", 43) :: Person("Doug", "Don", 65) :: …
soc
  • 27,983
  • 20
  • 111
  • 215
14
votes
3 answers

Hadoop Map Reduce: Algorithms

Can someone point me to a good web site with a good collection of Hadoop algorithms. For example, the most complex thing that I can do with Hadoop right now is PageRank. Other than that, I can do trivial things like word counting and stuff. I want…
denniss
  • 17,229
  • 26
  • 92
  • 141
14
votes
2 answers

How does Apache Flink compare to Mapreduce on Hadoop?

How does Apache Flink compare to MapReduce on Hadoop? In what ways is it better, and why?
Shu
  • 153
  • 1
  • 7
14
votes
2 answers

How to calculate the running total using aggregate?

I'm developing a simple financial app for keeping track of incomes and outcomes. For the sake of simplicity, let's suppose these are some of my documents: { description: "test1", amount: 100, dateEntry: ISODate("2015-01-07T23:00:00Z") } {…
Fabio B.
  • 9,138
  • 25
  • 105
  • 177
14
votes
3 answers

Using Hadoop, are my reducers guaranteed to get all the records with the same key?

I'm running a Hadoop job (using Hive, actually) that is supposed to uniq lines in many text files. In the reduce step, it chooses the most recently timestamped record for each key. Does Hadoop guarantee that every record with the same key, output by…
samg
  • 3,496
  • 1
  • 25
  • 26
14
votes
2 answers

Wordcount program is stuck in hadoop-2.3.0

I installed hadoop-2.3.0 and tried to run the wordcount example, but it starts the job and sits idle: hadoop@ubuntu:~$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /myprg…
USB
  • 6,019
  • 15
  • 62
  • 93
14
votes
2 answers

Hadoop FileSystem closed exception when doing BufferedReader.close()

From within the Reducer's setup method, I am trying to close a BufferedReader object and getting a FileSystem closed exception. It does not happen all the time. This is the piece of code I used to create the BufferedReader. String fileName =
Venk K
  • 1,157
  • 5
  • 14
  • 25
14
votes
3 answers

Mapper input Key-Value pair in Hadoop

Normally, we write the mapper in the form: public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> Here the input key-value pair for the mapper is - as far as I know when the mapper gets the input…
Ronin
  • 2,027
  • 8
  • 32
  • 39
14
votes
3 answers

On what basis does the MapReduce framework decide whether to launch a combiner?

As per the definition, "The Combiner may be called 0, 1, or many times on each key between the mapper and reducer." I want to know on what basis the MapReduce framework decides how many times the combiner will be launched.
banjara
  • 3,800
  • 3
  • 38
  • 61
14
votes
2 answers

Using Hadoop for Parallel Processing rather than Big Data

I manage a small team of developers and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel" - These generally involve running a single script on a single computer for several days, a…
Snowpoch
  • 143
  • 1
  • 1
  • 4
14
votes
2 answers

Analytics and Mining of data sitting on Cassandra

We have a lot of user interaction data from various websites stored in Cassandra such as cookies, page-visits, ads-viewed, ads-clicked, etc.. that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying.…
NG Algo
  • 3,570
  • 2
  • 18
  • 27
13
votes
4 answers

How to use MATLAB code in mapper (Hadoop)?

I have MATLAB code that processes images. I want to create a Hadoop mapper that uses that code. I came across the following solutions but am not sure which one is best (as it is very difficult to install MATLAB Compiler Runtime on each slave node in…
Harsh
  • 265
  • 6
  • 18
13
votes
3 answers

Hadoop: how to access (many) photo images to be processed by map/reduce?

I have 10M+ photos saved on the local file system. Now I want to go through each of them to analyze the binary of the photo to see if it's a dog. I basically want to do the analysis on a clustered hadoop environment. The problem is, how should I…
leslie
  • 11,858
  • 7
  • 23
  • 22
13
votes
1 answer

CouchDB "Join" two documents

I have two documents that look a bit like so: Doc { _id: AAA, creator_id: ..., data: ... } DataKey { _id: ..., credits_left: 500, times_used: 0, data_id: AAA } What I want to do is create a view which would allow me to pass the…
Obto
  • 1,377
  • 1
  • 20
  • 33
13
votes
1 answer

Map Reduce with F# agents

After playing with F# agents I tried to do a map reduce using them. The basic structure I use is: a map supervisor which queues up all the work to do in its state and receives work requests from map workers; a reduce supervisor does the same thing as…
jlezard
  • 1,417
  • 2
  • 15
  • 32