Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that every output of the map operation sharing the same key is presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
12
votes
8 answers

Hadoop MapReduce error: Input path does not exist: hdfs://localhost:54310/user/hduser/input

I have installed Hadoop 2.6 on Ubuntu Linux 15.04 and it's running fine. But when I run a sample MapReduce test program, it gives the following error: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:…
kan1969
  • 121
  • 1
  • 2
  • 5
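The usual cause of the error above is that the HDFS input directory was never created or populated; the job reads from HDFS, not from the local filesystem. A hedged sketch of the typical fix, driving the standard hdfs dfs CLI from Python; the file name sample.txt is a placeholder:

```python
import subprocess

# Create the HDFS input directory the job expects (it does not exist yet).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/hduser/input"], check=True)

# Copy a local input file into HDFS; "sample.txt" is a placeholder name.
subprocess.run(["hdfs", "dfs", "-put", "sample.txt", "/user/hduser/input/"], check=True)

# Verify the path the error message complained about now exists.
subprocess.run(["hdfs", "dfs", "-ls", "/user/hduser/input"], check=True)
```

After this, rerunning the job with the same input path should get past the InvalidInputException.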
12
votes
4 answers

How to access the zeroth element in reduce to count repeats in an array

Prompted by a Node School exercise, I am trying to use reduce to count the number of times a string is repeated in an array. var fruits = ["Apple", "Banana", "Apple", "Durian", "Durian", "Durian"], obj = {}; fruits.reduce(function(prev, curr, index,…
1252748
  • 14,597
  • 32
  • 109
  • 229
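For the counting question above, the key insight is that the accumulator (prev) should be a dictionary indexed by the current element itself; indexing the source array with the element is what triggers the confusion. The question is about JavaScript's Array.prototype.reduce, but the same accumulator pattern in Python's functools.reduce looks like this:

```python
from functools import reduce

fruits = ["Apple", "Banana", "Apple", "Durian", "Durian", "Durian"]

def count(acc, curr):
    # Use the current element as the key into the accumulator;
    # never index the source array with the element itself.
    acc[curr] = acc.get(curr, 0) + 1
    return acc

counts = reduce(count, fruits, {})
print(counts)  # {'Apple': 2, 'Banana': 1, 'Durian': 3}
```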
12
votes
1 answer

When to use map reduce over Aggregation Pipeline in MongoDB?

While looking at the documentation for map-reduce, I found that: NOTE: For most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some flexibility that…
Dev
  • 13,492
  • 19
  • 81
  • 174
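As a hedged illustration of why the pipeline usually wins: a job that needs a map function emitting (key, 1) and a reduce function summing values collapses into a single $group stage. A PyMongo sketch, assuming a local server and an illustrative visits collection:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
collection = client["test"]["visits"]              # illustrative collection

# Equivalent of a map-reduce that emits (page, 1) and sums per key:
pipeline = [
    {"$group": {"_id": "$page", "total": {"$sum": 1}}},
    {"$sort": {"total": -1}},
]
for doc in collection.aggregate(pipeline):
    print(doc["_id"], doc["total"])
```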
12
votes
5 answers

When to prefer Hadoop MapReduce over Spark?

A very simple question: in which cases should I prefer Hadoop MapReduce over Spark? (I hope this question has not been asked yet; at least I didn't find it...) I am currently doing a comparison of these two processing frameworks, and from what I have…
Daniel
  • 2,409
  • 2
  • 26
  • 42
12
votes
2 answers

Mongo Map Reduce first time

First-time Map/Reduce user here, using MongoDB. I have a lot of page visit data which I'd like to make some sense of by using Map/Reduce. Below is basically what I want to do, but as a total beginner at Map/Reduce, I think this is above my…
James
  • 5,942
  • 15
  • 48
  • 72
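For the page-visit question above, the essential map/reduce pattern, independent of MongoDB's API, is: map each visit record to a (page, 1) pair, then reduce by summing the values that share a page key. A plain-Python sketch with made-up record fields:

```python
from collections import defaultdict

# Illustrative visit records; real documents will carry more fields.
visits = [
    {"page": "/home", "user": "a"},
    {"page": "/about", "user": "b"},
    {"page": "/home", "user": "c"},
]

totals = defaultdict(int)
for visit in visits:            # "map": emit a (page, 1) pair per record
    totals[visit["page"]] += 1  # "reduce": sum the values sharing a key

print(dict(totals))  # {'/home': 2, '/about': 1}
```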
12
votes
1 answer

To make a distance matrix or to repeatedly calculate distance

I'm working on a K-medoids algorithm implementation. It is a clustering algorithm, and one of its steps includes finding the most representative point in a cluster. So, here's the thing: I have a certain number of clusters, and each cluster contains a…
Kobe-Wan Kenobi
  • 3,694
  • 2
  • 40
  • 67
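The trade-off in the question above is memory versus repeated work: a precomputed n-by-n distance matrix costs O(n^2) memory, but every pairwise distance is computed exactly once, which pays off when the medoid search revisits the same pairs many times. A NumPy sketch under that assumption (the points and the cluster membership are illustrative):

```python
import numpy as np

points = np.random.rand(100, 2)  # illustrative 2-D points

# Pairwise Euclidean distances via broadcasting: entry (i, j) is the
# distance between points[i] and points[j], computed exactly once.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Most representative point of a cluster, given hypothetical member
# indices: the member minimizing total distance to the other members.
members = [0, 5, 17, 42]
sub = dist[np.ix_(members, members)]
medoid = members[int(sub.sum(axis=0).argmin())]
print(medoid)
```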
12
votes
5 answers

TypeError: list indices must be integers, not str Python

list[s] is a string. Why doesn't this work? The following error appears: TypeError: list indices must be integers, not str list = ['abc', 'def'] map_list = [] for s in list: t = (list[s], 1) map_list.append(t)
kerschi
  • 137
  • 1
  • 2
  • 10
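The bug in the question above: the loop variable s is already a string element of the list, so list[s] indexes the list with a string. Iterating a Python list yields elements, not indices; a corrected sketch (also renaming the variable list, which shadows the built-in):

```python
words = ['abc', 'def']  # renamed from "list" to avoid shadowing the built-in
map_list = []
for s in words:          # s is the element itself, not an index
    map_list.append((s, 1))
print(map_list)          # [('abc', 1), ('def', 1)]
```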
12
votes
2 answers

What is the Context keyword in the Hadoop programming world?

What exactly is the keyword Context in the Hadoop MapReduce world, in new-API terms? It's extensively used to write output pairs out of Map and Reduce, but I am not sure whether it can be used elsewhere and what exactly happens whenever I use…
Brijesh
  • 131
  • 1
  • 1
  • 5
12
votes
2 answers

Cassandra NOT EQUAL Operator

A question for all the Cassandra experts out there. I have a column family with about a million records. I would like to query these records in such a way that I can perform a not-equal-to kind of operation. I Googled this and it seems I…
Babu James
  • 2,740
  • 4
  • 33
  • 50
12
votes
2 answers

What default reducers are available in Elastic MapReduce?

I hope I'm asking this in the right way. I'm learning my way around Elastic MapReduce and I've seen numerous references to the "Aggregate" reducer that can be used with "Streaming" job flows. In Amazon's "Introduction to Amazon Elastic MapReduce"…
John
  • 3,430
  • 2
  • 31
  • 44
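One built-in Hadoop Streaming reducer is aggregate: the mapper prefixes each key with an aggregator name such as LongValueSum, and the framework-provided reducer applies that function per key. A minimal Python mapper sketch for a streaming job submitted with -reducer aggregate:

```python
#!/usr/bin/env python
import sys

# Mapper for Hadoop Streaming run with "-reducer aggregate": the
# LongValueSum prefix tells the aggregate reducer to sum the values
# emitted for each key.
for line in sys.stdin:
    for word in line.split():
        print("LongValueSum:%s\t1" % word)
```

Other prefixes (for example LongValueMax or UniqValueCount) select other built-in aggregations.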
12
votes
3 answers

How to import a custom module in a MapReduce job?

I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py …
ffriend
  • 27,562
  • 13
  • 91
  • 132
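Files shipped with -files are symlinked into each task's working directory, so a common fix is to make sure that directory is on Python's module search path before the import. A hedged sketch for the top of main.py; lib.process is a hypothetical function standing in for whatever lib.py actually exports:

```python
#!/usr/bin/env python
import os
import sys

# Files distributed with -files land in the task's current working
# directory; make sure it is searched before importing lib.py.
sys.path.insert(0, os.getcwd())

import lib  # the custom module shipped alongside main.py

for line in sys.stdin:
    # hypothetical call into the shipped module
    print(lib.process(line.rstrip("\n")))
```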
12
votes
5 answers

How the data is split in Hadoop

Does Hadoop split the data based on the number of mappers set in the program? That is, for a data set of size 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers simultaneously), is each mapper given…
HHH
  • 6,085
  • 20
  • 92
  • 164
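In short: the number of map tasks is driven by the input splits (by default roughly one per HDFS block), not by a mapper count set in the program. Under the common Hadoop 2.x default of a 128 MB block size, a 500 MB file yields four splits and hence about four mappers, no matter how many mapper slots the cluster offers. A toy calculation under those assumptions:

```python
import math

file_size_mb = 500
block_size_mb = 128   # common dfs.blocksize default on Hadoop 2.x

# One input split per block (the default behaviour, approximately):
splits = math.ceil(file_size_mb / block_size_mb)
print(splits)  # 4 map tasks, one per input split
```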
12
votes
1 answer

Incremental MapReduce implementations (other than CouchDB, preferably)

I work on a project that sits on a large-ish pile of raw data, aggregates from which are used to power a public-facing informational site (some simple aggregates like various totals and top-tens of totals, and some somewhat-more-complicated…
Andrew Pendleton
  • 761
  • 5
  • 13
12
votes
1 answer

MapReduce in MongoDB doesn't output

I was trying to use MongoDB 2.4.3 (I also tried 2.4.4) with mapReduce on a cluster with 2 shards, each with 3 replicas. I have a problem with the results of the mapReduce job not being reduced into the output collection. I tried an incremental map-reduce. I…
Mark
  • 1,181
  • 6
  • 18
12
votes
2 answers

"Failed to report status for 600 seconds. Killing!": reporting progress in Hadoop

I receive the following error for my map jobs: Task attempt_201304161625_0028_m_000000_0 failed to report status for 600 seconds. Killing! This question is similar to this, this, and this. However, I do not want to increase the default time…
Sam
  • 893
  • 8
  • 21
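In Hadoop Streaming, a task can signal liveness without raising the 600-second timeout by writing reporter lines to stderr; status and counter updates both count as progress. A minimal sketch for a long-running Python mapper:

```python
#!/usr/bin/env python
import sys

for i, line in enumerate(sys.stdin):
    # ... some long-running per-record work here ...
    if i and i % 1000 == 0:
        # Hadoop Streaming treats these stderr lines as progress reports,
        # resetting the task timeout clock.
        sys.stderr.write("reporter:status:processed %d records\n" % i)
        sys.stderr.write("reporter:counter:MyJob,Records,1000\n")
    print(line.rstrip("\n"))  # identity mapper: pass the record through
```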