Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
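As a minimal single-process sketch of the model described above, the classic word count can be written as one map function and one reduce function. The names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the sample documents, are illustrative assumptions and not part of any framework's API; a real implementation would distribute the work across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair, emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Made-up input data for illustration only.
docs = [("a.txt", "hello world"), ("b.txt", "hello hadoop world")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The grouping step in the middle is what a real framework performs as the shuffle between the map and reduce phases.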

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
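The requirement that all map outputs sharing a key reach the same reducer is commonly met by partitioning on a hash of the key. The sketch below illustrates that idea only; `partition` and `shuffle` are hypothetical helper names, and the CRC32-modulo scheme is an assumption chosen because it is deterministic across runs, not a description of any particular framework's partitioner.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Route each (key, value) pair to the bucket of its reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


# Made-up map output for illustration only.
pairs = [("hello", 1), ("world", 1), ("hello", 1), ("hadoop", 1)]
buckets = shuffle(pairs, num_reducers=3)

# Record which reducer owns each key; every key has exactly one owner.
owners = {key: i for i, bucket in enumerate(buckets) for key in bucket}
print(owners)
```

Because the reducer index depends only on the key, two mappers running on different machines send the same key to the same reducer without coordinating.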

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
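The two steps above can be sketched on one machine, with a thread pool standing in for the worker nodes. This is an illustration under stated assumptions: `split`, `worker`, and `master` are hypothetical names, the sub-problem (a sum of squares) is arbitrary, and threads are used only to keep the example self-contained, not to show real cluster distribution.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_chunks):
    """Chop the input into at most num_chunks roughly equal sub-problems."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """A worker solves its sub-problem: here, a partial sum of squares."""
    return sum(x * x for x in chunk)


def master(data, num_workers=4):
    """Distribute the chunks to workers, then combine their answers."""
    chunks = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)


total = master(list(range(10)))
print(total)
```

Combining by summation works here because addition is associative, so the result does not depend on how the input was chopped up.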

12151 questions
13
votes
2 answers

Compute first order derivative with MongoDB aggregation framework

Is it possible to calculate a first order derivative using the aggregation framework? For example, I have the data: {time_series : [10,20,40,70,110]}. I'm trying to obtain an output like: {derivative : [10,20,30,40]}
user666
13
votes
2 answers

Default number of reducers

In Hadoop, if we have not set the number of reducers, how many reducers will be created? The number of mappers depends on (total data size)/(input split size), e.g. if the data size is 1 TB and the input split size is 100 MB. Then number…
Mohit Jain
13
votes
5 answers

How do you use MapReduce/Hadoop?

I'm looking for some general information about how other people are using Hadoop or other MapReduce-like technologies. In general, I am curious as to whether you are writing MR applications to process existing data sets (like web server log files), or…
apavlo
13
votes
2 answers

To change replication factor of a directory in hadoop

Is there any way to change the replication factor of a directory in Hadoop when I expect the change to be applicable on the files which will be written to that directory in the future?
Anish Gupta
13
votes
7 answers

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?
George
13
votes
3 answers

MongoDB: What's the point of using MapReduce without parallelism?

Quoting http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Parallelism As of right now, MapReduce jobs on a single mongod process are single threaded. This is due to a design limitation in current JavaScript engines. We are looking…
netvope
13
votes
3 answers

DynamoDB: How to distribute workload over the month?

TL;DR I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of a month, I need to read all the rows written on the previous month and generate CSVs + statistics. How to work with DynamoDB in this scenario? How to…
barbolo
13
votes
5 answers

Spark: JavaRDD to JavaPairRDD<>

I have a JavaRDD of tuples and need to transform it to a JavaPairRDD. Currently I am doing it by simply writing a map function that just returns the input tuple as is. But I wonder if there is a better way?
YuliaSh.
13
votes
1 answer

Differences between hadoop jar and yarn -jar

What's the difference between running a jar file with the commands "hadoop jar" and "yarn -jar"? I've used the "hadoop jar" command on my Mac successfully, but I want to be sure that the execution is correct and runs in parallel on my four cores. Thanks!
mrcf
13
votes
1 answer

Load only particular field in PIG?

This is my file: Col1, Col2, Col3, Col4, Col5 I need only Col2 and Col3. Currently I'm doing this: a = load 'input' as (Col1:chararray, Col2:chararray, Col3:chararray, …
ComputerFellow
13
votes
4 answers

Java 8 MapReduce for distributed computing

It made me happy when I heard about parallelStream() in Java 8, which processes on multiple cores and finally gives back the result within a single JVM. No more lines of multithreading code. As far as I understand, this is valid for a single JVM only. But…
abishkar bhattarai
13
votes
1 answer

Why 'mapred-site.xml' is not included in the latest Hadoop 2.2.0?

The latest build of Hadoop provides mapred-site.xml.template. Do we need to create a new mapred-site.xml file using this? Any link to documentation or an explanation related to Hadoop 2.2.0 will be much appreciated.
pcdhan
13
votes
2 answers

how to sort word count by value in hadoop?

Hi, I want to learn how to sort the word count by value in Hadoop. I know Hadoop takes care of sorting keys, but not values. I know that to sort the values we must have a partitioner, a grouping comparator and a sort comparator, but I am a bit confused in applying…
user1585111
13
votes
5 answers

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob both locally and on…
Kiran Karanth
13
votes
1 answer

Output a list from a Hadoop Map Reduce job using custom writable

I'm trying to create a simple MapReduce job by changing the wordcount example given by Hadoop. I'm trying to output a list instead of a count of the words. The wordcount example gives the following output: hello 2 world 2. I'm trying to get it to…
triggs