Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
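The two steps above can be sketched with the classic word-count example. This is a minimal, single-process Python sketch of the programming model (not Hadoop code); the function names are illustrative, and the "shuffle" grouping that real frameworks do across the network is simulated here with a dictionary:

```python
from collections import defaultdict

def map_step(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_step(word, counts):
    # Reduce: merge all intermediate values sharing the same key.
    return (word, sum(counts))

def mapreduce(documents):
    # "Shuffle": group intermediate pairs by key, so each reduce
    # call sees all values for one key at the same time.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_step(doc):
            groups[key].append(value)
    return dict(reduce_step(k, v) for k, v in groups.items())

print(mapreduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because each `map_step` call touches only its own document and each `reduce_step` call touches only one key's values, both loops could be farmed out to independent worker nodes - which is exactly what Hadoop and similar frameworks do.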

12151 questions
13
votes
6 answers

A starting point for learning how to implement MapReduce/Hadoop in Python?

I've recently started getting into data analysis and I've learned quite a bit over the last year (at the moment, pretty much exclusively using Python). I feel the next step is to begin training myself in MapReduce/Hadoop. I have no formal computer…
iRoygbiv
  • 865
  • 2
  • 7
  • 21
13
votes
10 answers

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than…
Meredith L. Patterson
  • 4,853
  • 29
  • 30
13
votes
4 answers

Is it possible to write map/reduce jobs for Amazon Elastic MapReduce using .NET?

Is it possible to write map/reduce jobs for Amazon Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/) using .NET languages? In particular I would like to use C#. Preliminary research suggests not. The above URL's marketing text suggests you…
Chris
  • 9,986
  • 8
  • 48
  • 56
13
votes
3 answers

Hadoop: key and value are tab-separated in the output file. How to make them semicolon-separated?

I think the title already explains my question. I would like to change key (tab) value into key;value in all the output files the reducers generate from the mappers' output. I could not find good documentation on this using Google.…
Bob
  • 991
  • 8
  • 23
  • 40
13
votes
2 answers

Reuse JVM in Hadoop MapReduce jobs

I know we can set the property "mapred.job.reuse.jvm.num.tasks" to re-use JVM. My questions are: (1) how to decide the number of tasks to be set here, -1 or some other positive integers? (2) is it a good idea to already reuse JVMs and set this…
RecSys_2010
  • 275
  • 2
  • 4
  • 10
12
votes
3 answers

Distributed unit testing and code coverage in Python

My current project has a policy of 100% code coverage from its unit tests. Our continuous integration service will not allow developers to push code without 100% coverage. As the project has grown, so has the time to run the full test suite. While…
Joe Shaw
  • 22,066
  • 16
  • 70
  • 92
12
votes
3 answers

MapReduce for dummies

OK, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something. While an example showing how many…
RockyMountainHigh
  • 2,871
  • 5
  • 34
  • 68
12
votes
2 answers

Hadoop JobConf class is deprecated, need updated example

I am writing Hadoop programs, and I really don't want to use deprecated classes. Nowhere online can I find programs that use the updated org.apache.hadoop.conf.Configuration class instead of the org.apache.hadoop.mapred.JobConf class. …
CodeBanger
  • 201
  • 1
  • 3
  • 9
12
votes
8 answers

How to start learning Hadoop

I am a web developer. I have experience in web technologies like JavaScript, jQuery, PHP, and HTML. I know basic concepts of C. Recently I became interested in learning more about MapReduce and Hadoop, so I enrolled myself in parallel data…
yesh
  • 2,052
  • 4
  • 28
  • 51
12
votes
2 answers

MapReduce/Aggregate operations in SpringBatch

Is it possible to do MapReduce-style operations in Spring Batch? I have two steps in my batch job. The first step calculates the average. The second step compares each value with the average to determine another value. For example, let's say I have a huge…
Sathish
  • 20,660
  • 24
  • 63
  • 71
12
votes
3 answers

Iterate through Swift array and change values

I need to change values of a Swift array. My first try was to just iterate through, but this does not work, as I only get a copy of each element and the changes do not affect the original array. The goal is to have a unique "index" in each array…
Mike Nathas
  • 1,247
  • 2
  • 11
  • 29
12
votes
1 answer

MapReduce Linear Programming

Can a simple linear programming problem be solved on a distributed system using MapReduce?
Michael
  • 13,838
  • 18
  • 52
  • 81
12
votes
4 answers

Number of reducers in Hadoop

While learning Hadoop, I found the number of reducers very confusing: 1) The number of reducers is the same as the number of partitions. 2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). 3) Number of…
Mohit Jain
  • 357
  • 2
  • 7
  • 18
12
votes
5 answers

Spark java.lang.StackOverflowError

I'm using Spark to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine…
Khal Mei
  • 131
  • 1
  • 1
  • 5
12
votes
3 answers

What is the computational complexity of the MapReduce overhead?

Given that the complexities of the map and reduce tasks are O(map) = f(n) and O(reduce) = g(n), has anybody taken the time to write down how the MapReduce intrinsic operations (sorting, shuffling, sending data, etc.) increase the computational…
tonicebrian
  • 4,715
  • 5
  • 41
  • 65