Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, suited to certain kinds of distributable problems

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single commodity server could handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
34 votes, 7 answers

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like: db.receipts.findOne() { "_id" : ObjectId("4e57908c7a044a30dc03a888"), …
nelstrom
34 votes, 6 answers

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me. Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows: Write a mapper function that…
Chander Shivdasani
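The standard answer to this question leans on the fact that Hadoop sorts intermediate keys during the shuffle, so the mapper only has to make each integer the key. A minimal sketch with hypothetical class names; a total ordering across multiple reducers would additionally need a TotalOrderPartitioner.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parse each line as an integer and make it the key.
// The shuffle phase sorts keys, so the reducer sees them in order.
class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
    private final IntWritable number = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        number.set(Integer.parseInt(line.toString().trim()));
        context.write(number, NullWritable.get());
    }
}

// Reducer: identity; keys arrive already sorted. With a single reducer the
// output is totally ordered; with several, use a total-order partitioner.
class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        for (NullWritable v : values) {      // loop preserves duplicates
            context.write(key, NullWritable.get());
        }
    }
}
```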
32 votes, 8 answers

What type of problems can MapReduce solve?

Is there a theoretical analysis available which describes what kinds of problems MapReduce can solve?
amit-agrawal
31 votes, 1 answer

Accessing stream output from HDFS in MRjob

I'm trying to use a Python driver to run an iterative MRjob program. The exit criteria depend on a counter. The job itself seems to run. If I run a single iteration from the command line, I can then run hadoop fs -cat /user/myname/myhdfsdir/part-00000…
tony_tiger
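The question concerns MRjob, but the underlying pattern is plain Hadoop: re-submit the job and read a custom counter after each iteration until it signals convergence. A hypothetical Java driver loop, with made-up counter and path names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long updates = Long.MAX_VALUE;
        int iteration = 0;

        // Re-submit the job until the custom counter reports no more updates.
        while (updates > 0) {
            Job job = Job.getInstance(conf, "iteration-" + iteration);
            // ... setJarByClass / setMapperClass / setReducerClass omitted
            FileInputFormat.addInputPath(job, new Path("data/iter" + iteration));
            FileOutputFormat.setOutputPath(job, new Path("data/iter" + (iteration + 1)));
            job.waitForCompletion(true);

            // "CONVERGENCE:UPDATES" is a hypothetical counter that the reducer
            // would increment each time it changes a value.
            updates = job.getCounters()
                         .findCounter("CONVERGENCE", "UPDATES")
                         .getValue();
            iteration++;
        }
    }
}
```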
31 votes, 2 answers

Cognitive Complexity and its effect on the code

With respect to one of our Java projects, we recently started using SonarLint. The output of the code analysis shows too many critical code smell alerts. Critical code smell: Refactor this method to reduce its Cognitive Complexity. I have heard about…
vmorusu
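Cognitive Complexity penalizes nested control flow more heavily than flat statement counts, so the usual remedy is flattening with guard clauses. A hypothetical before/after in Java, unrelated to the asker's actual code:

```java
// Minimal stand-in type so the example compiles.
class User {
    private final boolean active;
    private final String role;
    User(boolean active, String role) { this.active = active; this.role = role; }
    boolean isActive() { return active; }
    boolean hasRole(String r) { return r.equals(role); }
}

class AccessCheck {
    // Before: three levels of nesting push Cognitive Complexity up fast.
    static String access(User user) {
        if (user != null) {
            if (user.isActive()) {
                if (user.hasRole("admin")) {
                    return "granted";
                } else {
                    return "denied";
                }
            } else {
                return "inactive";
            }
        } else {
            return "unknown";
        }
    }

    // After: guard clauses remove the nesting; each condition reads linearly.
    static String accessRefactored(User user) {
        if (user == null) return "unknown";
        if (!user.isActive()) return "inactive";
        return user.hasRole("admin") ? "granted" : "denied";
    }
}
```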
31 votes, 7 answers

20 Billion Rows/Month - Hbase / Hive / Greenplum / What?

I'd like to use your wisdom to pick the right solution for a data-warehouse system. Here are some details to better understand the problem: Data is organized in a star schema structure with one BIG fact table and ~15 dimensions. 20B fact rows…
Haggai
31 votes, 5 answers

What are some scenarios for which MPI is a better fit than MapReduce?

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate. In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of…
Igor ostrovsky
31 votes, 1 answer

Hadoop speculative task execution

In Google's MapReduce paper, they describe backup tasks; I think this is the same thing as speculative tasks in Hadoop. How is a speculative task implemented? When I start a speculative task, does the task start from the very beginning as the older and…
lil
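For context: in Hadoop, a speculative attempt starts from the beginning on the same input split rather than resuming the slow attempt's progress, and whichever attempt finishes first wins while the other is killed. Speculation can be toggled per job; a minimal sketch using the Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow duplicate (speculative) attempts for straggling map tasks...
        conf.setBoolean("mapreduce.map.speculative", true);
        // ...but not for reduce tasks, where re-running is more expensive.
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "speculative-demo");
        // ... remaining job setup omitted
    }
}
```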
31 votes, 6 answers

No such method exception Hadoop

When I am running a Hadoop .jar file from the command prompt, it throws an exception saying there is no such method for StockKey. StockKey is my custom class defined for my own type of key. Here is the exception: 12/07/12 00:18:47 INFO mapred.JobClient:…
London guy
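A frequent cause of this error with custom key types is a missing public no-argument constructor: Hadoop instantiates keys reflectively during deserialization. A sketch of a well-formed key; the fields chosen for StockKey here are invented for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key. Hadoop creates key instances via reflection,
// so the public no-arg constructor is mandatory.
public class StockKey implements WritableComparable<StockKey> {
    private String symbol = "";
    private long timestamp;

    public StockKey() { }                       // required by Hadoop

    public StockKey(String symbol, long timestamp) {
        this.symbol = symbol;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(StockKey other) {
        int c = symbol.compareTo(other.symbol);
        return c != 0 ? c : Long.compare(timestamp, other.timestamp);
    }

    // equals() and hashCode() should also be overridden (hashCode drives
    // the default HashPartitioner); omitted here for brevity.
}
```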
30 votes, 4 answers

What is the purpose of "uber mode" in hadoop?

Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is, and the more I searched the more confused I got. Can anybody please help me by answering my questions? What does uber mode do? Does it work differently in…
Mohammed Asad
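In short, uber mode lets YARN run all tasks of a sufficiently small job inside the MRApplicationMaster's own JVM, avoiding the overhead of scheduling separate containers. A minimal sketch of the relevant switches (Hadoop 2.x property names; the thresholds shown are the usual defaults):

```java
import org.apache.hadoop.conf.Configuration;

public class UberModeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Let small jobs run entirely inside the MRAppMaster JVM.
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        // Thresholds a job must stay under to qualify as "uber":
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
        // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size.
    }
}
```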
29 votes, 4 answers

Change File Split size in Hadoop

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge. That is, a 64 MB file, which is the default split size for TextInputFormat, would take even…
Ahmadov
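One relevant knob, sketched below: capping the maximum split size makes a single file yield several splits, and therefore several mappers, provided the input format is splittable. The job name and size here are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-splits");
        // Cap each split at 16 MB so a 64 MB file becomes four splits,
        // giving four map tasks instead of one.
        FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024L);
        // Equivalent property: mapreduce.input.fileinputformat.split.maxsize
    }
}
```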
29 votes, 1 answer

Map-Reduce performance in MongoDb 2.2, 2.4, and 2.6

I've found this discussion: MongoDB: Terrible MapReduce Performance. Basically it says to avoid Mongo's MR queries, as MR is single-threaded and not meant for real-time use at all. Two years have passed, and I wonder what has changed since…
YMC
28 votes, 8 answers

MapReduce implementation in Scala

I'd like to find a good, robust MapReduce framework that can be used from Scala.
Roman Kagan
28 votes, 2 answers

Rolling your own reduceByKey in Spark Dataset

I'm trying to learn to use DataFrames and DataSets more in addition to RDDs. For an RDD, I know I can do someRDD.reduceByKey((x,y) => x + y), but I don't see that function for Dataset. So I decided to write one. someRdd.map(x =>…
Carlos Bribiescas
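The closest built-in on a Dataset is groupByKey followed by reduceGroups. A hedged Java sketch with invented sample data; note that reduceGroups returns a Dataset of (key, value) tuples:

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class ReduceByKeyOnDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("reduceByKey-on-Dataset")
                .master("local[*]")
                .getOrCreate();

        // Invented sample data: (word, count) pairs.
        Dataset<Tuple2<String, Integer>> pairs = spark.createDataset(
                Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)),
                Encoders.tuple(Encoders.STRING(), Encoders.INT()));

        // groupByKey + reduceGroups plays the role of RDD.reduceByKey.
        Dataset<Tuple2<String, Tuple2<String, Integer>>> reduced = pairs
                .groupByKey((MapFunction<Tuple2<String, Integer>, String>) t -> t._1(),
                            Encoders.STRING())
                .reduceGroups((ReduceFunction<Tuple2<String, Integer>>) (a, b) ->
                        new Tuple2<>(a._1(), a._2() + b._2()));

        reduced.show();
        spark.stop();
    }
}
```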
28 votes, 10 answers

IllegalAccessError to guava's StopWatch from org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus

I'm trying to run a small Spark application and am getting the following exception: Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class…
Lika