Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled — assuming the input data is still available.
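The "same key, same reducer" requirement is usually met by partitioning intermediate keys deterministically, typically by hashing. A minimal sketch of the idea (the `partition` function and its byte-sum hash are illustrative stand-ins, not Hadoop's actual `Partitioner` API):

```python
def partition(key: str, num_reducers: int) -> int:
    # Illustrative stand-in for a partitioner: a stable byte-sum hash.
    # Hadoop's default HashPartitioner uses
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the point here
    # is only that the mapping is deterministic, so every record sharing a
    # key lands on the same reducer.
    return sum(key.encode()) % num_reducers
```

Because the function is deterministic, two mappers on different nodes that emit the same key will always route their output to the same reducer index.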

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
25
votes
5 answers

Checksum Exception when reading from or copying to HDFS in Apache Hadoop

I am trying to implement a parallelized algorithm using Apache Hadoop; however, I am facing some issues when trying to transfer a file from the local file system to HDFS. A checksum exception is being thrown when trying to read from or transfer a…
lvella
  • 419
  • 1
  • 5
  • 11
24
votes
5 answers

How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."

I wrote a MapReduce job to extract some info from a dataset. The dataset is users' ratings of movies. The number of users is about 250K and the number of movies is about 300K. The output of map is *> and…
user572138
  • 463
  • 4
  • 6
  • 13
24
votes
5 answers

What are the options for Hadoop on Scala?

We are starting a big-data analytics project and are considering adopting Scala (Typesafe stack). I would like to know the various Scala APIs/projects available for writing Hadoop MapReduce programs.
prassee
  • 3,651
  • 6
  • 30
  • 49
23
votes
5 answers

MapReduce alternatives

Are there any alternative paradigms to MapReduce (Google, Hadoop)? Is there any other reasonable way to split & merge big problems?
Cartesius00
  • 23,584
  • 43
  • 124
  • 195
23
votes
3 answers

Application failed 2 times due to AM Container: exited with exitCode: 1

I ran a MapReduce job on hadoop-2.7.0 but the job could not be started, and I encountered the error below: Job job_1491779488590_0002 failed with state FAILED due to: Application application_1491779488590_0002 failed 2 times due to AM Container…
Erfan Farhangy
  • 449
  • 2
  • 7
  • 14
23
votes
3 answers

Why Is a Block in HDFS So Large?

Can somebody explain this calculation and give a lucid explanation? A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size…
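The arithmetic the excerpt refers to can be completed from the numbers it quotes (a rough sizing sketch, not an HDFS rule):

```python
seek_time_s = 0.010        # 10 ms average disk seek, as quoted
transfer_rate_Bps = 100e6  # 100 MB/s sustained transfer, as quoted

# For the seek to cost only 1% of the transfer time, the transfer
# must take 100x the seek time, so:
block_size_bytes = transfer_rate_Bps * (seek_time_s / 0.01)
# -> 100 MB, which is why HDFS blocks are so large compared to
# ordinary filesystem blocks
```

The result motivates HDFS's large default block size; reading a block then spends almost all its time streaming data rather than seeking.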
Kumar
  • 949
  • 1
  • 13
  • 23
23
votes
4 answers

Reading HDFS and local files in Java

I want to read file paths irrespective of whether they are HDFS or local. Currently, I pass the local paths with the prefix file:// and HDFS paths with the prefix hdfs:// and write some code as the following Configuration configuration = new…
Venk K
  • 1,157
  • 5
  • 14
  • 25
23
votes
4 answers

What is the use of a grouping comparator in Hadoop MapReduce?

I would like to know why a grouping comparator is used in the secondary sort of MapReduce. According to the secondary sorting example in the Definitive Guide, we want the sort order for keys to be by year (ascending) and then by temperature (descending): 1900…
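The mechanism behind that question can be simulated outside Hadoop: sort composite keys by (year ascending, temperature descending), then group by year alone, which is exactly the grouping comparator's job. A small sketch with made-up data:

```python
from itertools import groupby

# Composite keys: (year, temperature). Sort by year ascending,
# temperature descending -- the role of the sort comparator.
records = [(1900, 35), (1900, 34), (1901, 36), (1900, 36)]
records.sort(key=lambda r: (r[0], -r[1]))

# A grouping comparator that compares only the year sends all records
# for one year into a single reduce() call; because of the sort order,
# the first value each reducer sees is that year's maximum.
max_per_year = {year: next(group)[1]
                for year, group in groupby(records, key=lambda r: r[0])}
# max_per_year == {1900: 36, 1901: 36}
```

Without the year-only grouping, each distinct (year, temperature) pair would reach the reducer as a separate group, defeating the purpose of the composite key.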
Pramod
  • 493
  • 1
  • 8
  • 16
22
votes
7 answers

Hadoop Streaming job failed error in Python

From this guide, I have successfully run the sample exercise. But on running my mapreduce job, I am getting the following error ERROR streaming.StreamJob: Job not Successful! 10/12/16 17:13:38 INFO streaming.StreamJob: killJob... Streaming Job…
db42
  • 4,474
  • 4
  • 32
  • 36
22
votes
2 answers

RavenDB Map-Reduce Example using .NET Client

I'm looking for an example of how to implement and use Map-Reduce within the RavenDB .NET Client. I'd like to apply it to a specific scenario: generating unique and total visitor counts. A sample document that would be stored within RavenDB:…
user111013
22
votes
5 answers

Run Hadoop job without using JobConf

I can't find a single example of submitting a Hadoop job that does not use the deprecated JobConf class. JobClient, which hasn't been deprecated, still only supports methods that take a JobConf parameter. Can someone please point me at an example…
Greg Cottman
  • 658
  • 1
  • 5
  • 7
22
votes
5 answers

Hadoop MapReduce secondary sorting

Can anyone explain to me how secondary sorting works in Hadoop? Why must one use a GroupingComparator, and how does it work in Hadoop? I was going through the link given below and have a doubt about how the GroupingComparator works. Can anyone explain to me how…
user1585111
  • 1,019
  • 6
  • 19
  • 35
22
votes
2 answers

Renaming Part Files in Hadoop Map Reduce

I have tried to use the MultipleOutputs class as per the example in page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html Driver Code Configuration conf = new…
Arun A K
  • 2,205
  • 2
  • 27
  • 45
22
votes
5 answers

Hadoop MapReduce provide nested directories as job input

I'm working on a job that processes a nested directory structure, containing files on multiple levels: one/ ├── three/ │   └── four/ │   ├── baz.txt │   ├── bleh.txt │   └── foo.txt └── two/ ├── bar.txt └── gaa.txt When I add…
sa125
  • 28,121
  • 38
  • 111
  • 153
21
votes
6 answers

Calling a mapreduce job from a simple java program

I have been trying to call a MapReduce job from a simple Java program in the same package. I tried to reference the MapReduce jar file in my Java program and call it using the runJar(String args[]) method, also passing the input and output paths for…
Ravi Trivedi
  • 527
  • 1
  • 5
  • 12