Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, applicable to certain kinds of distributable problems.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
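For word counting, the classic example, the two user-specified functions can be sketched as follows. This is a minimal illustration in plain Python, not Hadoop's actual API; the function names and signatures are made up for the sketch:

```python
# Word count sketch of the two user-specified functions in the MapReduce model.
# Plain Python for illustration; names and signatures are not Hadoop's API.

def map_fn(doc_id, text):
    """Map: process one input (key, value) pair, emit intermediate pairs."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)
```

The framework, not the user, is responsible for grouping every `(word, 1)` pair by key before the reduce function runs.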

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase in parallel; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time.

While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.
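The routing requirement, same key to the same reducer, is typically met by hashing the key modulo the number of reducers (Hadoop's default HashPartitioner works this way). A sketch, where the function name and reducer count are illustrative:

```python
import zlib

NUM_REDUCERS = 4  # illustrative; in Hadoop this would be the number of reduce tasks

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Pick the reducer for a key: a stable hash modulo the reducer count.

    Because the hash is deterministic, every map output with a given key
    lands on the same reducer, which is all the model requires.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```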

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
3 votes • 0 answers

Hadoop: MapReduce job giving java library error

When I am running any MapReduce job in the Cloudera VM, the warning below occurs 4-5 times in a continuous manner. Please let me know how to fix it. 16/11/06 00:47:38 WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at…

Ramkrushna26 • 125 • 1 • 7
3 votes • 1 answer

Hadoop mapreduce task failing with 143

I am currently learning to use Hadoop mapred and have come across this error: packageJobJar: [/home/hduser/mapper.py, /home/hduser/reducer.py, /tmp/hadoop-unjar4635332780289131423/] [] /tmp/streamjob8641038855230304864.jar tmpDir=null 16/10/31…

hudsond7 • 666 • 8 • 25
3 votes • 4 answers

Why do we need setup() method in MapReduce when we can initialize parameters in map() or reduce()?

I am new to Hadoop and to the MapReduce paradigm overall. I searched a lot on the web about overriding the setup() method in the Map class to access the configuration object. But from what I read, it seems that the setup() method is called anyway every…

GDams • 33 • 3
3 votes • 1 answer

Different output while running mapreduce on local machine in IDEA and in hadoop on cluster

The problem is what it says in the description. I have some code. This is the reducer. public class RTopLoc extends Reducer { private static int number = 0; private static CompositeKey lastCK = new…

gjin • 860 • 1 • 14 • 28
3 votes • 2 answers

How to configure hadoop's mapper so that it takes

I'm using two mappers and two reducers. I'm getting the following error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text This is because the first reducer writes and…

Hernan • 1,149 • 1 • 11 • 29
3 votes • 3 answers

What is the simplest way of parallelization over a cluster with SSH and NFS?

I have a lot of trivially parallelizable computations and a lot (100s) of cores distributed over an SSH + NFS network. What is the simplest way to parallelize? The problem is that I don't know how long each task will take, so I need some kind of…

Łukasz Lew • 48,526 • 41 • 139 • 208
3 votes • 1 answer

hive aggregate query takes wrong value from cache

I am running an aggregate query in a hive session. hive>select count(1) from table_name; For the first time it runs the mapreduce program and returns the result. But for consecutive runs later in the day it returns the same count from the cache (though the table is…

sumitya • 2,631 • 1 • 19 • 32
3 votes • 1 answer

Hadoop Text Comparison not working

Below is the code for a Hadoop Reducer; I am not able to understand why the comparison (placed between slashes) is always failing, given that we are comparing two Text type values. This code is for a Reducer doing Inverted Indexing. public static class…

Sonu Patidar • 701 • 1 • 5 • 10
3 votes • 0 answers

We are facing below exception while running MapReduce program to parse Blockchain file (encrypted file .DAT)

We have executed the below hadoop command and the program uses the MapReduce API. When we run the program it throws an Exception even though the file is present at the input location. Please guide us. [cloudera@quickstart mapreduce-bitcoinblock-1.0]$…

Dayanand • 31 • 2
3 votes • 0 answers

MapReduce advanced algorithm on Reversed Web-Link Graph (from google paper)

I was poking around Hadoop MapReduce after reading the paper from google: MapReduce: Simplified Data Processing on Large Clusters. I worked on the Reversed Web-Link Graph because it seems interesting, and actually it was quite easy when I coded it. So…
3 votes • 3 answers

Extracting a list of substrings from MongoDB using a Regular Expression

I need to extract a part of a string that matches a regex and return it. I have a set of documents such as: {"_id" :12121, "fileName" : "apple.doc"}, {"_id" :12125, "fileName" : "rap.txt"}, {"_id" :12126, "fileName" : "tap.pdf"}, {"_id" :12126,…

Macky • 433 • 2 • 9 • 22
3 votes • 0 answers

Yarn app timeout and no error

I am running a map-reduce job triggered by YARN's REST API. The yarn app starts and triggers another map-reduce job, but the yarn app times out at around exactly 12 mins. This is the final log where it ends: 2016-09-01 13:22:53 DEBUG…

spiralarchitect • 880 • 7 • 19
3 votes • 4 answers

MapReduce algorithm to find continuous sequence

I need to write a small map/reduce function that should return 1 if there are at least 4 continuous 0s in an array; otherwise it should return something different than 1. Example: [6, 7, 5, 0, 0, 0, 0, 3, 4, 0, 0, 1] => 1 [6, 7, 5, 0, 0, 0, 3, 4, 0,…

James • 13,571 • 6 • 61 • 83
3 votes • 1 answer

Is Tez always better than MR as Hive execution engine?

Is it true that, generally, for smaller queries (expecting results interactively, in minutes rather than hours) Tez performs better, and for batch queries (taking hours) MR performs better as an execution engine? Or can we say that irrespective of…

Dhiraj • 3,396 • 4 • 41 • 80
3 votes • 1 answer

Spark flat map function is throwing "OutOfMemory"

I have the below implementation in MapReduce and it is working fine. Now I am trying to port it to Spark using FlatMapFunction, but this function throws an out of memory error. MapReduce: String[] hexList = input.toString().split(","); int…

Ajeet • 675 • 1 • 6 • 20