Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time.

While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
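The routing rule above - all map outputs that share a key go to the same reducer - is usually implemented with a partition function over the key. A minimal sketch in Python (the function name and reducer count are illustrative, not Hadoop's API):

```python
def partition(key, num_reducers):
    # Deterministic within a run: every occurrence of a key is
    # routed to the same reducer index.
    return hash(key) % num_reducers

# Intermediate (key, value) pairs emitted by the map phase.
pairs = [("apple", 1), ("pear", 1), ("apple", 1)]

# Shuffle: group pairs into per-reducer buckets.
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition(key, 4), []).append((key, value))

# Both ("apple", 1) pairs land in the same bucket, so a single
# reducer sees every value for "apple".
```

Hadoop's default HashPartitioner follows the same idea: the key's hash code modulo the number of reduce tasks.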

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
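The two steps above can be sketched as a single-process Python word count (illustrative only - a real framework distributes the work and performs the shuffle between the phases):

```python
from collections import defaultdict

def map_step(document):
    # Emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word, 1)

def reduce_step(key, values):
    # Merge all intermediate values that share the same key.
    return (key, sum(values))

def map_reduce(documents):
    # "Shuffle": group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_step(doc):
            grouped[key].append(value)
    return dict(reduce_step(k, v) for k, v in grouped.items())

print(map_reduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

Here the `defaultdict` grouping stands in for the shuffle phase; each call to `map_step` is independent, which is what makes the map phase parallelizable.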

12151 questions
3
votes
1 answer

Hadoop MapReduce - Pig/Cassandra - Unable to create input splits

I'm trying to run a MapReduce job with Pig and Cassandra, and I always get the error: ERROR 2118: Unable to create input splits for: cassandra://constellation/logs [SOLVED] There were some environment variables I had missed setting: PIG_RPC_PORT,…
Christoph
  • 1,113
  • 5
  • 17
  • 35
3
votes
2 answers

Get count of product attributes from MongoDB

I have a mongo collection of products with attributes: { "_id" : ObjectId("5888a2860c001d31a1089958"), "product_id" : "107", "store_id" : 0, "attributes" : [{ "key" : "m", "value" : 21, "label" : "Mothercare" }, { …
gvozd1989
  • 300
  • 1
  • 16
3
votes
2 answers

How to separately specify a set of nodes for HDFS and others for MapReduce jobs?

While deploying Hadoop, I want some set of nodes to run the HDFS server but not any MapReduce tasks. For example, there are two nodes, A and B, that run HDFS. I want to exclude node A from running any map/reduce task. How can I achieve this?…
syko
  • 3,477
  • 5
  • 28
  • 51
3
votes
1 answer

Different ways of Starting a MapReduce Job

What is the difference between starting a MapReduce job in Apache Hadoop using simply the job.waitForCompletion(true) method and via ToolRunner.run(new MyClass(), args)? I have a MapReduce job executed in the following two ways: First as…
KayV
  • 12,987
  • 11
  • 98
  • 148
3
votes
2 answers

RavenDB: Why do I get null-values for fields in this multi-map/reduce index?

Inspired by Ayende's article https://ayende.com/blog/89089/ravendb-multi-maps-reduce-indexes, I have the following index, that works as such: public class Posts_WithViewCountByUser :…
Frederik Struck-Schøning
  • 12,981
  • 8
  • 59
  • 68
3
votes
4 answers

Map-Reduce : How to count in a collection

Important edit: I can't use filter - the purpose is pedagogic. I have an array in which I want to count the number of elements that satisfy a boolean predicate, using only map and reduce. For the count of the array's size, I already wrote something that…
JarsOfJam-Scheduler
  • 2,809
  • 3
  • 31
  • 70
3
votes
0 answers

AWS EMR Hadoop Mapreduce physical memory limit error

I keep getting this error when running some of my steps: Container [pid=5784,containerID=container_1482150314878_0019_01_000015] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 10.2 GB of 27.5 GB…
refaelos
  • 7,927
  • 7
  • 36
  • 55
3
votes
0 answers

mrjob combiner not working python

A simple map-combine-reduce program: map column 1 with the value of column 3, append '+' to each mapper output of the same key, and append '-' after the reduce output of the same key. The files input_1 and input_2 both contain: a 1 2 3 a 4 5 6 Code is from mrjob.job…
piyush-balwani
  • 524
  • 3
  • 15
3
votes
1 answer

Loading more records than actual in Hive

While inserting from one Hive table into another Hive table, it is loading more records than the actual records. Can anyone help with this weird behaviour of Hive? My query looks like this: insert overwrite table_a select col1,col2,col3,... from…
3
votes
2 answers

MongoDB MapReduce, return only when count > 1

I have data in MongoDB. The structure of one object is like this: { "_id" : ObjectId("5395177980a6b1ccf916312c"), "institutionId" : "831", "currentObject" : { "systemIdentifiers" : [ { "value" :…
3
votes
2 answers

MapReduce job to yield top 10 values using Python's MRjob

I want this map reduce job (code below) to output the top 10 most rated products. It keeps giving me the following error message: it = izip(iterable, count(0,-1)) # decorate TypeError: izip argument #1 must support iteration. I'm…
Ije
  • 43
  • 1
  • 7
3
votes
0 answers

Load JSON into MrJob - Python

I've been trying to load a JSON data file into mrjob, but can't really get it to work. from mrjob.job import MRJob from mrjob.protocol import JSONProtocol def type_hashing(entry): return entry[13].lower() class ReduceData(MRJob): …
Syspect
  • 921
  • 7
  • 22
  • 50
3
votes
0 answers

Hadoop - What does globally sorted mean and when does it happen in MapReduce?

I am using the Hadoop streaming JAR for WordCount, and I want to know how I can get a global sort. According to an answer to another question on SO, when we use just one reducer we can get a globally sorted output, but in my result with numReduceTasks=1…
Saeed Rahmani
  • 650
  • 1
  • 8
  • 29
3
votes
3 answers

FAILED Error: java.io.IOException: Initialization of all the collectors failed

I am getting some error while running my MapReduce WordCount job. Error: java.io.IOException: Initialization of all the collectors failed. Error in last collector was :class wordcount.wordmapper at …
3
votes
1 answer

Can't insert new data in HBase when using Delete and Put at same time

I am using HBase MapReduce to calculate a report. In the reducer, I try to clear the 'result' column family and then add a new 'total' column. But I find that the column family is deleted while the new data is not inserted; the Put action seems not to work.…
B.H.
  • 107
  • 9