Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice the parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
3
votes
3 answers

Set multiple prefix row filters on an HBase scanner in Java

I want to create one scanner that will give me results with 2 prefix filters. For example, I want all the rows whose key starts with the string "x" or with the string "y". Currently I know how to do it only with one prefix, with the following…
MosheCh
  • 99
  • 3
  • 12
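
For the HBase question above, one plausible approach (hedged, since the excerpt is truncated) is to OR two PrefixFilters together with a FilterList; the prefixes and the table handling are illustrative:

    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Build one scanner whose filter passes rows starting with "x" OR "y".
    ResultScanner scanWithTwoPrefixes(Table table) throws java.io.IOException {
      // MUST_PASS_ONE makes the list behave like a logical OR.
      FilterList orFilter = new FilterList(FilterList.Operator.MUST_PASS_ONE,
          new PrefixFilter(Bytes.toBytes("x")),   // rows whose key starts with "x"
          new PrefixFilter(Bytes.toBytes("y")));  // ...or with "y"
      Scan scan = new Scan();
      scan.setFilter(orFilter);
      return table.getScanner(scan);
    }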
3
votes
1 answer

Javascript - Find (not remove) elements with duplicate properties in an object array

I want to fetch an object array and check whether multiple properties are duplicated at the same time. Finally, the duplicate elements are meant to be alerted. I.e. for the array: [ { "language" : "english", "type" : "a", "value" : "value1" …
vahdet
  • 6,357
  • 9
  • 51
  • 106
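
The question above is JavaScript-specific, but the underlying pattern is a map/reduce-style group-by on a composite key. A sketch of the same idea in Java (used here for consistency with the rest of this tag), with a hypothetical Item record standing in for the array elements:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    record Item(String language, String type, String value) {}

    public class FindDuplicates {
      public static void main(String[] args) {
        List<Item> items = List.of(
            new Item("english", "a", "value1"),
            new Item("english", "a", "value2"),
            new Item("french", "b", "value3"));

        // "Map" phase: key each element by the properties that must match together.
        // "Reduce" phase: group by that key and keep only keys seen more than once.
        Map<String, List<Item>> dupes = items.stream()
            .collect(Collectors.groupingBy(i -> i.language() + "|" + i.type()))
            .entrySet().stream()
            .filter(e -> e.getValue().size() > 1)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        dupes.values().forEach(System.out::println); // the duplicate elements
      }
    }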
3
votes
1 answer

MongoDB - how do I turn this group() query into map/reduce?

I have a collection where each document looks like this {access_key:'xxxxxxxxx', keyword: "banana", count:12, request_hour:"Thu Sep 30 2010 12:00:00 GMT+0000 (UTC)"} {access_key:'yyyyyyyyy', keyword: "apple", count:25, request_hour:"Thu Sep 30 2010…
rubayeet
  • 9,269
  • 8
  • 46
  • 55
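
In MongoDB, the map and reduce functions are JavaScript evaluated server-side; from Java they are passed as strings. A minimal sketch of a per-keyword count via the MongoDB Java driver's mapReduce helper; the database, collection, and field names are assumptions based on the truncated excerpt:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class KeywordCounts {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> coll =
              client.getDatabase("test").getCollection("requests"); // assumed names

          // Map: emit (keyword, count); Reduce: sum the counts per keyword.
          String map = "function() { emit(this.keyword, this.count); }";
          String reduce = "function(key, values) { return Array.sum(values); }";

          for (Document doc : coll.mapReduce(map, reduce)) {
            System.out.println(doc.toJson()); // {_id: keyword, value: total}
          }
        }
      }
    }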
3
votes
2 answers

File compression formats and container file formats

It is generally said that any compression format like Gzip, when used along with a container file format like Avro or SequenceFile, will make the compression splittable. Does this mean that the blocks in the container format get…
Marco99
  • 1,639
  • 1
  • 19
  • 32
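
Roughly, yes: with block compression the codec is applied to batches of records inside the container rather than to the whole file, so the file remains splittable at the container's sync points even with a non-splittable codec like Gzip. A sketch of writing a block-compressed SequenceFile (the path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class WriteBlockCompressed {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq"); // illustrative path
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // BLOCK compression gzips batches of records inside the container,
        // so input splits can still start at container sync points.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(IntWritable.class),
            SequenceFile.Writer.compression(CompressionType.BLOCK, codec))) {
          writer.append(new Text("banana"), new IntWritable(12));
          writer.append(new Text("apple"), new IntWritable(25));
        }
      }
    }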
3
votes
2 answers

Using Hadoop Counters - Multiple jobs

I am working on a mapreduce project using Hadoop. I currently have 3 sequential jobs. I want to use Hadoop counters, but the problem is that I want to make the actual count in the first job, but access the counter value in the reducer of the 3rd…
A. Sarid
  • 3,916
  • 2
  • 31
  • 56
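
A common pattern for this (one way to do it, not necessarily what the asker ended up with): read the counter off the completed first job in the driver, pass the value through the third job's Configuration, and pick it up in that job's reducer. The counter group/name and the configuration key are illustrative:

    // In the driver, after the first job completes:
    job1.waitForCompletion(true);
    long count = job1.getCounters()
        .findCounter("MyCounters", "RECORDS")   // illustrative group/name
        .getValue();

    // Hand the value to the third job through its Configuration.
    job3.getConfiguration().setLong("myapp.first.job.count", count);

    // In the third job's reducer, read it back once per task:
    @Override
    protected void setup(Context context) {
      long firstJobCount = context.getConfiguration()
          .getLong("myapp.first.job.count", 0L);
      // ... use firstJobCount during reduce ...
    }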
3
votes
2 answers

Chaining of mapreduce jobs

I came across "chaining of mapreduce jobs." Being new to mapreduce, under what circumstances do we have to chain (I am assuming chaining means running mapreduce jobs one after the other sequentially) jobs? And are there any examples that could…
sutterhome1971
  • 380
  • 1
  • 9
  • 22
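
Chaining usually just means the driver runs the jobs sequentially, wiring one job's output directory to the next job's input. A minimal driver sketch (paths and job configuration are illustrative; the mapper/reducer setup is elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of job 1, input of job 2
        Path output = new Path(args[2]);

        Job first = Job.getInstance(conf, "first pass");
        // ... set mapper/reducer/key/value classes for the first job ...
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1); // stop if job 1 fails

        Job second = Job.getInstance(conf, "second pass");
        // ... set mapper/reducer/key/value classes for the second job ...
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
      }
    }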
3
votes
3 answers

mapreduce composite Key sample - doesn't show the desired output

Being new to the mapreduce & hadoop world, after trying out basic mapreduce programs, I wanted to try the composite key sample code. The input dataset is as…
sutterhome1971
  • 380
  • 1
  • 9
  • 22
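
A composite key in Hadoop is typically a custom WritableComparable that serializes two fields and compares on both; when the sort order looks wrong, compareTo (and hashCode, which the default HashPartitioner uses) are the usual suspects. A minimal sketch with illustrative field names, since the question's dataset is truncated:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // A two-field key: records are grouped/sorted by country, then state.
    public class CompositeKey implements WritableComparable<CompositeKey> {
      private String country = "";
      private String state = "";

      public CompositeKey() {} // required no-arg constructor

      public CompositeKey(String country, String state) {
        this.country = country;
        this.state = state;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(country);
        out.writeUTF(state);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        country = in.readUTF();
        state = in.readUTF();
      }

      @Override
      public int compareTo(CompositeKey other) {
        int cmp = country.compareTo(other.country);
        return (cmp != 0) ? cmp : state.compareTo(other.state);
      }

      @Override
      public boolean equals(Object o) {
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey k = (CompositeKey) o;
        return country.equals(k.country) && state.equals(k.state);
      }

      @Override
      public int hashCode() { // used by the default HashPartitioner
        return country.hashCode() * 163 + state.hashCode();
      }

      @Override
      public String toString() { // what TextOutputFormat prints
        return country + "\t" + state;
      }
    }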
3
votes
0 answers

PIG: how does the map-side join work?

In MapReduce, the requirements for a map-side join are: data should be partitioned and sorted in a particular way; each input should be divided into the same number of partitions; and each must be sorted on the same key. All the records for a particular key must…
drwho2
  • 75
  • 1
  • 7
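
The requirements listed in that excerpt are what Hadoop's join library (CompositeInputFormat) relies on; Pig builds the equivalent plan for you when you request a merge join. A sketch of configuring such a map-side join in plain MapReduce, with illustrative paths:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

    // In the driver. Both inputs must already be sorted on the join key and
    // split into the same number of identically partitioned files; that is
    // what lets the join happen on the map side, with no shuffle.
    Job job = Job.getInstance(conf, "map-side join");
    job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR,
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            new Path("/data/a"), new Path("/data/b")));
    job.setInputFormatClass(CompositeInputFormat.class);
    // The mapper then receives (key, TupleWritable) pairs, where the tuple
    // holds the matching record from each input.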
3
votes
1 answer

Retrieve Filename of current line in Mapper

I am using Hadoop version 2.6.4. I was writing a MapReduce job which would take 3 arguments, namely a keyword, the path to the input files, and the output path. My ideal output should be the names of all those files containing the keyword. The simple logic would be…
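
The standard way to get the current file's name inside a mapper is through the input split. A sketch of such a mapper (it assumes a file-based input format such as TextInputFormat, and a hypothetical "keyword" configuration property set by the driver):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class KeywordFileMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Each map task works on one split of one file, so the split knows the file.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String keyword = context.getConfiguration().get("keyword"); // hypothetical property
        if (keyword != null && value.toString().contains(keyword)) {
          context.write(new Text(fileName), NullWritable.get()); // emit matching file names
        }
      }
    }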
3
votes
0 answers

Reducer loop strange behavior

I'm new to mapreduce. I'm trying to do a join of two different types of lines from two different CSV files. The map is OK: I load the two files A and B and match the lines that I want with the same key. In the reducer I am having very strange behavior…
oriolfm14
  • 31
  • 3
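
The excerpt is truncated, but the most common cause of strange behavior in a reduce-side join is that Hadoop reuses one Writable instance for every value in the reducer's iterator, so cached references all end up pointing at the last value. Copy each value before storing it; the A/B tagging scheme below is illustrative:

    // Inside a Reducer<Text, Text, Text, Text>:
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws java.io.IOException, InterruptedException {
      java.util.List<Text> fromA = new java.util.ArrayList<>();
      java.util.List<Text> fromB = new java.util.ArrayList<>();
      for (Text val : values) {
        // 'val' is the SAME object on every iteration; copy before caching.
        if (val.toString().startsWith("A|")) { // illustrative tag for file A rows
          fromA.add(new Text(val));
        } else {
          fromB.add(new Text(val));
        }
      }
      // Emit the cross product of the two sides for this key.
      for (Text a : fromA) {
        for (Text b : fromB) {
          context.write(key, new Text(a + "," + b));
        }
      }
    }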
3
votes
0 answers

Hadoop - MapReduce - Cleanup after a killed job

How can I do a cleanup for a killed MapReduce job? I have implemented the commitJob() and abortJob() methods of the OutputCommitter API. It seems to work well when the job completes successfully or when the job fails, but it does not work…
urk
  • 31
  • 3
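
One thing worth checking (an assumption, since the excerpt is truncated): abortJob receives the job's final state, and on a kill it may simply never run if the application master itself is torn down first. When it does run, the state distinguishes KILLED from FAILED:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.JobStatus;

    // Inside an OutputCommitter subclass:
    @Override
    public void abortJob(JobContext context, JobStatus.State state)
        throws IOException {
      super.abortJob(context, state);
      if (state == JobStatus.State.KILLED) {
        // cleanup specific to killed jobs, e.g. deleting temporary output
      } else if (state == JobStatus.State.FAILED) {
        // cleanup specific to failed jobs
      }
    }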
3
votes
2 answers

Running a job in Hadoop - ERROR

I'm trying to run a program in hadoop: ~$ Desktop/HadoopProject2016.jar input output, and I keep getting this error: Exception in thread "main" java.lang.UnsupportedClassVersionError: hadoop_project_16/AggregateJob : Unsupported major.minor…
user6365140
3
votes
2 answers

CouchDB Directed Acyclic Graph (DAG)

If my structure looks like this: [{Name: 'A', Depends: []}, {Name: 'B', Depends: ['A']}, {Name: 'C', Depends: ['A']}, {Name: 'D', Depends: ['C']}, {Name: 'E', Depends: ['D','B']}] How would I write the map and reduce functions such that my output…
Stefan Mai
  • 23,367
  • 6
  • 55
  • 61
3
votes
1 answer

Does MongoDB's Map/Reduce always return results in floats?

I am using Mongoid, which is on top of the Ruby MongoDB driver. Even though my Map's emit is giving out a parseInt(num), and the Reduce's return is also giving back a parseInt(num), the final results are still floats. Is that particular to MongoDB? …
nonopolarity
  • 146,324
  • 131
  • 460
  • 740
3
votes
1 answer

Write data to local disk in each datanode

I want to store some values from the map task on the local disk of each data node. For example, public void map (...) { //Process List cache = new ArrayList(); //Add value to cache //Serialize cache to local file in this data…
nd07
  • 127
  • 9
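
A sketch of one way to do what the question describes: buffer values during map() and serialize them in cleanup() to a path on the local filesystem (java.io.tmpdir here; a real job might use the task's working directory instead). The class and file names are illustrative:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CachingMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final List<String> cache = new ArrayList<>();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        cache.add(value.toString()); // accumulate values for this task
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // cleanup() runs once per map task, on whichever node executed it,
        // so this file lands on that node's local disk.
        File local = new File(System.getProperty("java.io.tmpdir"),
            "cache-" + context.getTaskAttemptID() + ".ser");
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(local))) {
          out.writeObject(new ArrayList<>(cache)); // serialize the cached values
        }
      }
    }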