Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, suited to certain kinds of distributable problems

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
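
That description maps directly onto two user-supplied functions. A minimal word-count sketch in Python (illustrative only, not tied to any particular framework):

    def map_fn(_key, document):
        # Map: emit one intermediate (word, 1) pair per word in the value.
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: merge all intermediate values that share the key `word`.
        yield (word, sum(counts))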

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase: all that is required is that all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle; a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12,151 questions
3 votes, 3 answers

MapFile as an input to a MapReduce job

I recently started to use Hadoop and I have a problem using a MapFile as an input to a MapReduce job. The following working code writes a simple MapFile called "TestMap" in HDFS, where there are three keys of type Text and three values of type…

Luca • 31 • 1 • 2
3 votes, 1 answer

Hadoop on Mac in IntelliJ IDEA setup

Installed Hadoop using brew; now I want to run Hadoop jobs in IntelliJ IDEA. How do I set up the environment and resolve dependencies?

Atul Kaushik • 5,181 • 3 • 29 • 36
3 votes, 1 answer

Hive Merge Small ORC Files

My input consists of a large number of small ORC files which I would like to merge at the end of every day, splitting the data into 100 MB blocks. My input and output are both S3 and the environment is EMR; the Hive parameters which I am…

Rajiv • 392 • 6 • 22
3 votes, 2 answers

Reduce Multi-Dimensional Array

I'm currently trying to use map and reduce functions to flatten out a multidimensional array. This is a mock example data set: data: [ { label: "Sort-01", data: [ { label: "OCT-2017", weight: 2304 }, { …

Jay • 73 • 3 • 14
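
For questions like the one above, a minimal Python sketch of the flattening; the field names follow the mock data, while the second record and its values are invented for illustration:

    from functools import reduce

    data = [
        {"label": "Sort-01",
         "data": [{"label": "OCT-2017", "weight": 2304}]},
        {"label": "Sort-02",
         "data": [{"label": "OCT-2017", "weight": 512}]},
    ]

    # Map each outer record to its inner rows, then reduce by list
    # concatenation to obtain one flat list of {label, weight} entries.
    flat = reduce(lambda acc, rows: acc + rows,
                  map(lambda rec: rec["data"], data), [])
    print(flat)
    # [{'label': 'OCT-2017', 'weight': 2304}, {'label': 'OCT-2017', 'weight': 512}]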
3 votes, 1 answer

Partitioning by column in Apache Spark to S3

We have a use case where we want to read JSON files from S3. Then, based on a particular JSON node value, we want to group the data and write it back to S3. I am able to read the data but not able to find a good example of how to partition the data based…

Ajay • 473 • 7 • 25
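
A minimal PySpark sketch of the usual answer to the question above, DataFrameWriter.partitionBy; the bucket paths and the partition column event_type are placeholders, not taken from the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-by-column").getOrCreate()

    # Read the JSON input from S3 (placeholder path).
    df = spark.read.json("s3a://input-bucket/raw/")

    # partitionBy writes one directory per distinct column value,
    # e.g. s3a://output-bucket/out/event_type=click/part-....json
    df.write.partitionBy("event_type").mode("overwrite").json("s3a://output-bucket/out/")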
3 votes, 2 answers

Using reduce inside a Python map function

I have this case where I have a list of lists of lists, and I need to apply a reduce to each of the sub-lists of the first list. The reduce function requires two parameters, but that second parameter (the list of lists I want to apply the reduce to) is…

Abhishek • 438 • 1 • 6 • 16
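
A minimal sketch of the pattern the question above is after: map walks the outer list while functools.reduce collapses each sub-list (the sample data and the concatenating reducer are illustrative):

    from functools import reduce

    nested = [
        [[1, 2], [3], [4, 5]],   # a list of lists of lists
        [[6], [7, 8]],
    ]

    # For each sub-list of the outer list, reduce its inner lists by
    # concatenation; map supplies reduce's second parameter per sub-list.
    flattened = list(map(lambda sub: reduce(lambda a, b: a + b, sub), nested))
    print(flattened)  # [[1, 2, 3, 4, 5], [6, 7, 8]]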
3 votes, 0 answers

Pass custom object to reducer, getting null values

I'm new to MapReduce. My mapper outputs a DBWritable object, but in the reducer I can't get any value from the passed object; maybe it wasn't passed at all? Here is my DBWritable code: public class StWritable implements DBWritable, Writable { …

iwish • 31 • 5
3 votes, 0 answers

MongoDB reduce fails on value too large

I'm trying to use map-reduce on a large set of documents where, for each document, there are a few fields which I want to gather into an array (an array for each field type). I thought that map-reduce is the right pattern for this job, but I am receiving…

Eytan • 728 • 1 • 7 • 19
3 votes, 2 answers

How to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file? I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details: the…

sandeepkunkunuru • 6,150 • 5 • 33 • 37
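
For a pure line-by-line transformation like the one described above, a common approach is a map-only Hadoop Streaming job: run a mapper script and set the reducer count to zero so mapper output is written straight to the output files. A minimal sketch, with the actual transformation left as a placeholder:

    #!/usr/bin/env python
    # mapper.py for Hadoop Streaming. Run with zero reducers, e.g.
    #   -D mapreduce.job.reduces=0
    # so that each transformed line goes directly to the job output.
    import sys

    def transform(line):
        return line.upper()   # placeholder; real logic depends on the job

    for line in sys.stdin:
        sys.stdout.write(transform(line.rstrip("\n")) + "\n")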
3 votes, 1 answer

Hadoop MapReduce: java.io.EOFException: Premature EOF: no length prefix available

When I try the Example: WordCount v1.0 from http://hadoop.apache.org/docs/r2.7.4/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0 I get the warnings and exceptions below. And I found that when I put…

Qinghe Wang • 61 • 2 • 8
3 votes, 2 answers

Using PIG with Hadoop, how do I regex match parts of text with an unknown number of groups?

I'm using Amazon's Elastic MapReduce. I have log files that look something like this: random text foo="1" more random text foo="2" more text notamatch="5" noise foo="1" blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ... How can…

lmonson • 101 • 4
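
Setting Pig aside, the core matching problem in the question above (an unknown number of foo="…" occurrences per line) can be sketched in Python: re.findall returns one entry per match, so no fixed number of capture groups is needed. A Pig solution would wrap equivalent logic in a UDF or one of its regex built-ins:

    import re

    line = 'random text foo="1" more random text foo="2" more text notamatch="5" noise foo="1"'

    # findall returns every non-overlapping match, however many there are.
    values = re.findall(r'foo="(\d+)"', line)
    print(values)  # ['1', '2', '1']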
3 votes, 1 answer

How are Hive SQL queries submitted as MR jobs from the Hive CLI?

I have deployed a CDH 5.9 cluster with MR as the Hive execution engine. I have a Hive table named "users" with 50 rows. Whenever I execute the query select * from users, it works fine, as follows: hive> select * from users; OK Adam 1 38 …

S.K. Venkat • 1,749 • 2 • 23 • 35
3 votes, 4 answers

Iterative MapReduce

I've written a simple k-means clustering code for Hadoop (two separate programs, mapper and reducer). The code is working over a small dataset of 2D points on my local box. It's written in Python and I plan to use the Streaming API. I would like…

Deepak • 731 • 2 • 9 • 14
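
Classic MapReduce has no built-in iteration, so the usual pattern for k-means with the Streaming API, as asked above, is an external driver that launches one job per pass and feeds each pass's centroids back in as the next pass's side input. A minimal sketch; the jar path, file names, and fixed iteration cap are all illustrative:

    import subprocess

    def run_pass(i):
        # One MapReduce pass: the mapper assigns each point to its nearest
        # centroid; the reducer averages each cluster into a new centroid.
        subprocess.run([
            "hadoop", "jar", "hadoop-streaming.jar",
            "-files", "mapper.py,reducer.py,centroids.txt",
            "-mapper", "mapper.py", "-reducer", "reducer.py",
            "-input", "points/", "-output", "centroids-%d/" % i,
        ], check=True)

    for i in range(10):    # fixed cap; real code would test convergence
        run_pass(i)
        # Fetch centroids-%d/part-00000, rewrite centroids.txt, and stop
        # early once the centroids move less than some threshold.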
3 votes, 1 answer

How to avoid a large intermediate result before reduce?

I'm getting an error in a Spark job that surprises me: Total size of serialized results of 102 tasks (1029.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) My job is like this: def add(a,b): return a+b sums =…

user48956 • 14,850 • 19 • 93 • 154
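
One standard way around this particular error, assuming the job ends in an RDD reduce as the snippet suggests, is treeReduce, which merges partial results on the executors in stages so the driver never receives one result per task at once (the data and depth here are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="tree-reduce-sketch")

    def add(a, b):
        return a + b

    rdd = sc.parallelize(range(1000000), 200)   # 200 partitions

    # treeReduce combines partition results in a multi-level tree of
    # executor-side aggregations; only the final value reaches the driver.
    total = rdd.treeReduce(add, depth=3)
    print(total)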
3 votes, 2 answers

Where can I find an HBase Cascading module for hbase-0.89.20100924+28?

I am working on a project using MapReduce and HBase. We are using Cloudera's CDH3 distribution, which has hbase-0.89.20100924+28 bundled into it. I would like to use Cascading, as we have some processing that requires multiple MapReduce jobs, but I…

Rob • 245 • 1 • 5 • 14