Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
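As a minimal sketch of the model described above, the classic word-count example can be written in plain Python. The names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical and chosen only for illustration; the driver runs in a single process and is not how Hadoop or any other framework implements the model.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Input records are (line number, line text) pairs.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

The grouping step in the middle is what a real framework does during its shuffle phase, so the user only has to supply the two functions.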

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.
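The requirement that all map outputs sharing a key reach the same reducer is usually met by partitioning on a hash of the key. The sketch below assumes string keys and uses `zlib.crc32` as a stable hash; the helper names `partition` and `shuffle` are hypothetical and do not mirror any particular framework's API.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Route every (key, value) pair to the bucket of its reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


# Intermediate pairs as they might come out of several mappers.
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
buckets = shuffle(pairs, num_reducers=3)

# Each reducer sums the values in its own bucket independently.
totals = {k: sum(vs) for bucket in buckets for k, vs in bucket.items()}
```

Because the reducer index depends only on the key, every occurrence of a key lands in the same bucket no matter which mapper produced it.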

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
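The two steps above can be sketched with a thread pool standing in for the worker nodes. This is only an illustration under stated assumptions: the sub-problem (a partial sum of squares) and the names `split`, `worker` and `master` are hypothetical, and a real cluster would distribute the chunks across machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_chunks):
    """Master: chop the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """Worker: solve one sub-problem (here, a partial sum of squares)."""
    return sum(x * x for x in chunk)


def master(data, num_workers=4):
    """Distribute the chunks, collect the partial answers, combine them."""
    chunks = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)  # the "reduce" step


result = master(list(range(1, 101)))
```

The combine step here is a plain sum, which works because addition is associative; that property is what lets the partial answers be merged in any order.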

12151 questions

2 answers

How to generate a UUID in MapReduce?

I want to write a MapReduce Java program in which I need to create a UUID for a set of data in a CSV/TXT file. The data will be customer data with a set of rows and columns. The input CSV is located in an HDFS directory. Just need to generate UUID using…

asked Jul 07 '17 at 19:48

Rishab Oberoi

1 answer

Why is JPS showing no process running?

I am running Hadoop using the apache-hadoop binary, and I have started the DFS, YARN and MR daemons using the following commands: start-dfs.sh; start-yarn.sh; mr-jobhistory-daemon.sh start historyserver. After this everything is working fine, viz., I could…

hadoop mapreduce hdfs hadoop-yarn hadoop2

asked Jun 21 '17 at 12:33

KayV

0 answers

Mapreduce implementation

The input data is a JSON file and the structure of the records is: {id=x, h1=0.1, h2=0.3, h3=0.8, h4=0.7}. The task is to implement a MapReduce execution to get the "h" triples that contain a peak. In the previous example the output is x-> h2,h3,h4, because…

java hadoop mapreduce

asked Jun 16 '17 at 06:44

RamsesXVII

3 answers

In bash how to transform multimap to a map of

I am processing output from a file in bash and need to group values by their keys. For example, I have the…

bash mapreduce

asked Jun 06 '17 at 01:58

Anoop

1 answer

When does a mapper store its output to its local hard disk?

I know that the output of the Mapper (intermediate data) is stored on the local file system (not HDFS) of each individual mapper data node. This is typically a temporary directory which can be set up in the config by the Hadoop administrator. Once the…

hadoop apache-spark mapreduce mapper reducers

asked Jun 03 '17 at 16:26

Neha Sharma

1 answer

Do we really need sorting in the MapReduce framework?

I am completely new to MapReduce and just can't get my mind around the need to sort the mapper output according to the keys in each partition. Eventually all we want is that a reducer is fed a partition which consists of several pairs of

sorting hadoop mapreduce

asked Jun 03 '17 at 13:59

hesk

3 answers

MongoDB MapReduce is much slower than pure Java processing?

I wanted to count all keys of my documents (including embedded ones) in a collection. First I wrote a Java client to solve this. It took less than 4 seconds to show the result. Then I wrote a map/reduce function. The result was fine but running…

java performance mongodb mapreduce

asked Dec 13 '10 at 15:26

Kay

1 answer

python mapreduce - Skipping the first line of the .csv in mapper

I am trying to do mapreduce in python and my csv file looks like below, trip_id taxi_id pickup_time dropoff_time ... total 0 20117 2455.0 2013-05-05 09:45:00 50.44 1 44691 1779.0 2013-06-24 11:30:00 66.78 and my…

python csv hadoop mapreduce mrjob

asked May 28 '17 at 21:17

TTaa

1 answer

How to handle Incremental Update in HDFS hadoop Map-Reduce

I have structured base text files in HDFS which have data like this (in…

hadoop apache-spark mapreduce hdfs

asked May 25 '17 at 05:31

Sudarshan kumar

1 answer

Getting "User [dr.who] is not authorized to view the logs for application " while running a YARN application

I'm running a custom YARN application using Apache Twill in an HDP 2.5 cluster, but I'm not able to see my own container logs (syslog, stderr and stdout) when I go to my container web page. Also, the login changes from my Kerberos user to "dr.who" when I…

hadoop mapreduce hadoop-yarn hadoop2 apache-twill

asked May 09 '17 at 19:19

insanely_sin

3 answers

Hadoop MapReduce InputFormat Deprecated?

I need to implement a custom (service) input source for a Hadoop MapReduce app. I google'd and SO'd and found that one way to proceed is to implement a custom InputFormat. Is that correct? Apparently according to…

hadoop mapreduce

asked Dec 08 '10 at 04:39

Sri

1 answer

Using Hadoop to "bucket" data out with a single run

Is it possible to use one Hadoop job run to output data to different directories based on keys? My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns. For example, anything…

hadoop mapreduce

asked Dec 07 '10 at 20:48

James Cramer

2 answers

dask bag foldby with numpy arrays

I get a very uninformative FutureWarning message from dask / numpy when doing a foldby on a dask.bag that contains numpy arrays. def binop(a, b): print('binop') return a + b[1] def combine(a, b): print('combine') return a +…

python numpy parallel-processing mapreduce dask

asked May 05 '17 at 12:13

Matti Lyra

0 answers

Using mapreduce with Mongoose, not able to find emit function

I am trying to create a mapreduce function using the example on the Mongoose website located here. This application is being created using TypeScript and node.js. This example uses a function called emit which I am not able to find. I keep getting…

mongoose mapreduce typescript-typings

asked Apr 25 '17 at 18:13

user1790300

1 answer

Apache Ignite map-reduce way of solving equations

I have an equation that can be described by a tree. The leaves are values, with each parent vertex being a math operator; when the computation is done, another value appears in the place of the parent vertex and it becomes a leaf with a parent vertex (as…

java mapreduce ignite

asked Apr 13 '17 at 02:33

Ram
