Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
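As a rough illustration of the map and reduce functions described above, here is a minimal single-process sketch in Python that counts words. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and chosen for this example only; they are not part of Hadoop or any other framework, and a real implementation would distribute the work across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one input pair (line number, line text) into (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical driver: apply map, group by intermediate key, apply reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each key is reduced exactly once, with all of its values together.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

The grouping step in the middle stands in for what frameworks usually call the shuffle; here it is only an in-memory dictionary.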

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
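The requirement that all map outputs sharing a key reach the same reducer is commonly met by partitioning on a hash of the key. The sketch below assumes string keys and uses `zlib.crc32` as the hash, because Python's built-in `hash()` for strings varies between processes. The `partition` and `shuffle` helpers are hypothetical names for this example, not a framework API.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its reducer's bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


# Outputs of three independent mappers (made-up data for illustration).
mapper_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]
buckets = shuffle(mapper_outputs, num_reducers=2)

# Each "reducer" only sees its own bucket, yet still gets every value for its keys.
totals = {}
for bucket in buckets:
    for key, values in bucket.items():
        totals[key] = sum(values)
print(totals)
```

Because the reducer index depends only on the key, a failed mapper can be rerun and its output will be routed to the same buckets as before.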

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them to produce the output: the answer to the problem it was originally trying to solve.
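The two steps above can be sketched on one machine, with a thread pool standing in for the worker nodes. This is only an illustration under that assumption: the `split`, `worker`, and `master` functions are hypothetical, the problem (a sum of squares) is picked for brevity, and there is no multi-level tree of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split(numbers, num_chunks):
    """Master: chop the input into smaller sub-problems."""
    return [numbers[i::num_chunks] for i in range(num_chunks)]


def worker(chunk):
    """Worker: solve one sub-problem and hand the answer back."""
    return sum(x * x for x in chunk)


def master(numbers, num_workers=3):
    chunks = split(numbers, num_workers)
    # "Map" step: distribute the sub-problems to the workers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, chunks))
    # "Reduce" step: combine the partial answers into the final output.
    return sum(partials)


result = master(list(range(1, 11)))
print(result)
```

The combine step works here because addition is associative and commutative, so the order in which partial answers arrive does not matter.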

12151 questions
3
votes
2 answers

How to generate UUID in Mapreduce?

I want to write a MapReduce Java program where I need to create a UUID for a set of data in a csv/txt file. The data will be customer data with a set of rows and columns. The input csv is located in the HDFS directory. Just need to generate UUID using…
3
votes
1 answer

Why is JPS showing no process running?

I am running Hadoop using the apache-hadoop binary and I have started the dfs, yarn and mr daemons using the following commands: start-dfs.sh start-yarn.sh mr-jobhistory-daemon.sh start historyserver After this everything is working fine, viz, I could…
KayV
  • 12,987
  • 11
  • 98
  • 148
3
votes
0 answers

Mapreduce implementation

Input data is a json file and the structure of records is: {id=x, h1=0.1, h2=0.3, h3=0.8, h4=0.7}. The task is to implement a mapreduce execution to get "h" triples that contain a peak. In the previous example the output is x-> h2,h3,h4, because…
RamsesXVII
  • 295
  • 2
  • 11
3
votes
3 answers

In bash how to transform multimap to a map of

I am processing output from a file in bash and need to group values by their keys. For example, I have the…
Anoop
  • 5,540
  • 7
  • 35
  • 52
3
votes
1 answer

When does a mapper store its output to its local hard disk?

I know that the output of the Mapper (intermediate data) is stored on the local file system (not HDFS) of each individual mapper data node. This is typically a temporary directory which can be set up in config by the Hadoop administrator. Once the…
Neha Sharma
  • 295
  • 1
  • 2
  • 12
3
votes
1 answer

Do we really need sorting in the MapReduce framework?

I am completely new to MapReduce and just can't get my mind around the need to sort the mapper output according to the keys in each partition. Eventually all we want is that a reducer is fed a partition which consists of several pairs of
hesk
  • 317
  • 3
  • 11
3
votes
3 answers

MongoDB MapReduce is much slower than pure Java processing?

I wanted to count all keys of my documents (including embedded ones) in a collection. First I wrote a Java client to solve this. It took less than 4 seconds to show the result. Then I wrote a map/reduce function. The result was fine but running…
Kay
  • 624
  • 1
  • 7
  • 17
3
votes
1 answer

python mapreduce - Skipping the first line of the .csv in mapper

I am trying to do mapreduce in python and my csv file looks like below, trip_id taxi_id pickup_time dropoff_time ... total 0 20117 2455.0 2013-05-05 09:45:00 50.44 1 44691 1779.0 2013-06-24 11:30:00 66.78 and my…
TTaa
  • 331
  • 5
  • 12
3
votes
1 answer

How to handle Incremental Update in HDFS hadoop Map-Reduce

I have structured base text files in HDFS which have data like this (in…
Sudarshan kumar
  • 1,503
  • 4
  • 36
  • 83
3
votes
1 answer

Getting "User [dr.who] is not authorized to view the logs for application " while running a YARN application

I'm running a custom YARN application using Apache Twill in an HDP 2.5 cluster, but I'm not able to see my own container logs (syslog, stderr and stdout) when I go to my container web page: Also the login changes from my Kerberos to "dr.who" when I…
insanely_sin
  • 986
  • 1
  • 14
  • 22
3
votes
3 answers

Hadoop MapReduce InputFormat Deprecated?

I need to implement a custom (service) input source for a Hadoop MapReduce app. I google'd and SO'd and found that one way to proceed is to implement a custom InputFormat. Is that correct? Apparently according to…
Sri
  • 5,805
  • 10
  • 50
  • 68
3
votes
1 answer

Using Hadoop to "bucket" data out with a single run

Is it possible to use one Hadoop job run to output data to different directories based on keys? My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns. For example, Anything…
3
votes
2 answers

dask bag foldby with numpy arrays

I get a very uninformative FutureWarning message from dask / numpy when doing a foldby on a dask.bag that contains numpy arrays. def binop(a, b): print('binop') return a + b[1] def combine(a, b): print('combine') return a +…
Matti Lyra
  • 12,828
  • 8
  • 49
  • 67
3
votes
0 answers

Using mapreduce with Mongoose, not able to find emit function

I am trying to create a mapreduce function using the example on the Mongoose website located here. This application is being created using TypeScript and node.js. This example uses a function called emit which I am not able to find. I keep getting…
user1790300
  • 2,143
  • 10
  • 54
  • 123
3
votes
1 answer

Apache Ignite map-reduce way of solving equations

I have an equation that can be described by a tree. So the leaves are values with parent vertex being a math operator and when the computation is done, another value appears in the place of parent vertex and it becomes a leaf with a parent vertex(as…
Ram
  • 325
  • 4
  • 22