Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
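As a rough illustration of the two user-supplied functions described above, here is a minimal single-process sketch of a word count. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and chosen for this example only; a real framework such as Hadoop distributes these steps across machines instead of running them in one loop.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: take one (line_number, line) pair and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
print(word_counts)
```

The grouping step in the middle stands in for what distributed implementations usually call the shuffle.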

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
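The requirement that all map outputs sharing a key reach the same reducer is commonly met by partitioning on a hash of the key. The sketch below assumes a CRC32-based partitioner and three reducers; the `partition` and `shuffle` helpers are hypothetical names for illustration, not the API of any particular framework.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to its reducer's bucket, grouped by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

# Intermediate pairs as several independent mappers might have emitted them.
pairs = [("fox", 1), ("the", 1), ("fox", 1), ("dog", 1), ("the", 1)]
buckets = shuffle(pairs, num_reducers=3)

# Each bucket can be reduced independently of the others.
totals = {}
for bucket in buckets:
    for key, values in bucket.items():
        totals[key] = sum(values)
print(totals)
```

`zlib.crc32` is used here instead of Python's built-in `hash()` because string hashes from `hash()` vary between interpreter runs, which would send the same key to different reducers on different machines.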

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
3
votes
1 answer

Map Reduce Job to find the popular items in a time window

I was asked this question in an interview, and I'm not sure if I gave the proper answer, so I would like some insights. The problem: There is a stream of users and items. At each minute, I receive a list of tuples (user, item), representing that a…
Thiago
  • 694
  • 3
  • 12
  • 26
3
votes
3 answers

Merging small files into single file in hdfs

In an HDFS cluster, I receive multiple files on a daily basis, which can be of 3 types: 1) product_info_timestamp 2) user_info_timestamp 3) user_activity_timestamp. Any number of files can be received, but they will belong to one of…
user3829376
  • 75
  • 4
  • 11
3
votes
1 answer

Hadoop: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

I'm trying to launch a fairly simple WordCount (I pretty much followed this tutorial) after installing Hadoop but I get this: 2018-04-05 16:51:00,192 INFO mapreduce.Job: Job job_1522936330711_0007 failed with state FAILED due to: Application…
3
votes
2 answers

Why does tupleWritable become empty when passed to the reducer

I have a map(Object key, Text value, Context context) that puts a tupleWritable into the context with context.write(). In reduce(Text key, Iterable values, Context context), I read the tupleWritable, but it's empty. Below is my code. That confused me…
wangguanguo
  • 181
  • 4
3
votes
1 answer

Sample outputs of Rumen or Sample input to Gridmix

I am quite new to the use of big data tools like Hadoop. I want to execute a publicly available cluster trace (https://github.com/google/cluster-data) on YARN or a YARN simulator. One way to do this is to feed input into YARN via Gridmix. The format in…
PHcoDer
  • 1,166
  • 10
  • 23
3
votes
1 answer

Hive - Select count(*) not working with Tez but works with MR

I have a Hive external table with parquet data. When I run select count(*) from table1, it fails with Tez. But when execution engine is changed to MR it works. Any idea why it's failing with Tez? I'm getting the following error with Tez: Error:…
kunrazor
  • 341
  • 1
  • 4
  • 10
3
votes
2 answers

For NetSuite Map/Reduce script - Why is map stage failing when being called from Restlet?

In NetSuite, I have a Restlet script that calls a deployed map/reduce script, but the map stage shows as Failed in the details of the status page (the getInputData stage does run and shows as Complete). However, if I do a "Save and Execute" from…
3
votes
1 answer

error -Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable

I was trying to write MapReduce code in Java. Here are my files. Mapper class (bmapper): public class bmapper extends Mapper{ private String txt=new String(); public void mapper(LongWritable key,Text…
khush
  • 23
  • 3
3
votes
0 answers

Sorting an entire dataset in Apache Beam

Let's say that I have a massive collection of strings and I wish to use Apache Beam to sort it. Is this possible? I only managed to find documentation about running a sort on a single machine, but what I'm looking for is a distributed sort algorithm.
tohava
  • 5,344
  • 1
  • 25
  • 47
3
votes
2 answers

Map-Reduce to solve matrix multiplication in Python with Hadoop

I would like to apply map-reduce to matrix multiplication in Python with Hadoop. The goal is to calculate A * B. The output should be in a format similar to the input. The input is two matrices, A and B, and the format looks like…
HHKSHD_HH
  • 73
  • 1
  • 8
3
votes
0 answers

JanusGraph loading with MapReduce

I am using JanusGraph with HBase as the storage backend. I currently have terabytes of RDBMS data in HDFS. I would like to write MapReduce code that transforms the RDBMS data to a graph format and then writes that to JanusGraph. I can't seem to find…
user3207663
  • 156
  • 1
  • 9
3
votes
1 answer

Does Hive QL have the same expressive power as writing your own MapReduce jobs directly on Hadoop?

To put it in other words, is there a problem that can be solved by directly defining your MapReduce jobs, but for which you cannot form a Hive QL query? If yes, then it means that Hive QL is limited in its expressive power and cannot express all…
user855
  • 19,048
  • 38
  • 98
  • 162
3
votes
3 answers

HBase Mapreduce on multiple scan objects

I am just trying to evaluate HBase for some of the data analysis we are doing. HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is…
StackUnderflow
  • 24,080
  • 14
  • 54
  • 77
3
votes
1 answer

How is JavaScript's Reduce assigning its value?

Problem: I do not understand how reduce is assigning/reducing the customer name from the array. I need someone to please explain precisely what is happening here. Detailed Description: In episode 4 of Fun Fun Function's functional programming…
Anthony Gatlin
  • 4,407
  • 5
  • 37
  • 53
3
votes
4 answers

How to reduce on a list of tuples in python

I have an array and I want to count the occurrence of each item in the array. I have managed to use a map function to produce a list of tuples. def mapper(a): return (a, 1) r = list(map(lambda a: mapper(a), arr)); //output example:…
Lee
  • 2,874
  • 3
  • 27
  • 51