Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets on certain kinds of distributable problems using a large number of nodes.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time.

While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.

12151 questions
3
votes
1 answer

How to define parquet schema for ParquetOutputFormat for Hadoop job in java?

I have a Hadoop job in Java which has a sequence file output format: job.setOutputFormatClass(SequenceFileOutputFormat.class); I want to use the Parquet format instead. I tried to set it in the naive…
Viacheslav Shalamov
  • 4,149
  • 6
  • 44
  • 66
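
For the Parquet question above, a hedged sketch of one common approach, assuming the parquet-avro bindings (org.apache.parquet.avro.AvroParquetOutputFormat) are on the classpath; the schema and helper name are illustrative:

```java
import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetJobSetup {
  // Hypothetical helper: switch an existing Job from SequenceFileOutputFormat to Parquet.
  public static void useParquetOutput(Job job) {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Record\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"value\",\"type\":\"string\"}]}");

    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, schema); // Parquet schema derived from the Avro schema
    // The reduce (or map-only) output values should then be Avro GenericRecords.
  }
}
```
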
3
votes
5 answers

JavaScript Map / Reduce to return grouped by count

I have a JSON collection as an array. I would like to group by three fields within the collection and then return the result along with the count of the matching documents. The example below will hopefully make it clearer. The JSON document…
Dave
  • 53
  • 9
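
The question above is about JavaScript, but the same group-by-three-fields-then-count idea can be sketched in Java streams (the document fields here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByCount {
  // Hypothetical document with three grouping fields.
  record Doc(String country, String city, String category) {}

  public static void main(String[] args) {
    List<Doc> docs = Arrays.asList(
        new Doc("US", "NYC", "books"),
        new Doc("US", "NYC", "books"),
        new Doc("US", "LA", "music"));

    // "Map" each document to its composite key, then "reduce" by counting per key.
    Map<List<String>, Long> counts = docs.stream()
        .collect(Collectors.groupingBy(
            d -> List.of(d.country(), d.city(), d.category()),
            Collectors.counting()));

    counts.forEach((key, count) -> System.out.println(key + " -> " + count));
  }
}
```
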
3
votes
2 answers

Can a "pre-computed" map-reduce index (à la RavenDB/CouchDB) be used for this kind of algorithm?

I'm trying to see if a specific algorithm can be translated to the kind of map-reduce index RavenDB/CouchDB uses, i.e., "pre-computed" map-reduce (which means the indexes are refreshed on insertion and updates, not when performing the actual…
Simon Labrecque
  • 577
  • 7
  • 15
3
votes
1 answer

Combiner function in python hadoop streaming

I have a mapper that outputs key and value, which is sorted and piped into reducer.py. As the keys are already sorted, before I get to the reducer I want to write a combiner which iterates through the sorted list and outputs key, [v1,v2,v3]…
AlgoMan
  • 2,785
  • 6
  • 34
  • 40
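
For the combiner question above: in Hadoop streaming the combiner is a separate script passed with the -combiner option, while in the Java API the same idea is just a Reducer registered as a combiner. A sketch of the Java-side setup, reusing the word-count classes from the tag description example above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
  // Reuses the word-count Mapper/Reducer sketched in the tag description above.
  public static Job configure() throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(CombinerSetup.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The combiner is a Reducer that runs on the map side, merging values
    // per key before they are shuffled across the network.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    return job;
  }
}
```

Note that the framework may invoke a combiner zero or more times, so its output must still be acceptable as reducer input.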
3
votes
1 answer

PouchDB view lookup by key

I have a CouchDB database named t-customers. Using Fauxton I've created the following view: t-customers/_design/t-cust-design/_view/by-custdes. Here is the map function: function (doc) { var custname = doc.CUSTNAME; if(custname != undefined &&…
altgov3en
  • 1,100
  • 2
  • 18
  • 33
3
votes
1 answer

HBase bulk load: append data instead of overwriting it

Actually I'm loading data into HBase with the help of MapReduce and bulk load, which I implemented in Java. So basically I created a Mapper and use HFileOutputFormat2.configureIncrementalLoad (full code at the end of the question) for the reduce step, and I…
Bierbarbar
  • 1,399
  • 15
  • 35
3
votes
2 answers

How does Hadoop fix the number of mappers or input splits when a MapReduce task is run over multiple input files?

I've four input files (CSV) of sizes 453 MB, 449 MB, 646 MB and 349 MB. Together they add up to a total size of 1.85 GB. The HDFS block size is 128 MB. The record size is very small, as there are hardly 20 fields. After the completion of the MapReduce task, I can…
SatishV
  • 393
  • 4
  • 22
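
With the default FileInputFormat behaviour, splits do not span files and the split size defaults to the block size, so a rough estimate for the sizes in the question above (128 MB blocks) is ceil(size / 128 MB) splits per file: 453 MB → 4, 449 MB → 4, 646 MB → 6, 349 MB → 3, i.e. about 17 input splits and therefore about 17 map tasks. The exact count can come out one or two lower, because FileInputFormat allows the last split of a file to grow up to 10% beyond the split size rather than creating a tiny trailing split.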
3
votes
1 answer

Which protocol is used in Hadoop to copy the data from Mappers to Reducers?

I have some doubt regarding the transfer protocol used by the Hadoop framework to copy the mapper output (which is stored locally on the mapper node) to the reducer tasks (which are not running on the same node). I read in some blogs that it uses HTTP for…
SurjanSRawat
  • 489
  • 1
  • 6
  • 20
3
votes
1 answer

Count records in MongoDB by regex match

I have records in database that contains URLs. For example, https://www.youtube.com/watch?v=blablabla. I want to count URLs for each site. For example [{ site: 'youtube.com', count: 25 }, { site: 'facebook.com', count: 135 }] I…
3
votes
2 answers

How to pick up the earliest timestamp date from the RDD in scala

I have an RDD which would be like ((String, String), TimeStamp). I have a large number of records and I want to select, for each key, the record with the latest TimeStamp value. I have tried the following code and am still struggling to do this. Can anybody…
Kepler
  • 399
  • 1
  • 7
  • 19
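
One common approach to the question above, sketched here with the Java Spark API rather than Scala (the sample keys and timestamps are made up), is to reduceByKey and keep whichever record has the later timestamp:

```java
import java.sql.Timestamp;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LatestPerKey {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("latest-per-key").setMaster("local[*]"));

    // Hypothetical input: ((user, url), timestamp) pairs.
    JavaPairRDD<Tuple2<String, String>, Timestamp> records = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(new Tuple2<>("u1", "a"), Timestamp.valueOf("2017-01-01 10:00:00")),
        new Tuple2<>(new Tuple2<>("u1", "a"), Timestamp.valueOf("2017-02-01 10:00:00")),
        new Tuple2<>(new Tuple2<>("u2", "b"), Timestamp.valueOf("2017-01-15 08:00:00"))));

    // For each key, keep the later timestamp (swap the comparison for the earliest).
    JavaPairRDD<Tuple2<String, String>, Timestamp> latest =
        records.reduceByKey((a, b) -> a.after(b) ? a : b);

    latest.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
    sc.stop();
  }
}
```
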
3
votes
1 answer

Hive Vertex failed: killed/failed due to:ROOT_INPUT_INIT_FAILURE Caused by: java.lang.NullPointerException

I was querying a table, a simple count(*), and received the following error: Vertex failed, vertexName=Map 1, vertexId=vertex_1486982569467_0809_3_00, diagnostics=[Vertex vertex_1486982569467_0809_3_00 [Map 1] killed/failed due…
Sahil
  • 413
  • 2
  • 5
  • 8
3
votes
0 answers

Need a better explanation of Communication Cost Model for MapReduce than in MMDS

I was going through the MMDS book, which has an online MOOC by the same name. I'm having trouble understanding the Communication Cost Model and the join operation calculations mentioned in Topic 2.5, and am surprised by how poorly organized the book is, as the…
Shehryar
  • 530
  • 1
  • 4
  • 18
3
votes
2 answers

Mapreduce questions

I am trying to implement a MapReduce program to do word counts from 2 files, and then compare the word counts from these files to see what the most common words are... I noticed that after doing the word count for file 1, the results that go into the…
Jon
  • 71
  • 1
  • 1
  • 5
3
votes
2 answers

Hadoop: How to find out the partition_Id in reduce step using Context object

In Hadoop API version 0.20 and above the Context object was introduced instead of JobConf. Using the Context object I need to find out: the partition_id for the current Reducer, and the output folder. Using the obsolete JobConf I can find the partition_id for the current…
user510040
  • 159
  • 2
  • 10
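
A hedged sketch of how this is commonly read from the new-API Context (the calls below come from the standard org.apache.hadoop.mapreduce API; the reducer class itself is illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartitionAwareReducer extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  protected void setup(Context context) {
    // The partition id of this reducer is the id of its reduce task.
    int partitionId = context.getTaskAttemptID().getTaskID().getId();

    // The job's output folder, as configured via FileOutputFormat.setOutputPath(...).
    Path outputDir = FileOutputFormat.getOutputPath(context);

    System.out.println("partition=" + partitionId + ", outputDir=" + outputDir);
  }
}
```
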
3
votes
1 answer

pyspark: Filter one RDD based on certain columns of another RDD

I have two files in a Spark cluster, foo.csv and bar.csv, both with 4 columns and the same exact fields: time, user, url, category. I'd like to filter foo.csv by certain columns of bar.csv. In the end, I want key/value pairs of (user,…
Andrew Chong
  • 63
  • 1
  • 4