
I have a dataset made up of many small files (30-40 MB each on average). I want to run analytics on them with MapReduce, but with each job the mapper reads the files all over again, which puts a heavy load on I/O performance (open/seek overheads and so on).

I wanted to know whether it is possible to run the mapper once and emit different outputs for different reducers. As I looked around, I saw that multiple reducers are not possible; the only option is job chaining. However, I want to run these jobs in parallel, not sequentially, since they all use the same dataset as input and run different analytics. In summary, what I want is something like this:

               Reducer = Analytics1
              /
    Mapper --- Reducer = Analytics2
              \
               Reducer = Analytics3
                      ...

Is this possible, or do you have any suggestions for a workaround? Please give me some ideas. Reading these small files over and over again creates a huge overhead and a real performance penalty for my analysis.

Thanks in advance!

Edit: I forgot to mention that I'm using Hadoop v2.1.0-beta with YARN.

Engin Sözer
  • You could have your reducer(s) do all the Analytics (1-3) in the same pass/job. – cabad Oct 10 '13 at 15:27
  • But each reducer might take different inputs (<k,v> pairs), so running all the analytics in only one reducer does not work for me. The mapper should emit different <k,v> pairs for different reducers (analytics). In addition, with different <k,v> pairs, I want to be able to benefit from the Shuffle & Sort mechanism that happens just before the reducers. – Engin Sözer Oct 10 '13 at 15:37

2 Answers


You can:

  1. Have your reducer(s) do all the Analytics (1-3) in the same pass/job. EDIT: From your comment I see that this alternative is not useful for you, but I am leaving it here for future reference, since in some cases it is possible to do this.
  2. Use a more generalized model than MapReduce. For example, Apache Tez (still an incubator project) can be used for your use case.

Some useful references on Apache Tez:

  • The Apache Tez project page: https://tez.apache.org/

EDIT: Added the following regarding Alternative 1:

You could also make the mapper generate a key indicating which analytics process the output is intended for. Hadoop will automatically group records by this key and send them all to the same reducer. The value generated by the mapper would be a tuple of the form <k,v>, where the key (k) is the original key you intended to generate. Thus, the mapper generates <k_analytics, <k,v>> records. The reducer reads the key and, depending on it, calls the appropriate analytics method (within your reducer class). This approach works, but only if your reducers do not have to deal with huge amounts of data, since you will likely need to keep it in memory (in a list or a hashtable) while you run the analytics process (the <k,v> tuples won't be sorted by their key). If this is not something your reducer can handle, then the custom partitioner suggested by @praveen-sripati may be an option to explore.
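
A minimal sketch of this approach, assuming the Hadoop 2.x mapreduce API and Text values (the class names, tags, and tab delimiter are illustrative, not from the original post):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Tags each record with the analytics job it is intended for and tucks
    // the original <k,v> pair into the value (tab-separated here).
    public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final String[] TAGS = {"analytics1", "analytics2", "analytics3"};

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String tag : TAGS) {
                // One output record per analytics type.
                context.write(new Text(tag), new Text(key.get() + "\t" + value));
            }
        }
    }

    // All <k,v> pairs for one analytics type arrive in a single reduce() call,
    // so they may need to be buffered in memory before processing.
    public class DispatchingReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text tag, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String t = tag.toString();
            if (t.equals("analytics1")) {
                // buffer the <k,v> tuples and run Analytics1 here
            } else if (t.equals("analytics2")) {
                // ... Analytics2 ...
            } else {
                // ... Analytics3 ...
            }
        }
    }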

EDIT: As suggested by @judge-mental, alternative 1 can be further improved by having the mappers emit <<k_analytics, k>, value>; in other words, make the original key part of the composite key along with the analytics type, instead of putting it in the value. That way each (analytics type, key) group arrives at a reducer as a single reduce call, and the reducer can perform streaming operations on the values without having to keep them in RAM.
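
A sketch of that composite-key variant (again with hypothetical names; a production version would more likely use a custom WritableComparable than a delimiter inside a Text key):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Makes the analytics tag part of the key itself, e.g. "analytics1|originalKey".
    public class CompositeKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String tag : new String[] {"analytics1", "analytics2", "analytics3"}) {
                context.write(new Text(tag + "|" + key.get()), value);
            }
        }
    }

    // Each (tag, original key) pair now forms its own group, so the reducer
    // can stream over the values instead of holding them in RAM.
    public class StreamingReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text compositeKey, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] parts = compositeKey.toString().split("\\|", 2);
            String tag = parts[0];
            String originalKey = parts[1];
            long count = 0;
            for (Text v : values) {
                count++;  // stand-in for a streaming aggregation chosen by tag
            }
            context.write(new Text(tag + ":" + originalKey), new Text(Long.toString(count)));
        }
    }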

cabad
  • I'm in the middle of a project and it seems really difficult to change the underlying technology at this point. Is there a way to do this with plain Hadoop MapReduce? – Engin Sözer Oct 10 '13 at 15:47
  • I don't think so. However, if you are using YARN (Hadoop 0.23 or 2.x), then you can easily use Tez, since it works on top of YARN. Hadoop has decoupled MapReduce, so MapReduce is now implemented on top of YARN, as are Tez and other models. – cabad Oct 10 '13 at 15:55
  • I believe the tuple you should emit from the mapper for your first alternative is <<k_analytics, k>, v>. Your mapper writes 3 records instead of one, and you end up with 3x as many groups, but you can tell from the key which analytics type it belongs to and thus what type of reduction algorithm to run on the group (and where to put the output). This can all be done with plain old MapReduce. – Judge Mental Oct 10 '13 at 19:06
  • Yes, that's better than my approach. What I suggested should work too, but the problem is that the original keys for one analytics type would be mingled together, so you need to keep track of them in RAM. Your approach uses less RAM. I'll update my answer. – cabad Oct 10 '13 at 19:12

It might be possible by using a custom partitioner. A custom partitioner redirects the output of the mapper to the appropriate reducer based on the key, so the keys of the mapper output would look like R1*, R2*, R3*. You would need to look into the pros and cons of this approach.
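
For illustration, a minimal sketch of such a partitioner (hypothetical class name; it assumes the mapper prefixes its keys with R1/R2/R3 as described):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each record to a reducer based on the analytics prefix of its key.
    public class AnalyticsPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String k = key.toString();
            if (k.startsWith("R1")) return 0;
            if (k.startsWith("R2")) return 1 % numPartitions;
            return 2 % numPartitions;
        }
    }

It would be registered in the driver with job.setPartitionerClass(AnalyticsPartitioner.class) and, for three analytics, job.setNumReduceTasks(3).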

As mentioned, Tez is one of the alternatives, but it is still in the incubator phase.

Praveen Sripati
  • Some of my analytics already require a composite-key approach, so I'm using it as <(k1,k2), v>. A custom partitioner is already implemented in my case. Is it possible to extend this approach to further combinations? My data is huge, so I do not want to keep everything in RAM and handle it that way. Since I needed the grouping and sorting mechanism, I used a composite key for the Shuffle & Sort stage. – Engin Sözer Oct 11 '13 at 11:18