
Normally, a Hadoop map/reduce job produces a list of key-value pairs that are written to the job's output file (via an OutputFormat class). Rarely are both keys and values useful; usually either the keys or the values contain the required information.

Is there an option (on the client side) to suppress keys in the output file, or to suppress values? If I wanted to do this for just one particular job, I could create a new OutputFormat implementation that ignores keys or values. But I need a generic solution that is reusable across jobs.

EDIT: It might be unclear what I mean by "a generic solution that is reusable across jobs." Let me explain with an example:

Let's say I have many prepared Mapper, Reducer, and OutputFormat classes. I want to combine them into different jobs and run those jobs on different input files to produce various output files. In some cases (for some jobs) I need to suppress keys, so they are not written to the output file. I do not want to change the code of my mappers, reducers, or output formats - there are just too many of them for that. I need a generic solution that does not require changing the code of the given mappers, reducers, or output formats. How do I do that?
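For illustration, the per-job workaround I want to avoid repeating might look like the sketch below: a hypothetical wrapper OutputFormat (the class name is mine, nothing like it ships with Hadoop) that delegates to TextOutputFormat but always passes NullWritable as the key, which TextOutputFormat silently omits from the output line.

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical sketch: wraps TextOutputFormat and discards keys by
// substituting NullWritable, which TextOutputFormat does not print.
public class KeySuppressingOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        final RecordWriter<NullWritable, V> inner =
                new TextOutputFormat<NullWritable, V>().getRecordWriter(context);
        return new RecordWriter<K, V>() {
            @Override
            public void write(K key, V value)
                    throws IOException, InterruptedException {
                inner.write(NullWritable.get(), value); // key is dropped here
            }

            @Override
            public void close(TaskAttemptContext ctx)
                    throws IOException, InterruptedException {
                inner.close(ctx);
            }
        };
    }
}
```

Writing (or configuring) one of these per output format is exactly the duplication I'm trying to avoid.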

Peter O.
Rasto
  • Have you considered passing in additional parameters via the JobConf and then switching the logic based on the values? – Sudarshan Apr 21 '14 at 05:55

1 Answer


There's no reason why the final step in your Hadoop flow can't be configured to write a NullWritable as either the key or the value. You just shouldn't expect that file to be much use in any subsequent map/reduce steps.
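As a concrete sketch of this suggestion (the mapper/reducer class names here are placeholders, not part of the question): declare NullWritable as the job's output value class and have the reducer emit `NullWritable.get()` as the value, so only keys end up in the file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeysOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "keys-only");
        job.setJarByClass(KeysOnlyDriver.class);
        job.setMapperClass(MyMapper.class);     // placeholder mapper
        job.setReducerClass(MyReducer.class);   // placeholder reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class); // suppress values
        // Inside MyReducer: context.write(key, NullWritable.get());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The symmetric case (suppress keys) works the same way with `setOutputKeyClass(NullWritable.class)` and `context.write(NullWritable.get(), value)`.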

Chris Gerken
  • I do not quite understand... I expect that by "final step" you mean the reduce phase. You suggest that my reducer use `NullWritable` as either its key or value type? In that case this is not a reusable solution... Let's say I have a set of prepared mappers, reducers, output formats, etc. I want to run some of them many times on different input files, but I want to suppress keys in some cases without modifying the code of all those mappers and reducers. – Rasto Nov 26 '12 at 22:47
  • 1
    Right, if you want to suppress keys you'll either have to run another mapreduce step to suppress them or use MultipleOutputs in your Reducer to write two versions of the data: one with keys and one which suppresses keys. – Chris Gerken Nov 26 '12 at 22:51
  • I'm afraid we still do not understand each other :) Please see my edit to the original question; I tried to explain better what I want to do. Thank you. – Rasto Nov 26 '12 at 22:59
  • By another map/reduce step, do you mean another job? One that takes the output file of the original job as input, removes the keys, and writes the result to another output file? – Rasto Nov 26 '12 at 23:01
  • 2
    Yes. That's correct. There's no way to magically wipe out a files keys or values. – Chris Gerken Nov 26 '12 at 23:16
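The "extra job" agreed on in the comments could be sketched like this: a map-only job whose mapper reads the previous job's `key<TAB>value` lines and re-emits only the value part with a NullWritable key (which TextOutputFormat then omits). The class name is hypothetical; the tab-splitting assumes TextOutputFormat's default separator.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical key-stripping mapper for a map-only cleanup job.
// Driver setup: job.setMapperClass(StripKeysMapper.class);
//               job.setNumReduceTasks(0);  // map-only, no shuffle needed
public class StripKeysMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Previous job's TextOutputFormat wrote "key<TAB>value";
        // keep only the value part (or the whole line if there is no tab).
        String[] parts = line.toString().split("\t", 2);
        String value = parts.length > 1 ? parts[1] : parts[0];
        context.write(NullWritable.get(), new Text(value));
    }
}
```

Because the mapper never looks at the semantics of the data, this one class is reusable across all the jobs in question.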