I have some files on HDFS in a customized Thrift compact-protocol serialization format as input. By extending Hadoop's FileInputFormat, we can quickly load these files into an RDD.

Now, after applying a groupBy transformation, the output RDD becomes a JavaPairRDD<Long, Iterable<thriftGeneratedClass>>. I would like to save the groupBy results back to HDFS as multiple output files, one per key in the PairRDD. For example, if the PairRDD has two keys, 100 and 200, two files, 100.thrift and 200.thrift, should be generated, each containing the entire iterable list of Thrift objects for that key. The code looks like:

//Feature is some Thrift-generated class
JavaRDD<Feature> featureJavaRDD = jsc.newAPIHadoopFile(inputPath, ThriftInputFormat.class,
        NullWritable.class, Feature.class, jsc.hadoopConfiguration()).values();
JavaPairRDD<Long, Iterable<Feature>> groupByRDD = featureJavaRDD.groupBy(...);
//how to save groupByRDD results back to HDFS with one file per key?

My question is: what's the best way to achieve this? I know the answer probably involves Hadoop's saveAsNewAPIHadoopFile and MultipleOutputs.

WilliamLou

1 Answer


I had a similar use case a few days ago; I resolved it in Java by writing two custom classes implementing MultipleTextOutputFormat and RecordWriter.

My input was a JavaPairRDD<String, List<String>> and I wanted to store each entry in a file named after its key, containing all the lines from its value. (So this is almost the same use case.)

Here is the code for my MultipleTextOutputFormat implementation:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.Progressable;

class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        return key.toString(); //the return value is used as the file name
    }

    /** The following 4 methods are re-declared only for visibility purposes
        (they are called from MyRecordWriter). **/
    protected String generateLeafFileName(String name) {
        return super.generateLeafFileName(name);
    }

    protected V generateActualValue(K key, V value) {
        return super.generateActualValue(key, value);
    }

    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
        return super.getInputFileBasedOutputFileName(job, name);
    }

    protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException {
        return super.getBaseRecordWriter(fs, job, name, arg3);
    }

    /** Use my custom RecordWriter **/
    @Override
    public RecordWriter<K, V> getRecordWriter(final FileSystem fs, final JobConf job, String name, final Progressable arg3) throws IOException {
        final String myName = this.generateLeafFileName(name);
        return new MyRecordWriter<K, V>(this, fs, job, arg3, myName);
    }
}
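For the question's exact output names (100.thrift, 200.thrift), the generateFileNameForKeyValue hook above is the place to add the suffix. A minimal sketch of that variant, assuming Long keys as in the question (the ".thrift" suffix is the only change from my version):

@Override
protected String generateFileNameForKeyValue(K key, V value, String name) {
    //files come out as "<key>.thrift", e.g. 100.thrift and 200.thrift
    return key.toString() + ".thrift";
}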

Here is the code for my RecordWriter implementation.

import java.io.IOException;
import java.util.List;
import java.util.TreeMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

class MyRecordWriter<K, V> implements RecordWriter<K, V> {

    private RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat;
    private final FileSystem fs;
    private final JobConf job;
    private final Progressable arg3;
    private String myName;

    TreeMap<String, RecordWriter<K, V>> recordWriters = new TreeMap<>();

    MyRecordWriter(RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat, FileSystem fs, JobConf job, Progressable arg3, String myName) {
        this.rddMultipleTextOutputFormat = rddMultipleTextOutputFormat;
        this.fs = fs;
        this.job = job;
        this.arg3 = arg3;
        this.myName = myName;
    }

    @Override
    public void write(K key, V value) throws IOException {
        //derive the output file name from the key
        String keyBasedPath = rddMultipleTextOutputFormat.generateFileNameForKeyValue(key, value, myName);
        String finalPath = rddMultipleTextOutputFormat.getInputFileBasedOutputFileName(job, keyBasedPath);
        Object actualValue = rddMultipleTextOutputFormat.generateActualValue(key, value);
        //lazily create one underlying writer per output file
        RecordWriter rw = this.recordWriters.get(finalPath);
        if (rw == null) {
            rw = rddMultipleTextOutputFormat.getBaseRecordWriter(fs, job, finalPath, arg3);
            this.recordWriters.put(finalPath, rw);
        }
        List<String> lines = (List<String>) actualValue;
        for (String line : lines) {
            rw.write(null, line);
        }
    }

    @Override
    public void close(Reporter reporter) throws IOException {
        for (RecordWriter<K, V> rw : this.recordWriters.values()) {
            rw.close(reporter);
        }
        this.recordWriters.clear();
    }
}

Most of the code is exactly the same as in MultipleOutputFormat. The only difference is these few lines:

List<String> lines = (List<String>) actualValue;
for (String line : lines) {
    rw.write(null, line);
}

These lines allow me to write each line of my input List<String> to the file. The first argument of the write function is set to null to avoid writing the key on each line.

Finally, I only need to make this call to write my files:

javaPairRDD.saveAsHadoopFile(path, String.class, List.class, RDDMultipleTextOutputFormat.class);
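To adapt this to the question's JavaPairRDD<Long, Iterable<Feature>>, the values first need to become text lines, because the writer above emits strings. A minimal sketch, assuming libthrift's TSerializer with the compact protocol and Base64 encoding to keep each serialized record line-safe (the Base64 step and the outputPath variable are my assumptions, not part of the original question, and it relies on the ".thrift"-suffixed generateFileNameForKeyValue variant sketched earlier):

import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;

//one Base64-encoded compact-protocol record per line, grouped by key
JavaPairRDD<Long, List<String>> linesByKey = groupByRDD.mapValues(features -> {
    TSerializer serializer = new TSerializer(new TCompactProtocol.Factory());
    List<String> lines = new ArrayList<>();
    for (Feature feature : features) {
        lines.add(Base64.getEncoder().encodeToString(serializer.serialize(feature)));
    }
    return lines;
});
linesByKey.saveAsHadoopFile(outputPath, Long.class, List.class, RDDMultipleTextOutputFormat.class);

Reading the files back would then mean Base64-decoding each line and deserializing it with TDeserializer and the same protocol factory.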
jeanr