Merging two SequenceFiles in Hadoop?

Question

I am trying to implement iterative MapReduce in Hadoop. The result of the first MapReduce job is a MapWritable containing two DoubleArrayWritable values. Part of my first mapper is:

    DoubleWritable[][] Tdata = new DoubleWritable[T.numRows()][T.numColumns()];
    for (int k = 0; k < Tdata.length; k++) {
        for (int j = 0; j < Tdata[k].length; j++) {
            Tdata[k][j] = new DoubleWritable(T.get(k, j));
        }
    }
    DoubleArrayWritable t = new DoubleArrayWritable();
    t.set(Tdata);

    DoubleWritable[][] Hdata = new DoubleWritable[H.numRows()][H.numColumns()];
    for (int k = 0; k < Hdata.length; k++) {
        for (int j = 0; j < Hdata[k].length; j++) {
            Hdata[k][j] = new DoubleWritable(H.get(k, j));
        }
    }
    DoubleArrayWritable h = new DoubleArrayWritable();
    h.set(Hdata);

    mw.put(new IntWritable(0), h);
    mw.put(new IntWritable(1), t);
    context.write(new Text(splitId), mw);

By using an identity reducer, I get the mapper output unchanged as the final output. Now I want to use this output as input to an iterative MapReduce job. The problem is that one global variable is updated in each iteration, and I want to pass it to the mappers of the next iteration along with the output of the first MapReduce job. Here is a code snippet from the driver class:

    for (it = 0; it < 10; it++) { // change the stopping condition
        outPath = new Path(inPath + "_" + it);
        // delete existing directory
        if (hdfs.exists(outPath)) {
            hdfs.delete(outPath, true);
        }
        Job job2 = new Job(conf, "OutputWeightCalc");
        job2.setMapperClass(secMapper.class);
        job2.setMapOutputKeyClass(Text.class);
        job2.setMapOutputValueClass(MapWritable.class);
        job2.setReducerClass(finalReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(MapWritable.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        job2.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job2, inPath);
        FileOutputFormat.setOutputPath(job2, outPath);
        job2.waitForCompletion(true);
        count = job2.getCounters();
        inPath = outPath;
    }

My question is: how can I merge the two outputs into one and pass it as the input path to the mappers of the next iteration? I thought of merging the two SequenceFiles created as output of the MR jobs, but I don't know how to do that. Any help would be appreciated.

Thank you!!

Why do you need to merge the SequenceFiles rather than just sending both into the next iteration as input? - Binary Nerd, Nov 28 '16 at 13:57
@BinaryNerd I thought of merging the files because there is static data, which is the same for the mappers in every iteration, and there is state data, which changes after each iteration. I have to pass both the static and the state data as input to the next iteration's map/reduce job. - Mohini, Nov 29 '16 at 08:15
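
The approach suggested in the comment above (feeding both outputs to the next iteration instead of merging them) mostly comes down to path bookkeeping in the driver. Below is a minimal sketch of that bookkeeping in plain Java, with no Hadoop dependency. The class IterationPaths and the directory names are hypothetical and appear nowhere in the question. In a real driver, each directory returned by inputsFor would presumably be passed to FileInputFormat.addInputPath(job2, new Path(dir)), which may be called more than once on the same job. This assumes both directories hold SequenceFiles with the same key and value classes, and that the mapper can tell static records from state records, for example by key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): tracks which directories
// each iteration reads from and writes to.
class IterationPaths {
    private final String staticDir; // output of the first job; never changes
    private final String stateBase; // prefix for the per-iteration state output

    IterationPaths(String staticDir, String stateBase) {
        this.staticDir = staticDir;
        this.stateBase = stateBase;
    }

    // Directory that iteration `it` writes its updated state to.
    String outputFor(int it) {
        return stateBase + "_" + it;
    }

    // Directories that iteration `it` reads: always the static data,
    // plus the state written by the previous iteration, if there was one.
    List<String> inputsFor(int it) {
        List<String> inputs = new ArrayList<>();
        inputs.add(staticDir);
        if (it > 0) {
            inputs.add(outputFor(it - 1));
        }
        return inputs;
    }

    public static void main(String[] args) {
        IterationPaths paths = new IterationPaths("first_job_out", "state");
        for (int it = 0; it < 3; it++) {
            System.out.println("iteration " + it
                    + " reads " + paths.inputsFor(it)
                    + " and writes " + paths.outputFor(it));
        }
    }
}
```

Unlike the loop in the question, this keeps the static directory as an input for every iteration and only rotates the state directory, so nothing has to be merged on disk.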
