
My MapReduce job has to read records from HBase and write them into zip files. Our client has specifically asked that the reducer output files be .zip files only.

For this I have written a ZipFileOutputFormat wrapper to compress the records and write them into zip files.
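
For reference, here is roughly how I wire the output format into the job; a minimal driver sketch (the class name ExportDriver and the BytesWritable key/value types are placeholders for my actual setup, which reads from HBase via TableMapReduceUtil):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Job;

// Driver sketch: class name and key/value types are placeholders;
// the real job reads from HBase (e.g. via TableMapReduceUtil).
public class ExportDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hbase-to-zip-export");
        job.setJarByClass(ExportDriver.class);

        // Use the custom output format so each reduce task writes a .zip file.
        job.setOutputFormatClass(ZipFileOutputFormat.class);
        job.setOutputKeyClass(BytesWritable.class);
        job.setOutputValueClass(BytesWritable.class);
        ZipFileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}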

We also can't buffer all the lines in memory and then iterate over them, because some files contain 19 GB of records and that would throw a java.lang.OutOfMemoryError.

Everything seems to work, but there is one problem:

A .zip file is getting created for each key. Inside my output directory I can see many output files, one per row key, and I don't know how to combine them into a single zip file.
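
One workaround I can think of is forcing a single reduce task, so that every key passes through one ZipRecordWriter and ends up as an entry in one zip; a sketch of the relevant driver line (using the job object from the driver sketch above; this pushes all 19 GB through one task, so it may be slow):

// Force a single reduce task: all keys then go through one
// ZipRecordWriter, so they become entries of a single .zip file.
// Trade-off: the whole output is written by one task.
job.setNumReduceTasks(1);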

Here is my implementation, ZipFileOutputFormat.java:

import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZipFileOutputFormat<K, V> extends FileOutputFormat<K, V> {

    public static class ZipRecordWriter<K, V> extends RecordWriter<K, V> {

        private final ZipOutputStream zipOut;

        public ZipRecordWriter(FSDataOutputStream fileOut) {
            // Wrap the raw HDFS stream so records are written as zip entries.
            zipOut = new ZipOutputStream(fileOut);
        }
        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            // close() flushes and finishes the archive itself, so no flush()
            // is needed afterwards (flushing a closed stream fails).
            // Note: finish() throws ZipException if no entry was ever written.
            zipOut.closeEntry();
            zipOut.finish();
            zipOut.close();
        }
        @Override
        public void write(K key, V value) throws IOException {
            // Use the key (e.g. the HBase row key) as the entry name.
            String fname;
            if (key instanceof BytesWritable) {
                BytesWritable bk = (BytesWritable) key;
                fname = new String(bk.getBytes(), 0, bk.getLength());
            } else {
                fname = key.toString();
            }
            // closeEntry() is a no-op when no entry is open, so this is
            // safe for the first record too.
            zipOut.closeEntry();
            zipOut.putNextEntry(new ZipEntry(fname));

            if (value instanceof BytesWritable) {
                BytesWritable bv = (BytesWritable) value;
                zipOut.write(bv.getBytes(), 0, bv.getLength());
            } else {
                zipOut.write(value.toString().getBytes());
            }
        }

    }

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();

        // One zip file per reduce task, created in the task's work directory
        // so the output committer can promote it on success.
        Path file = getDefaultWorkFile(job, ".zip");
        FileSystem fs = file.getFileSystem(conf);
        FSDataOutputStream fileOut = fs.create(file);

        return new ZipRecordWriter<K, V>(fileOut);
    }
}
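
To check whether a produced part file is really a valid zip (as suggested in the comments below), it can be streamed back from HDFS; a minimal sketch, where the default path is a hypothetical example:

import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Verification sketch: list the entries of a reducer output zip
// straight from HDFS. The default path is a hypothetical example.
public class ZipCheck {
    public static void main(String[] args) throws Exception {
        Path zipPath = new Path(args.length > 0 ? args[0] : "/user/out/part-r-00000.zip");
        FileSystem fs = zipPath.getFileSystem(new Configuration());
        try (ZipInputStream zin = new ZipInputStream(fs.open(zipPath))) {
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                System.out.println(e.getName());
            }
        }
    }
}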
  • Have you tried using the `file` utility on Linux (or equivalent) to see if you can identify the *actual* file type? Does it think it is a ZIP file? – Stephen C Mar 03 '17 at 11:58
  • From the Linux command line I am able to zip correctly. I have not tried it from HDFS. – Sudarshan kumar Mar 03 '17 at 13:17
  • You're always creating entries with the same name and contents. You're hardly likely to end up with valid names that way. – user207421 Mar 06 '17 at 05:49
  • No, a separate file is getting created for each key. – Sudarshan kumar Mar 06 '17 at 05:54
  • You can read the files using Spark's textFile function, then write those files into HDFS. I tried this with Spark; I am able to merge small .gz files that way. – Vijay_Shinde Mar 08 '17 at 06:45
  • You just write things, so why can't you use a BufferedWriter and flush (https://docs.oracle.com/javase/7/docs/api/java/io/BufferedWriter.html#flush()) at periodic intervals? – Adonis Mar 09 '17 at 12:12
  • @Vijay_Shinde So you are suggesting merging all the zip files into one? If so, will it not take more time? – Sudarshan kumar Mar 11 '17 at 13:19
  • If you want the output in a single ZIP rather than separate zips per key (as you're getting right now), you can feed this output to another reducer where the key is a constant, say "test", and the values are these keys. With a single key, everything goes through one reducer; I guess that would do it. – Ankush Rathi Mar 13 '17 at 11:01

0 Answers