I was writing a MapReduce job in which I had to read the file name as the key and the file contents as the value. For this I posted this question on StackOverflow. It worked fine for text files but started giving problems with gzipped files. So, referring to the LineRecordReader class, I made some modifications to my code. The code snippet is:

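import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {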

    private CompressionCodecFactory compressionCodecs = null;
    private FileSplit fileSplit;
    private Configuration conf;
    private InputStream in;
    private Text key = new Text("");
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {

        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();

        final Path file = fileSplit.getPath();
        compressionCodecs = new CompressionCodecFactory(conf);

        final CompressionCodec codec = compressionCodecs.getCodec(file);
        System.out.println(codec);
        FileSystem fs = file.getFileSystem(conf);
        in = fs.open(file);

        if (codec != null) {
            in = codec.createInputStream(in);
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            Path file = fileSplit.getPath();
            key.set(file.getName());

            // The split length is the on-disk (compressed) size, so read the
            // stream until EOF instead of sizing the buffer from the split.
            java.io.ByteArrayOutputStream buffer = new java.io.ByteArrayOutputStream();
            try {
                IOUtils.copyBytes(in, buffer, conf, false);
                byte[] contents = buffer.toByteArray();
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }

            processed = true;
            return true;
        }

        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // Do nothing
    }

}

The problem is that the codec object is null even though the file is a gzip file. Note that I have appended dates to the end of the file names for my own purposes. I assumed this wouldn't matter, because I had heard that Unix doesn't use extensions to determine file types.

Can someone please tell me what's the problem?

aa8y

1 Answer


CompressionCodecFactory does use the file extension to decide which codec to use: if the file name ends in .gz, the call to getCodec returns the GzipCodec. A name ending in .gz.2012-01-24 no longer ends in a known codec suffix, so getCodec returns null and the file is read as if it were uncompressed. You need to change your file naming convention so that the date comes before the extension.
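
To illustrate the suffix matching and the suggested renaming, here is a minimal, Hadoop-free sketch. The class name, the two-entry suffix table, and both helper methods are hypothetical stand-ins made up for this example; they only mimic the idea of matching on the end of the file name and are not Hadoop's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodecSuffixSketch {

    // Hypothetical suffix table; Hadoop builds its own from the configured codecs.
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".gz", "GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "BZip2Codec");
    }

    // Matches names such as "data.gz.2012-01-24": base name, extension, yyyy-MM-dd date.
    private static final Pattern DATED =
            Pattern.compile("^(.*)(\\.gz|\\.bz2)\\.(\\d{4}-\\d{2}-\\d{2})$");

    /** Returns the codec name whose suffix ends the file name, or null if none does. */
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Moves a trailing date in front of the compression extension; other names pass through. */
    static String moveDateBeforeExtension(String fileName) {
        Matcher m = DATED.matcher(fileName);
        return m.matches() ? m.group(1) + "." + m.group(3) + m.group(2) : fileName;
    }

    public static void main(String[] args) {
        String original = "data.gz.2012-01-24";
        String renamed = moveDateBeforeExtension(original);
        System.out.println(original + " -> " + codecFor(original));
        System.out.println(renamed + " -> " + codecFor(renamed));
    }
}
```

With the date at the end, the lookup finds no codec; once the date is moved in front of the extension, the name ends in .gz again and the lookup succeeds.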

Chris White