
I have a Pig script with some computationally heavy parts that I want to split out and run as optimized MapReduce jobs.

Ideally, the MapReduce jobs would read and write the same data format Pig uses to store intermediate results, so as to avoid needless conversion.

I was thinking of storing the data with the org.apache.pig.builtin.BinStorage store function.

My problem is that I have no idea how to read that format from a MapReduce job.

I tried with this code:

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<NullWritable, BinSedesTuple, Text, IntWritable> {

        public void map(NullWritable key, BinSedesTuple value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            //do something
        }
    }


    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            //do something
        }
    }

    public static void main(String[] args) throws Exception {
        //......
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        //.....
    }
}

But I get this error:

java.io.IOException: hdfs://localhost:54310/user/path/to/part-m-00000 not a SequenceFile
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1517)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:191)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

Does anyone know how to make this work?

Aslan986

1 Answer


Pig doesn't write SequenceFiles by default, so you have a couple of options:

  1. Have your Pig job store its output as SequenceFiles for your MR job to read: Storing data to SequenceFile from Apache Pig
  2. Figure out what file format Pig writes by default and read that (it may be plain text; check with hadoop fs -text /user/path/to/part-m-00000 | head)
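If it turns out the output is Pig's default PigStorage format (tab-delimited text), the MR job can simply switch from SequenceFileInputFormat to TextInputFormat and split each line on tabs in the mapper. As a minimal sketch, assuming tab-delimited PigStorage output (the `PigStorageLine` helper is hypothetical, just the line-parsing step the mapper would perform):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: parses one record of Pig's default PigStorage
// output (one tab-delimited line per tuple) into its fields. A mapper
// reading the data via TextInputFormat would apply this to each value.
public class PigStorageLine {

    // Split on tabs; the -1 limit preserves trailing empty fields,
    // which PigStorage emits for null/empty tuple elements.
    public static List<String> parse(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        // A tuple like (hello, world, 3) stored by PigStorage becomes
        // the line "hello\tworld\t3".
        System.out.println(parse("hello\tworld\t3"));
    }
}
```

In the job setup this would pair with `conf.setInputFormat(TextInputFormat.class)` and a `Mapper<LongWritable, Text, ...>` instead of the SequenceFile-based signature in the question. Note this only applies to PigStorage output; BinStorage is a binary format and would need Pig's own reader classes.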
Chris