
I am having a bit of trouble getting Amazon EMR to accept a custom InputFormat:

public class Main extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new JobConf(), new Main(), args);
        System.exit(res);
    }


    public int run(String[] args) throws Exception {

        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        System.out.println("Input  path: "+inputPath+"\n");
        System.out.println("Output path: "+outputPath+"\n");

        Configuration conf = getConf();
        Job job = new Job(conf, "ProcessDocs");

        job.setJarByClass(Main.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setInputFormatClass(XmlInputFormat.class);

        TextInputFormat.setInputPaths(job, inputPath);
        TextOutputFormat.setOutputPath(job, outputPath);

        job.waitForCompletion(true);

        return 0;
    }   
}

Looking at the log file:

2012-06-04 23:35:20,053 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2012-06-04 23:35:20,054 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 6
2012-06-04 23:35:20,054 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 1
2012-06-04 23:35:20,767 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 1
2012-06-04 23:35:20,813 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2012-06-04 23:35:20,886 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2012-06-04 23:35:20,886 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2012-06-04 23:35:20,906 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2012-06-04 23:35:20,906 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2012-06-04 23:35:22,240 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201206042333_0001

It seems that Hadoop on EMR is falling back to the default InputFormat reader... What am I doing wrong?

Note: I do not get any errors from Hadoop regarding the availability of the XmlInputFormat class.
Note 2: I see <property><name>mapreduce.inputformat.class</name><value>com.xyz.XmlInputFormat</value></property> in the jobs/some_job_id.conf.xml file.

Update:

public class XmlInputFormat extends TextInputFormat {

  public static final String START_TAG_KEY = "xmlinput.start";
  public static final String END_TAG_KEY = "xmlinput.end";

  public RecordReader<LongWritable,Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {

      System.out.println("Creating a new 'XmlRecordReader'");

      return new XmlRecordReader((FileSplit) split, context.getJobConf());
  }

  /*
  @Override
  public RecordReader<LongWritable,Text> getRecordReader(InputSplit inputSplit,
                                                         JobConf jobConf,
                                                         Reporter reporter) throws IOException {
    return new XmlRecordReader((FileSplit) inputSplit, jobConf);
  }
  */

  /**
   * XMLRecordReader class to read through a given xml document to output xml
   * blocks as records as specified by the start tag and end tag
   * 
   */
  public static class XmlRecordReader implements RecordReader<LongWritable,Text> {
    private final byte[] startTag;
    private final byte[] endTag;
    private final long start;
    private final long end;
    private final FSDataInputStream fsin;
    private final DataOutputBuffer buffer = new DataOutputBuffer();

    public XmlRecordReader(FileSplit split, JobConf jobConf) throws IOException {
      startTag = jobConf.get(START_TAG_KEY).getBytes("utf-8");
      endTag = jobConf.get(END_TAG_KEY).getBytes("utf-8");

      System.out.println("XmlInputFormat: Start Tag: "+startTag);
      System.out.println("XmlInputFormat: End Tag  : "+endTag);

      // open the file and seek to the start of the split
      start = split.getStart();
      end = start + split.getLength();
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(jobConf);
      fsin = fs.open(split.getPath());
      fsin.seek(start);
    }
    ...
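
For reference, a rough sketch of what this hook looks like when written purely against the new org.apache.hadoop.mapreduce API; everything beyond the class name and tag keys is illustrative, and the reader body is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch only: a new-API XmlInputFormat obtains its configuration from the
// TaskAttemptContext instead of a JobConf, and returns a reader that extends
// org.apache.hadoop.mapreduce.RecordReader.
public class XmlInputFormatSketch extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        Configuration conf = context.getConfiguration();
        String startTag = conf.get("xmlinput.start");
        String endTag = conf.get("xmlinput.end");
        // The actual reader (seeking the split, buffering between tags, etc.)
        // is omitted here; it would live in a class extending
        // RecordReader<LongWritable, Text> and do its setup in
        // initialize(InputSplit, TaskAttemptContext).
        throw new UnsupportedOperationException(
                "reader omitted in this sketch; tags: " + startTag + " .. " + endTag);
    }
}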
jldupont

2 Answers


If XmlInputFormat is not part of the same jar that contains main(), you'll probably need to either build it into a subfolder called "lib" inside your main jar, or create a bootstrap action that copies the extra jar containing XmlInputFormat from S3 into the magic folder /home/hadoop/lib, which is on the Hadoop classpath by default on EMR.
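
Another route that can help while debugging classpath issues is the distributed cache. A minimal sketch, assuming the extra jar has already been copied to a filesystem the cluster can read (the path and class names here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: ship an extra jar to the task classpath via the
// distributed cache instead of repackaging it or using a bootstrap action.
public class AddJarToClasspathSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The jar must already live on a filesystem visible to the cluster;
        // "hdfs:///user/hadoop/lib/xml-input-format.jar" is an assumed path.
        DistributedCache.addFileToClassPath(
                new Path("hdfs:///user/hadoop/lib/xml-input-format.jar"), conf);
        // ...then build the Job from this conf, as in the run() method above.
    }
}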

It is certainly not assuming FileInputFormat, which is abstract.

Based on your edits, I think the premise of your question is wrong. I suspect that the input format was indeed found and used. System.out.println from a task attempt will not end up in the syslog of the job, although it might appear in the stdout digest.
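
If the goal is just to confirm that the reader is being constructed, a commons-logging statement is easier to locate than System.out, because it ends up in the task attempt's syslog. A minimal sketch (the class name and message are made up):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hypothetical sketch: replace the System.out.println calls in the record reader
// with a commons-logging logger; on a task node the output goes to the
// attempt's syslog rather than the step-level logs.
public class RecordReaderLoggingSketch {

    private static final Log LOG = LogFactory.getLog(RecordReaderLoggingSketch.class);

    public static void main(String[] args) {
        // In the real code this call would sit in the XmlRecordReader constructor.
        LOG.info("Creating a new 'XmlRecordReader'");
    }
}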

Judge Mental
  • It is not the case here: I have included `XmlInputFormat` in my jar. The trouble is: the class doesn't seem to be used, because I have included some `System.out.println` calls in the constructor and I get nothing in the log file. – jldupont Jun 05 '12 at 00:52
  • I would expect to see `ClassNotFoundException`s if it were indeed not finding the class. Which log file are you looking at? What you've posted appears to be the JobTracker log. – Judge Mental Jun 05 '12 at 00:56
  • It is: /job_id/steps/2/syslog – jldupont Jun 05 '12 at 00:58
  • You won't find stdout from an individual task in that log. You need to look at an attempt stdout log. You might also find that output digested in the step's stdout (job_id/steps/2/stdout). – Judge Mental Jun 05 '12 at 00:59
  • There is nothing in the `stdout` files :( – jldupont Jun 05 '12 at 01:05

Here is another simple way I found to run a custom jar on EMR or Hadoop: http://www.applams.com/2014/05/using-custom-streaming-jar-using-custom.html

hbr