How do I convert my Java Hadoop code to run on EC2?

Question

I wrote a Driver, Mapper, and Reducer class in Java that runs the k-nearest neighbor algorithm on test data, and pulls in the training set using Distributed Cache. I used a Cloudera virtual machine to test the code, and it works in pseudo-distributed mode.

I'm trying to get through Amazon's EC2/EMR documentation. It seems like there should be an easy way to convert working Java Hadoop code into something that will run on EC2, but I'm seeing a whole bunch of custom Amazon import statements and methods that I've never seen before.

Here's my driver code as an example:

        import java.net.URI;

        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class KNNDriverEC2 extends Configured implements Tool {
            public int run(String[] args) throws Exception {

                // Reuse the Configuration that ToolRunner populated (generic options such as -D are kept)
                Configuration conf = getConf();

                conf.setInt("rows",1000);
                conf.setInt("columns",613);


                DistributedCache.createSymlink(conf);
                // might have to start next line with ./!!!
                DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_sample.csv#train_sample.csv"),conf);
                DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_labels.csv#train_labels.csv"),conf);
                //DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv"),conf);
                //DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv"),conf);


                Job job = new Job(conf);
                job.setJarByClass(KNNDriverEC2.class); 
                job.setJobName("KNN");

                FileInputFormat.setInputPaths(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                job.setMapperClass(KNNMapperEC2.class);
                job.setReducerClass(KNNReducerEC2.class);
                // job.setInputFormatClass(KeyValueTextInputFormat.class);

                job.setMapOutputKeyClass(IntWritable.class);
                job.setMapOutputValueClass(IntWritable.class);

                job.setOutputKeyClass(IntWritable.class);
                job.setOutputValueClass(IntWritable.class);

                boolean success = job.waitForCompletion(true);
                return success ? 0 : 1;
            }

            public static void main(String[] args) throws Exception {
                int exitCode = ToolRunner.run(new Configuration(), new KNNDriverEC2(), args);
                System.exit(exitCode);
            }
        }
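
A side note on the two cache URIs in the driver: they are relative (no scheme), so Hadoop resolves them against the job's default filesystem, and the part after `#` becomes the symlink name in the task's working directory. The sketch below uses only `java.net.URI` to show how such a string splits apart. The `s3://example-bucket/...` URI is a hypothetical illustration of a fully qualified location, not a path taken from the question.

```java
import java.net.URI;

public class CacheUriSketch {

    /** Returns the URI scheme, or null when the URI is relative (default filesystem applies). */
    static String scheme(String raw) {
        return URI.create(raw).getScheme();
    }

    /** Returns the path portion, without the #fragment. */
    static String path(String raw) {
        return URI.create(raw).getPath();
    }

    /** Returns the fragment, which the distributed cache uses as the symlink name. */
    static String symlinkName(String raw) {
        return URI.create(raw).getFragment();
    }

    public static void main(String[] args) {
        // URI as written in the driver: relative, so no scheme.
        String relative = "knn-jg/cache_data/train_sample.csv#train_sample.csv";
        // Hypothetical fully qualified form; the bucket name is an assumption.
        String qualified = "s3://example-bucket/knn-jg/cache_data/train_sample.csv#train_sample.csv";

        for (String raw : new String[] {relative, qualified}) {
            System.out.println("scheme=" + scheme(raw)
                    + " path=" + path(raw)
                    + " symlink=" + symlinkName(raw));
        }
    }
}
```

If the training files end up somewhere other than the cluster's default filesystem, the scheme-less form would point at the wrong place, which is one thing worth checking when moving from the pseudo-distributed VM.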

I've gotten my instance running, but an exception is thrown at the line "FileInputFormat.setInputPaths(job, new Path(args[0]));". I'm going to try to work through the documentation on handling arguments, but I've run into so many errors so far I'm wondering if I'm far from making this work. Any help appreciated.
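
The question doesn't say which exception is thrown. If it is an `ArrayIndexOutOfBoundsException`, the input and output paths are probably not reaching the driver as `args[0]` and `args[1]`. A minimal sketch of a guard that could run at the top of `run()` is shown below in plain Java, with no Hadoop dependency; the class name `DriverArgsSketch` and the usage text are made up for illustration.

```java
public class DriverArgsSketch {

    /**
     * Returns null when at least two arguments (input path, output path) are present,
     * otherwise a usage message describing what was received.
     */
    static String validate(String[] args) {
        int count = (args == null) ? 0 : args.length;
        if (count < 2) {
            return "Usage: KNNDriverEC2 <input path> <output path> (got "
                    + count + " argument(s))";
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            // Fail with a readable message instead of an index error further down.
            System.err.println(problem);
            System.exit(2);
        }
        System.out.println("input=" + args[0] + " output=" + args[1]);
    }
}
```

In the driver, the same check would return a non-zero exit code from `run()` before `FileInputFormat.setInputPaths` is called, which makes a missing argument obvious in the job logs.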

I'm not sure you need to change your MapReduce code for it to run on Amazon EC2. Check the Hadoop version on your EC2 instance to make sure its API matches your development environment. If you still run into problems, you could package the jar with its dependencies, copy it over to the EC2 instance, and run it there. – Chaos, Mar 04 '14 at 01:40
With EMR you don't need to do anything new; you can use the same jar. Check the second half of this article: http://www.bigdataspeak.com/2012/12/emr-streaming-job-using-java-code-for_27.html – Amar, Mar 20 '14 at 10:35
