
I am trying to migrate the following Hadoop job to Spark.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceJob {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(Mapper.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The base Mapper is the identity mapper: each (offset, line)
        // pair from the text input passes through unchanged.
        job.setMapperClass(Mapper.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));

        job.submit();
    }
}

This is my Spark solution so far:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class text2seq2 {

    public static void main(String[] args) throws Exception {

        SparkConf sparkConf = new SparkConf();
        sparkConf.setMaster("local").setAppName("txt2seq").set("spark.executor.memory", "1g");

        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input");
        infile.saveAsHadoopFile("output", LongWritable.class, String.class, SequenceFileOutputFormat.class);
        ctx.stop();
    }
}

But I get this error:

java.io.IOException: Could not find a serializer for the Value class: 'java.lang.String'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization.
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1187)

Anyone know what this means?

  • This tells us nothing about what the actual task is. That is all in the `Mapper.class`. – Mike Park Dec 09 '14 at 17:00
  • I am using the default Mapper.class to convert a text file to a sequence file. So I am wondering how to get the same identity Mapper in Spark. Thanks! – Edamame Dec 09 '14 at 17:23
  • possible duplicate of [Convert a text file to sequence format in Spark Java](http://stackoverflow.com/questions/27353462/convert-a-text-file-to-sequence-format-in-spark-java) – Christian Strempfer Dec 09 '14 at 17:33
  • Hey, this is the same question you asked yesterday. Duplicates are discouraged. – Christian Strempfer Dec 09 '14 at 17:33
  • Christian, they are similar, but I tried a different approach and different configuration settings in each question. – Edamame Dec 09 '14 at 17:45

1 Answer


I figured it out myself, as shown below. The problem was that java.lang.String is not a Hadoop Writable, so the SequenceFile writer has no serializer for it; wrapping each line in Text (with a LongWritable key) gives it types it knows how to serialize. Hope it's helpful to others.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public final class text2seq {

    public static void main(String[] args) throws Exception {

        SparkConf sparkConf = new SparkConf();
        sparkConf.setMaster("local").setAppName("txt2seq").set("spark.executor.memory", "1g");

        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> infile = ctx.textFile("input");

        // Wrap each line in Hadoop Writable types so the SequenceFile writer
        // can serialize them. Note that new LongWritable() defaults to 0, so
        // every record gets key 0.
        JavaPairRDD<LongWritable, Text> pair = infile.mapToPair(new PairFunction<String, LongWritable, Text>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<LongWritable, Text> call(String s) {
                return new Tuple2<LongWritable, Text>(new LongWritable(), new Text(s));
            }
        });
        pair.saveAsHadoopFile("output", LongWritable.class, Text.class,
                SequenceFileOutputFormat.class);
        ctx.stop();
    }
}
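
If you want each record's key to be meaningful instead of always 0, here is a minimal variant (an untested sketch, assuming Java 8 lambdas; the class name text2seqIndexed and the output-indexed path are just placeholders I made up) that uses zipWithIndex() to key each line by its position, then reads the sequence file back to check it round-trips:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public final class text2seqIndexed {

    public static void main(String[] args) throws Exception {

        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("txt2seqIndexed");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        // zipWithIndex() pairs each line with its 0-based position in the RDD,
        // giving distinct LongWritable keys instead of a constant 0.
        JavaPairRDD<LongWritable, Text> pairs = ctx.textFile("input")
                .zipWithIndex()
                .mapToPair(t -> new Tuple2<>(new LongWritable(t._2()), new Text(t._1())));

        pairs.saveAsHadoopFile("output-indexed", LongWritable.class, Text.class,
                SequenceFileOutputFormat.class);

        // Read the sequence file back to verify the output.
        JavaPairRDD<LongWritable, Text> check =
                ctx.sequenceFile("output-indexed", LongWritable.class, Text.class);
        check.take(5).forEach(System.out::println);

        ctx.stop();
    }
}

Note the indices from zipWithIndex() are positions within the RDD, not the byte offsets a Hadoop TextInputFormat would produce, so the keys won't match the original job's exactly.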