I am trying to migrate the following Hadoop job to Spark.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceJob {
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        // Identity job: plain-text input copied out as a SequenceFile.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(Mapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Mapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        job.submit();
    }
}
This is my Spark solution so far:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class text2seq2 {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setMaster("local").setAppName("txt2seq").set("spark.executor.memory", "1g");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        // (fileName, fileContent) pairs, both plain Java Strings
        JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input");
        infile.saveAsHadoopFile("output", LongWritable.class, String.class, SequenceFileOutputFormat.class);
        ctx.stop();
    }
}
But I get this error:
java.io.IOException: Could not find a serializer for the Value class: 'java.lang.String'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization.
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1187)
Anyone know what this means?
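My guess is that SequenceFileOutputFormat can only write Hadoop Writable types, while wholeTextFiles hands back plain java.lang.String pairs, so the Strings would need to be wrapped in Text (or some other Writable) before saving. Below is a rough, untested sketch of the conversion I have in mind; the Text/Text key-value choice and the writableRdd name are just my own assumptions, not something I have working:

// Extra imports this would need: org.apache.hadoop.io.Text, scala.Tuple2,
// org.apache.spark.api.java.function.PairFunction
// Wrap each (fileName, fileContent) String pair from wholeTextFiles in Hadoop Text
// objects so that SequenceFileOutputFormat has Writable key/value types to serialize.
JavaPairRDD<Text, Text> writableRdd = infile.mapToPair(
        new PairFunction<Tuple2<String, String>, Text, Text>() {
            @Override
            public Tuple2<Text, Text> call(Tuple2<String, String> record) {
                return new Tuple2<Text, Text>(new Text(record._1()), new Text(record._2()));
            }
        });
writableRdd.saveAsHadoopFile("output", Text.class, Text.class, SequenceFileOutputFormat.class);

Is wrapping the values in Writables like this the right direction, or is there a configuration setting I am missing?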