I am trying to load a group of files, run some checks on them, and later save them to HDFS. I haven't found a good way to create and save these SequenceFiles, though. Here is my loader's main function:
SparkConf sparkConf = new SparkConf().setAppName("writingHDFS")
        .setMaster("local[2]")
        .set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
//JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5*1000));
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
JavaPairRDD<String, String> imageRDD = jsc.wholeTextFiles("file:///home/cloudera/Pictures/");
imageRDD.mapToPair(new PairFunction<Tuple2<String,String>, Text, Text>() {
    @Override
    public Tuple2<Text, Text> call(Tuple2<String, String> arg0)
            throws Exception {
        return new Tuple2<Text, Text>(new Text(arg0._1),new Text(arg0._2));
    }
}).saveAsNewAPIHadoopFile("hdfs://localhost:8020/user/hdfs/sparkling/try.seq", Text.class, Text.class, SequenceFileOutputFormat.class);
It simply loads some images as text files, uses the name of each file as the key of the PairRDD, and saves the result with the native saveAsNewAPIHadoopFile.
I would now like to save file by file inside rdd.foreach or rdd.foreachPartition, but I cannot find a proper method:
- This Stack Overflow answer creates a Job for the occasion. It seems to work, but it needs the file passed in as a path, while I already have an RDD made of the files.
- A couple of solutions I found create a directory for each file (OutputStream out = fs.create(new Path(dst));), which wouldn't be much of a problem if it weren't for the fact that I get a "Mkdirs didn't work" exception.
EDIT: I may have found a way, but I get a Task not serializable exception:
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
imageByteRDD.foreach(new VoidFunction<Tuple2<String,PortableDataStream>>() {
    @Override
    public void call(Tuple2<String, PortableDataStream> fileTuple) throws Exception {
        Text key = new Text(fileTuple._1());
        BytesWritable value = new BytesWritable( fileTuple._2().toArray());
        SequenceFile.Writer writer = SequenceFile.createWriter(serializableConfiguration.getConf(), SequenceFile.Writer.file(new Path("/user/hdfs/sparkling/" + key)),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
                SequenceFile.Writer.keyClass(Text.class), SequenceFile.Writer.valueClass(BytesWritable.class));
        key = new Text("MiaoMiao!");
        writer.append(key, value);
        IOUtils.closeStream(writer);
    }
});
I have tried wrapping the entire function in a Serializable class, but no luck. Help?
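To narrow down where the Task not serializable exception might come from, here is a minimal sketch that reproduces the same kind of failure with plain Java serialization and no Spark or Hadoop at all. Everything in it is hypothetical and only for illustration: Task stands in for a Spark function interface such as VoidFunction, FakeConf for a non-serializable helper such as a Hadoop Configuration, and Driver for the class that contains the loader code. It assumes the anonymous function is created inside an instance method, which may or may not match my real setup.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureSerializationDemo {

    // Hypothetical stand-in for a Spark function interface such as VoidFunction;
    // like the real one, it extends Serializable.
    interface Task extends Serializable {
        void call(String fileName);
    }

    // Hypothetical stand-in for a non-serializable helper (e.g. a Hadoop Configuration).
    static class FakeConf {
        String get() { return "conf"; }
    }

    // Plays the role of the driver class: it is NOT Serializable and owns the helper.
    static class Driver {
        private final FakeConf conf = new FakeConf();

        // Anonymous inner class created in an instance method: it reads a field of
        // the enclosing Driver, so it holds a hidden reference to that Driver and
        // serializing the task tries to serialize the Driver as well.
        Task anonymousTask() {
            return new Task() {
                @Override
                public void call(String fileName) {
                    System.out.println(conf.get() + " " + fileName);
                }
            };
        }

        // Static nested class: no hidden reference to any Driver instance.
        static Task staticTask() {
            return new StaticTask();
        }
    }

    static class StaticTask implements Task {
        @Override
        public void call(String fileName) {
            // The helper is built inside call(), so it never has to be serialized.
            FakeConf conf = new FakeConf();
            System.out.println(conf.get() + " " + fileName);
        }
    }

    // Returns true if Java serialization can write the object, false if it hits
    // a NotSerializableException somewhere in the object graph.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(new Driver().anonymousTask())); // false
        System.out.println(isSerializable(Driver.staticTask()));          // true
    }
}
```

If the same thing is happening in my Spark code, the anonymous VoidFunction would be dragging along its enclosing class (and whatever serializableConfiguration refers to), and moving the function into a static nested class that builds its Configuration inside call() might be one way around it. I have not confirmed this against the real job.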