I have a Spark Streaming environment with Spark 1.2.0 where I retrieve data from a local folder, and every time a new file is added to the folder I perform some transformation.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)
In order to perform my analysis on the DStream data I have to transform it into an Array:
import scala.collection.mutable.ArrayBuffer

var arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  // collect each batch into a driver-side buffer
  arr ++= rdd.collect()
}
Then I use the data obtained to extract the information I want and save it to HDFS.
val myRDD = sc.parallelize(arr)
myRDD.saveAsTextFile("hdfs directory....")
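To make the ordering explicit, here is the whole driver condensed into one self-contained sketch (the object name and the paths are placeholders); running it reproduces the behaviour described below, i.e. empty part files:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.ArrayBuffer

object StreamToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamToHdfs")
    val sc   = new SparkContext(conf)
    val ssc  = new StreamingContext(sc, Seconds(10))

    // watch the local folder for new files
    val data = ssc.textFileStream("file:///path/to/local/folder")

    // collect each batch into a driver-side buffer
    var arr = new ArrayBuffer[String]()
    data.foreachRDD { rdd =>
      arr ++= rdd.collect()
    }

    // ... transformations on arr ...

    // save the buffered data back to HDFS
    val myRDD = sc.parallelize(arr)
    myRDD.saveAsTextFile("hdfs:///path/to/output")

    ssc.start()
    ssc.awaitTermination()
  }
}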
Since I really need to manipulate the data as an Array, I cannot save it to HDFS with DStream.saveAsTextFiles("...") (which would work fine); I have to save the RDD instead. With this procedure, however, I end up with empty output files named part-00000 etc.
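For reference, the direct DStream write I am referring to would be something along these lines (prefix and suffix are placeholders); it writes one HDFS directory per batch interval, but leaves no room for my Array-based processing step:

// write each batch directly, one output directory per batch interval
data.saveAsTextFiles("hdfs:///path/to/output/prefix", "txt")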
With arr.foreach(println) I can see the correct results of the transformations.
My suspicion is that Spark tries to write the data to the same files at every batch, deleting what was previously written. I tried saving into a dynamically named folder, like myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString()), but only one folder is ever created and the output files are still empty.
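Concretely, that attempt just replaces the fixed output path in the code above ("folder" being a placeholder prefix), roughly:

val myRDD = sc.parallelize(arr)
// intended: one output folder per run, named with the current timestamp
myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString())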
How can I write an RDD to HDFS in a Spark Streaming context?