7

I have a Spark Streaming environment with Spark 1.2.0 where I retrieve data from a local folder, and every time I find a new file added to the folder I perform some transformations.

val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)

In order to perform my analysis on the DStream data, I have to transform it into an Array:

import scala.collection.mutable.ArrayBuffer

var arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  arr ++= rdd.collect()
}

Then I use the data I obtained to extract the information I want and save it to HDFS.

val myRDD  = sc.parallelize(arr)
myRDD.saveAsTextFile("hdfs directory....")

Since I really need to manipulate the data as an Array, I can't save it to HDFS with DStream.saveAsTextFiles("...") (which would work fine); I have to save the RDD instead, but with this procedure I end up with empty output files named part-00000 and so on.

With arr.foreach(println) I am able to see the correct results of the transformations.

My suspicion is that Spark tries to write the data to the same files at every batch, deleting what was previously written. I tried saving into a dynamically named folder, e.g. myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString()), but only one folder is ever created and the output files are still empty.

How can I write an RDD to HDFS in a Spark Streaming context?

drstein
  • I guess the problem is that your arr is not available on all workers. Did you try to broadcast your arr and then finally write it into hdfs? – Hafiz Mujadid Jul 02 '15 at 11:28
  • Because I need to monitor a folder and intercept every new file uploaded, and Spark Streaming sounds like a good solution. It's not a single machine but a 2-machine cluster. Right now I'm just writing files as text, but in the future I will have to write Parquet files, and that's pretty straightforward with Spark. – drstein Jul 02 '15 at 11:33
  • Will you try this? var arr = new ArrayBuffer[String](); val broadcasted = sc.broadcast(arr) data.foreachRDD { broadcasted ++= _.collect() } val myRDD = sc.parallelize(broadcasted ) myRDD.saveAsTextFile("hdfs directory....") – Hafiz Mujadid Jul 02 '15 at 11:53
  • Thanks, I tried but with no results. I guess I have to use just the DStream. – drstein Jul 02 '15 at 12:22

2 Answers

7

You are using Spark Streaming in a way it wasn't designed to be used. I'd recommend either dropping Spark for your use case, or adapting your code so it works the Spark way. Collecting the array to the driver defeats the purpose of using a distributed engine and makes your app effectively single-machine (two machines will also cause more overhead than just processing the data on a single machine).

Everything you can do with an array, you can do with Spark. So just run your computations inside the stream, distributed on the workers, and write your output using DStream.saveAsTextFiles(). You can use foreachRDD + saveAsParquet(path, overwrite = true) to write to a single file.
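
As a minimal sketch of that approach (reusing ssc and data from the question; extractInfo is a hypothetical placeholder for whatever per-record analysis you currently run on the array):

// Run the analysis on the DStream itself: the transformations execute
// on the workers for every batch, with no driver-side collect.
val results = data
  .map(line => extractInfo(line)) // hypothetical per-record transformation
  .filter(_.nonEmpty)

// Writes one output folder per batch interval, named <prefix>-<batch time>
results.saveAsTextFiles("hdfs:///output/prefix")

ssc.start()
ssc.awaitTermination()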

Marius Soutier
  • Thanks, I totally get your point, I will try to change the transform logic in order to use the DStream. Do you know if it's possible for Spark Streaming to save records into the same file at every batch? Right now I get a new folder with new files every batch interval. – drstein Jul 02 '15 at 12:21
  • 1
    Yes, with foreachRDD + saveAsParquet there's an option to overwrite. – Marius Soutier Jul 02 '15 at 12:28
1

@vzamboni: the Spark 1.5+ DataFrames API has this feature:

dataframe.write().mode(SaveMode.Append).format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);
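
To tie that back to the streaming job from the question, here is a rough Scala sketch of appending each batch to the same partitioned output, assuming Spark 1.5+, comma-separated input lines, a hypothetical Record case class, and an example output path:

import org.apache.spark.sql.{SQLContext, SaveMode}

// Hypothetical record type; adapt the fields to your actual data.
case class Record(parameter1: String, parameter2: String, value: String)

data.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Parse each line into a Record (assumes comma-separated fields).
  val df = rdd.flatMap { line =>
    line.split(",") match {
      case Array(p1, p2, v) => Some(Record(p1, p2, v))
      case _                => None
    }
  }.toDF()

  // Append every batch to the same partitioned location instead of
  // creating a new folder per batch interval.
  df.write
    .mode(SaveMode.Append)
    .format("parquet")
    .partitionBy("parameter1", "parameter2")
    .save("hdfs:///output/records")
}
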
Ram Ghadiyaram