
I want Spark to continuously monitor a directory and read the CSV files using spark.readStream as soon as a file appears in that directory.

Please don't include a solution based on Spark Streaming (DStreams). I am looking for a way to do it using Spark Structured Streaming.

zero323
Naman Agarwal

2 Answers


Here is the complete solution for this use case:

If you are running in standalone mode, you can increase the driver memory with:

bin/spark-shell --driver-memory 4G

There is no need to set the executor memory, since in standalone mode the executor runs within the driver.

Completing @T.Gaweda's answer, the full solution is below:

import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)           // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")

csvDF.writeStream.format("console").option("truncate", "false").start()

Now Spark will continuously monitor the specified directory, and as soon as you add a CSV file to it, the streaming query defined on csvDF will be executed on that file.
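The call above starts the query and returns immediately; a minimal sketch of blocking until the query stops, with a checkpoint location added so that progress over already-processed files survives a restart (the checkpoint path is a hypothetical placeholder):

// Start the query with a checkpoint location and block until it is stopped or fails.
val query = csvDF.writeStream
  .format("console")
  .option("truncate", "false")
  .option("checkpointLocation", "/path/to/checkpoint")   // hypothetical path
  .start()

query.awaitTermination()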

Note: if you want Spark to infer the schema, you first have to set the following configuration:

spark.sqlContext.setConf("spark.sql.streaming.schemaInference", "true")

where spark is your SparkSession.
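With that flag enabled, a sketch of reading the same directory without declaring the schema up front (the directory path is a placeholder and the header option is only an example):

// Schema inference for file streams is disabled by default.
spark.conf.set("spark.sql.streaming.schemaInference", "true")

val inferredDF = spark
  .readStream
  .option("sep", ";")
  .option("header", "true")     // take column names from a header row, if the files have one
  .csv("/path/to/directory")    // placeholder path

inferredDF.printSchema()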

Naman Agarwal
  • this works; however, regarding the implementation, how did spark figure out the new files? does it list all files recursively and compare to a previously memorized state? if so, this will be very inefficient for a huge file hierarchy, right? – linehrr Apr 17 '19 at 19:54
  • after some source code reading, spark does read the whole folder (if no metadata is present), and then compares against the previous status; refer to the source class `FileStreamSource.scala`. – linehrr Apr 17 '19 at 21:14
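Following up on the comments above: the file source keeps track of the files it has seen and compares each new directory listing against that state. A sketch of two documented file-source options that can bound how much work a single micro-batch does (the values are arbitrary examples):

val throttledDF = spark
  .readStream
  .schema(userSchema)
  .option("sep", ";")
  .option("maxFilesPerTrigger", "100")   // cap the number of new files picked up per micro-batch
  .option("latestFirst", "true")         // with a large backlog, process the newest files first
  .csv("/path/to/directory")             // placeholder path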

As written in the official documentation, you should use the "file" source:

File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.

Code example taken from the documentation:

// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")

If you don't specify a trigger, Spark will read new files as soon as possible.
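If you do want to control how often the directory is checked, a minimal sketch of adding a processing-time trigger on the write side (the interval is an arbitrary example):

import org.apache.spark.sql.streaming.Trigger

// Check for new files and run a micro-batch at most every 10 seconds
// instead of as fast as possible.
csvDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()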

T. Gawęda
  • I have a question: will this work for reading Avro files as well? Does this support Google Cloud Storage too, i.e. I want to similarly process new files arriving in my GCS bucket? Is this method fault tolerant, i.e. if the pipeline fails, how do I recover, and how do I know which files were processed and which are new? – user179156 Jan 05 '18 at 09:52
  • in the case of text/json, if my streaming pipeline fails, how does the new streaming pipeline know where to start consuming files from? – user179156 Jan 05 '18 at 09:54
  • how to tell spark that a file is still being written, and to wait until the write operation is completed? – dev Aug 09 '19 at 05:51
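Regarding the last comments: Structured Streaming records processed files in the checkpoint location (see the checkpointLocation option), which is how a restarted query knows where to resume, and the file source expects files to appear atomically in the monitored directory. A common pattern is to write each file to a staging location first and then rename it into the watched directory; a minimal sketch using the Hadoop FileSystem API, with all paths as placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

// Write the file outside the monitored directory first, then rename it into place.
// On HDFS rename is atomic, so the stream only ever sees complete files.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(
  new Path("/staging/data-0001.csv"),           // placeholder: file fully written here first
  new Path("/path/to/directory/data-0001.csv")  // placeholder: directory monitored by readStream
)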