
There is a data lake of CSV files that's updated throughout the day. I'm trying to create a Spark Structured Streaming job with the Trigger.Once feature outlined in this blog post to periodically write the new data that's been added to the CSV data lake into a Parquet data lake.

Here's what I have:

val df = spark
  .readStream
  .schema(s)
  .csv("s3a://csv-data-lake-files")

The following command wrote all the data to the Parquet lake, but didn't stop after all the data was written (I had to manually cancel the job).

processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

The following job also worked, but didn't stop after all the data was written either (I had to manually cancel the job):

val query = processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

query.awaitTermination()

The following command stopped the query before any data got written.

val query = processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

query.stop()

How can I configure the writeStream query to wait until all the incremental data has been written to Parquet files and then stop?

Powers
  • What do you mean by "didn't stop"? This is a *streaming* job, it isn't supposed to stop, just be *triggered once a day*. – Yuval Itzchakov Aug 16 '17 at 07:34
  • @YuvalItzchakov - I would like to spin up a cluster, write the new data in the CSV lake to the Parquet lake, and then shut down the cluster. I was assuming the writeStream process would stop. In the Databricks blog post (https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html), the "Scheduling Runs with Databricks" section has an image that shows jobs with a set duration and a status of succeeded. If the writeStream job keeps running, then the cluster won't shut down. I think I must be missing something. – Powers Aug 16 '17 at 09:57
  • You're not missing anything. I looked at the code and it does seem that the query should terminate after executing the single job. – Yuval Itzchakov Aug 16 '17 at 13:34
  • I cannot reproduce it. Could you create a reproducer using the local file system? – zsxwing Aug 17 '17 at 20:36
  • @Powers I am also facing the same issue. Ideally, the streaming query should stop after the job is done, but it doesn't. In fact, it keeps running and also keeps the connections open. – himanshuIIITian Aug 19 '17 at 18:47
  • @Powers To "spin up a cluster, write the new data in the CSV lake to the Parquet lake, and then shut down the cluster", I believe you should be using Databricks Delta or Databricks Notebooks to get this kind of feature. Databricks' Jobs scheduler takes care of these kinds of things. – Achilleus Oct 31 '17 at 20:28
  • Are you still having issues with this? How many CSV files do you have in `s3a://csv-data-lake-files`? – Silvio Jan 05 '18 at 01:35
  • Yep @Silvio, still having the issue. The data lake has tens of thousands of CSV files. It seems like Spark works really poorly with CSV files, so I think the easiest work-around is to just figure out how to avoid using CSV in the first place. – Powers Jan 05 '18 at 01:40
  • You're saying it's still running after all the data is processed; how do you verify that? If you run `query.isActive`, is it `true`? What does `query.lastProgress` show? – Silvio Jan 05 '18 at 01:53
  • @Powers Are you still having this issue? It should stop after the incremental load is done with the Trigger.Once feature. I have the same issue. It works while consuming data from Kafka topics as described in the Databricks documentation, but when moving data from one storage location to another it does not stop after it has read and stored the data; the job keeps running. Did you find a solution? Please let me know. – R Pidugu Dec 28 '22 at 09:49

3 Answers


I got Structured Streaming + Trigger.Once to work properly on a Parquet data lake.

I don't think it was working with the CSV data lake because it had a ton of small files in nested directories. Spark does not like working with small CSV files (I think it needs to open them all to read the headers) and really hates having to glob S3 directories.

So I think the Spark Structured Streaming + Trigger.Once code is fine; the CSV reader just needs to handle lots of small, nested files better.
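
For illustration, here is a minimal sketch of roughly that shape of job, assuming a Parquet source instead of the small-file CSV lake; the paths below are placeholders, and `s` is the schema from the question:

import org.apache.spark.sql.streaming.Trigger

// Read the Parquet lake incrementally; file-based streaming sources
// need an explicit schema (`s` here, as in the question).
val parquetDf = spark
  .readStream
  .schema(s)
  .parquet("s3a://parquet-source-lake") // placeholder path

val query = parquetDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint") // placeholder path
  .start("s3-path-to-output-lake") // placeholder path

// Blocks until the single Trigger.Once batch finishes, then returns.
query.awaitTermination()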

Powers
  • Which writeStream configuration did you use to make it work properly? We are facing the same issue. – user1171632 Oct 17 '19 at 13:36
  • @user1171632 - I am using `writeStream` with `trigger(Trigger.Once)` as described in this blog post: https://mungingdata.com/apache-spark/structured-streaming-trigger-once/ – Powers Oct 18 '19 at 09:52

The main purpose of Structured Streaming is to process data continuously, without needing to start and stop streams when new data arrives. Read this for more details.

Starting from Spark 2.0.0, StreamingQuery has a processAllAvailable method that waits for all source data to be processed and committed to the sink. Note that the Scala docs state this method is intended for testing purposes only.

Therefore the code should look like this (if you still want it):

query.processAllAvailable()
query.stop()
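
A minimal self-contained sketch of that flow, assuming `processedDf` and the placeholder paths from the question:

import org.apache.spark.sql.streaming.Trigger

val query = processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

// Blocks until all data available at the time of the call has been
// processed and committed to the sink (documented as test-only).
query.processAllAvailable()

// Stop the query explicitly so the application can exit.
query.stop()
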
Yuriy Bondaruk
  • There is a purpose to that. When you want to save money, because an idle cluster is throwing money away, it becomes extremely valuable to use Trigger.Once. – blade Oct 06 '20 at 14:39

The solution must include an external trigger, like an AWS event or something else that runs the job. Once the job runs, it picks up whatever is new by looking at the checkpoint. You can also use tools like Airflow to run it on a schedule, and Databricks has a job scheduler. So you have two choices:

  1. Run it on a schedule, like once an hour or once a day, using tools like Airflow or the Databricks job scheduler.

  2. Use something like an AWS S3 write event to trigger the job.

The downside of 1) is that you might be spinning up clusters for nothing and paying $$.

The downside of 2) is that it's more complex. You can use a queue-like structure to ensure those messages are not lost. The upside is that Databricks has Auto Loader, which does all of this for you for some sources; Auto Loader can run as a continuous stream or in a run-once style.
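
For the Auto Loader approach, here is a hedged sketch of what a run-once job could look like on Databricks; the `cloudFiles` options, schema location, and paths below are illustrative placeholders, not taken from the question:

import org.apache.spark.sql.streaming.Trigger

// Auto Loader incrementally discovers new files under the source path.
val autoLoaderDf = spark
  .readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "s3a://some-bucket/schema") // placeholder
  .load("s3a://csv-data-lake-files")

val query = autoLoaderDf
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://some-bucket/checkpoint") // placeholder
  .trigger(Trigger.Once) // or Trigger.AvailableNow() on newer runtimes
  .start("s3a://some-bucket/parquet-lake") // placeholder

// Block until the run-once batch completes.
query.awaitTermination()
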

Brian