
Our ETL pipeline uses Spark Structured Streaming to enrich incoming data (joining it with static DataFrames) before storing it to Cassandra. Currently the lookup tables are CSV files (in HDFS) which are loaded as DataFrames and joined with each batch of data on every trigger. It seems the lookup-table DataFrames are broadcast on every trigger and stored in the MemoryStore. This eats up executor memory, and eventually the executor hits OOM and is killed by Mesos: Log of executor
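
For reference, a minimal sketch of the setup described above, assuming a Kafka source, a lookup CSV with a matching `join_key` column, and a console sink in place of the Cassandra writer (all of these are illustrative assumptions, not the actual job):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("enrich-stream").getOrCreate()

// Static lookup table: a CSV file in HDFS loaded as a DataFrame
// (assumed to contain a "join_key" column matching the stream)
val lookupDF = spark.read
  .option("header", "true")
  .csv("hdfs:///lookups/lookup_table.csv")

// Streaming source (Kafka used here purely as an example)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS join_key")  // simplified parsing

// Stream-static join; on every trigger the static side is broadcast to executors
val enriched = events.join(lookupDF, Seq("join_key"), "left_outer")

// Console sink so the sketch is self-contained; the real job writes to Cassandra
val query = enriched.writeStream
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/enrich")
  .start()

query.awaitTermination()
```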

As can be seen in the link above, the lookup-table DataFrames being joined are stored as broadcast variables, and the executor is killed due to OOM.

The following is the driver log at the same time: Driver Log

The following are the Spark configurations: Spark Conf

Is there a better approach for joining with static datasets in Spark Structured Streaming? Or how can the executor OOM be avoided in the above case?

  • Hi, I'm also facing OOM on Spark 2.3 stream-to-stream joins and stream aggregations. I saw it's a known issue: https://stackoverflow.com/questions/49215321/memory-issue-with-spark-structured-streaming and there is also an open JIRA on it: [SPARK-23682] Memory issue with Spark structured streaming - ASF JIRA. Ours is caused by OOM of org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider. If you find any solution I would love to hear about it. – Arnon Rodman Jun 27 '18 at 05:54
  • To date, we haven't found a solution to fix the problem. However, as a workaround we have created a scheduling mechanism to restart the structured streaming job every two hours, before the OOM failure. This does not affect the structured streaming state, as it's backed by HDFS and is restored on restart. – Saad Hashmi Jul 18 '18 at 06:57
  • Thanks for the reply. How do you restart every two hours? We are running on a standalone cluster. What did you use for the "scheduling mechanism"? And didn't it affect your cluster memory / other jobs? Our join was at 150 GB of memory after two days and then it crashed with OOM. – Arnon Rodman Jul 18 '18 at 07:51
  • For the scheduling part, we terminate the streaming query via the Java Executors API. We created a streaming query listener via the StreamingQueryManager ([link](https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-demo-StreamingQueryManager-awaitAnyTermination-resetTerminated.html)) which, on a termination event, sends a POST request to an HTTP server which then restarts the job (see the sketch below). – Saad Hashmi Jul 18 '18 at 10:15
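
To make the workaround from the last two comments concrete, here is a minimal, hedged sketch: a `StreamingQueryListener` registered on the `StreamingQueryManager` that POSTs to a restart endpoint on query termination, plus a `ScheduledExecutorService` that stops the query after two hours. The endpoint URL and the placeholder rate-source query are assumptions for illustration, not the commenter's actual code.

```scala
import java.net.{HttpURLConnection, URL}
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("restart-on-terminate").getOrCreate()

// Notify an external restart service whenever a streaming query terminates.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    // Hypothetical restart endpoint; the real service is whatever relaunches the job.
    val conn = new URL("http://restart-service:8080/restart")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.close()  // empty request body
    conn.getResponseCode          // fire the request; response ignored
    conn.disconnect()
  }
})

// Placeholder query so the sketch is self-contained; in the real pipeline this is
// the enrichment query returned by writeStream.start().
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

// Stop the query gracefully after two hours; the listener above then triggers
// the external restart before memory grows enough to cause an OOM.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.schedule(new Runnable {
  override def run(): Unit = query.stop()
}, 2, TimeUnit.HOURS)

query.awaitTermination()
```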
