Our ETL pipeline uses Spark Structured Streaming to enrich incoming data (joining it with static DataFrames) before writing it to Cassandra. The lookup tables are currently CSV files in HDFS, which are loaded as DataFrames and joined with each batch of data on every trigger. It appears the lookup-table DataFrames are broadcast on every trigger and kept in the MemoryStore. This eats up executor memory until the executors hit OOM and are killed by Mesos: Log of executor
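For reference, here is a stripped-down sketch of what the job does. The paths, Kafka topic, join key, and Cassandra keyspace/table names are all placeholders, and the stream parsing is elided:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("enrichment-pipeline").getOrCreate()

// Static lookup table, read once from CSV in HDFS (path is hypothetical).
val lookupDf = spark.read
  .option("header", "true")
  .csv("hdfs:///lookups/lookup_table.csv")

// Streaming source (broker/topic are hypothetical; real parsing elided).
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS key")

// Stream-static join; the static side appears to be re-broadcast
// on every micro-batch trigger ("key" is a hypothetical join column).
val enriched = streamDf.join(lookupDf, Seq("key"))

// Write each micro-batch to Cassandra (keyspace/table are hypothetical).
enriched.writeStream
  .foreachBatch { (batchDf: DataFrame, _: Long) =>
    batchDf.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "ks")
      .option("table", "events_enriched")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "hdfs:///checkpoints/enrich")
  .start()
```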
As the log above shows, the lookup-table DataFrames being joined are stored as broadcast variables, and the executor is killed due to OOM.
The following is the driver log at the same time: Driver Log
The following are the Spark configurations: Spark Conf
Is there a better approach for joining with static datasets in Spark Structured Streaming? Or, how can the executor OOM be avoided in the case above?