
In a Spark Structured Streaming job, one input comes from a Kafka topic while the second input is a file (refreshed every 5 minutes by a Python API). I need to join these two inputs and write the result to a Kafka topic.
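For context, the shape of the job is roughly this (broker address, topic names, and the join column `ip` are placeholders, not my real values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

// Streaming input from Kafka
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS ip")

// Static input: the CSV directory that the Python API rewrites every 5 minutes
val staticDf = spark.read
  .option("header", "true")
  .csv("file:/home/hduser/code/new/collect_ip1")

// Stream-static join, written back to Kafka
kafkaDf.join(staticDf, Seq("ip"))
  .selectExpr("CAST(ip AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output_topic")
  .option("checkpointLocation", "/tmp/checkpoints/stream-static-join")
  .start()
```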

The issue I am facing: when the second input file is being refreshed and the Spark streaming job reads it at the same time, I get the error below:

File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.

Any help will be appreciated.


1 Answer


Use HBase as your store for the static data. It is more work for sure, but it allows for concurrent updating.

Where I work, all Spark Streaming jobs use HBase for data lookup. Far faster. What if you have 100M customers for a micro-batch of 10K records? I know it was a lot of work initially.

See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
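The lookup pattern is roughly this, a minimal sketch assuming an HBase table `ip_ref` with a column family `d` and a `country` qualifier (all placeholder names, not a prescribed schema):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.DataFrame

// Enrich a micro-batch via point lookups against HBase instead of a
// stream-static join; the static side lives in HBase and can be
// updated concurrently by the Python job.
def enrichWithHBase(batch: DataFrame): DataFrame = {
  val spark = batch.sparkSession
  import spark.implicits._
  batch.select("ip").as[String].mapPartitions { ips =>
    // One HBase connection per partition, not per record
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("ip_ref"))
    val enriched = ips.map { ip =>
      val result  = table.get(new Get(Bytes.toBytes(ip)))
      val country = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("country")))
        .map(b => Bytes.toString(b))
        .getOrElse("unknown")
      (ip, country)
    }.toList // materialise before closing the connection
    table.close(); conn.close()
    enriched.iterator
  }.toDF("ip", "country")
}
```

Call it from `foreachBatch` on the Kafka stream, so the lookup runs per micro-batch and only fetches the keys actually present in that batch.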

If you have a small static reference table, then a stream-static join is fine, but you also have the concurrent updating, which is what is causing your issue.
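If you stay with the file approach, a minimal sketch of one workaround (an assumption on my part, not the HBase route): re-read the CSV inside `foreachBatch`, so each micro-batch takes its own fresh snapshot of the files instead of relying on a cached file listing. Broker, topics, and the join column remain placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("ref-join").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS ip")

stream.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // Fresh read per micro-batch: no stale cached file listing
    val refDf = batch.sparkSession.read
      .option("header", "true")
      .csv("file:/home/hduser/code/new/collect_ip1")
    batch.join(refDf, Seq("ip"))
      .selectExpr("CAST(ip AS STRING) AS key", "to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "output_topic")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/ref-join")
  .start()
```

This still assumes the Python job replaces the files atomically (e.g. writes to a temp directory and renames it into place); otherwise a micro-batch can still catch a half-written set.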

thebluephantom