
I've seen plenty of examples of how to read JSON and Parquet files using Structured Streaming, but I have transactions that get inserted into a database table. How do I read the data coming into that table as a stream?

I know it's a vague question, but I couldn't find any examples online, so I wasn't sure how to phrase it. I need to get that data and perform some aggregation on it in real time.

  • To immediately capture data entry into a database, you have two basic options: 1. Enhance the process that inserts the data so that it also posts it to a streaming sink; 2. Create a trigger on the database that pushes the data somewhere else. Note that realistically a trigger can only push data to another table, or database object. It's very tricky (and not recommended) to get a trigger to push data to something external like a streaming sink. You will need to define the database that you are using for a more technically specific answer. – Nick.Mc Jul 13 '21 at 01:50
  • Is it an option to mount your database and monitor the mounted directory with structured streaming? See this for an example [how to use a folder as input for Spark Structured Streaming](https://stackoverflow.com/questions/46198243/how-to-continuously-monitor-a-directory-by-using-spark-structured-streaming) – jrip Jul 13 '21 at 07:09
  • @Nick.McDermaid can you explain how option 2 would work with `structured streaming`? That trigger would be independent of the `.readStream()`, correct? I would just need to perform an ETL process to a new table and then I could read the data stream, correct? – UndefinedKid01 Jul 13 '21 at 11:56
  • Depending on how extensively you can modify the database and what type of database it is, I would 1. Use a trigger to write to a log table; 2. Use an external process to batch, extract and process changes from that log table. If for example the table being written to has a "datetime updated" field, you could read directly out of the table, possibly directly using Spark. It wouldn't exactly be streaming; it would be "microbatching". – Nick.Mc Jul 13 '21 at 12:36
  • This database would be in AWS, probably using MSSQL and Postgres. – UndefinedKid01 Jul 13 '21 at 18:34
  • Do you own the source database? can you make changes to it? Is there a date updated in the source table? – Nick.Mc Jul 13 '21 at 23:24
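The trigger-to-log-table part of option 2 can be sketched as below. This is a minimal, self-contained demo using SQLite from Python's standard library; the table, column, and trigger names are all made up for illustration, and a real MSSQL or Postgres trigger would use that engine's own trigger syntax instead.

```python
# Sketch of "option 2": an AFTER INSERT trigger copies each new row into a
# change-log table that an external streaming/ETL process can read from.
# SQLite is used here only so the example runs anywhere; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        amount REAL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Change log the downstream job reads from; rows accumulate here until
    -- the extract process consumes (and optionally deletes) them.
    CREATE TABLE transactions_log (
        log_id INTEGER PRIMARY KEY AUTOINCREMENT,
        tx_id INTEGER,
        amount REAL,
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER trg_transactions_insert
    AFTER INSERT ON transactions
    BEGIN
        INSERT INTO transactions_log (tx_id, amount)
        VALUES (NEW.id, NEW.amount);
    END;
""")

# An ordinary insert into the main table...
conn.execute("INSERT INTO transactions (amount) VALUES (42.0)")
# ...automatically lands in the log table via the trigger.
rows = conn.execute("SELECT tx_id, amount FROM transactions_log").fetchall()
print(rows)  # → [(1, 42.0)]
```

The key point is that the trigger only ever writes to another table in the same database, which keeps it fast and safe; the push to an external sink happens in a separate process that reads the log table.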
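The second half of the suggestion, polling that log table (or a table with a "datetime updated" column) in microbatches, might look like the following sketch. Again SQLite stands in for the real database, and the watermark here is the primary key rather than a timestamp; in Spark the equivalent query would typically run through `spark.read.jdbc` inside a loop or a `foreachBatch` sink.

```python
# Sketch of the "microbatching" idea: repeatedly select rows newer than a
# watermark, aggregate them, then advance the watermark. Table and column
# names are hypothetical; an auto-increment id serves as the watermark.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO transactions (amount) VALUES (?)",
                 [(10.0,), (20.0,), (30.0,)])

watermark = 0  # highest id processed so far

def next_microbatch(conn, watermark):
    """Return (rows, new_watermark) for rows inserted after the watermark."""
    rows = conn.execute(
        "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][0] if rows else watermark
    return rows, new_watermark

batch, watermark = next_microbatch(conn, watermark)
batch_total = sum(amount for _, amount in batch)
print(batch_total, watermark)  # → 60.0 3
```

Running `next_microbatch` again immediately would return an empty batch, since the watermark has advanced past all existing rows; each polling interval picks up only what arrived since the previous pass.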

0 Answers