
I've seen plenty of examples of how to read JSON and Parquet files using Structured Streaming, but I have transactions that get inserted into a database table. How do I read the data coming into that table as a stream?

I know it's a vague question, but I couldn't find any examples online, so I wasn't sure how to phrase it. I need to get that data and perform some aggregation on it in real time.

  • To immediately capture data entry into a database, you have two basic options: 1. Enhance the process that inserts the data so that it also posts it to a streaming sink; 2. Create a trigger on the database that pushes the data somewhere else. Note that realistically a trigger can only push data to another table, or database object. It's very tricky (and not recommended) to get a trigger to push data to something external like a streaming sink. You will need to define the database that you are using for a more technically specific answer. – Nick.Mc Jul 13 '21 at 01:50
  • Is it an option to mount your database and monitor the mounted directory with structured streaming? See this for an example [how to use a folder as input for Spark Structured Streaming](https://stackoverflow.com/questions/46198243/how-to-continuously-monitor-a-directory-by-using-spark-structured-streaming) – jrip Jul 13 '21 at 07:09
  • @Nick.McDermaid can you explain how option 2 would work with `structured streaming`? That trigger would be independent of the `.readStream()`, correct? I would just need to perform an ETL process to a new table and then I could read the data stream, correct? – UndefinedKid01 Jul 13 '21 at 11:56
  • Depending on how extensively you can modify the database and what type of database it is, I would 1. Use a trigger to write to a log table; 2. Use an external process to batch, extract and process changes from that log table. If for example the table being written to has a "datetime updated" field, you could read directly out of the table, possibly directly using Spark. It wouldn't exactly be streaming; it would be "microbatching". – Nick.Mc Jul 13 '21 at 12:36
  • This database would be in AWS, probably using MSSQL and Postgres. – UndefinedKid01 Jul 13 '21 at 18:34
  • Do you own the source database? can you make changes to it? Is there a date updated in the source table? – Nick.Mc Jul 13 '21 at 23:24
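The trigger-to-log-table part of option 2 can be sketched as below. This is a minimal, self-contained demo using SQLite from Python's standard library; the table, column, and trigger names are all made up for illustration, and a real MSSQL or Postgres trigger would use that engine's own trigger syntax instead.

```python
# Sketch of "option 2": an AFTER INSERT trigger copies each new row into a
# change-log table that an external streaming/ETL process can read from.
# SQLite is used here only so the example runs anywhere; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        amount REAL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Change log the downstream job reads from; rows accumulate here until
    -- the extract process consumes (and optionally deletes) them.
    CREATE TABLE transactions_log (
        log_id INTEGER PRIMARY KEY AUTOINCREMENT,
        tx_id INTEGER,
        amount REAL,
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER trg_transactions_insert
    AFTER INSERT ON transactions
    BEGIN
        INSERT INTO transactions_log (tx_id, amount)
        VALUES (NEW.id, NEW.amount);
    END;
""")

# An ordinary insert into the main table...
conn.execute("INSERT INTO transactions (amount) VALUES (42.0)")
# ...automatically lands in the log table via the trigger.
rows = conn.execute("SELECT tx_id, amount FROM transactions_log").fetchall()
print(rows)  # → [(1, 42.0)]
```

The key point is that the trigger only ever writes to another table in the same database, which keeps it fast and safe; the push to an external sink happens in a separate process that reads the log table.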
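The second half of the suggestion, polling that log table (or a table with a "datetime updated" column) in microbatches, might look like the following sketch. Again SQLite stands in for the real database, and the watermark here is the primary key rather than a timestamp; in Spark the equivalent query would typically run through `spark.read.jdbc` inside a loop or a `foreachBatch` sink.

```python
# Sketch of the "microbatching" idea: repeatedly select rows newer than a
# watermark, aggregate them, then advance the watermark. Table and column
# names are hypothetical; an auto-increment id serves as the watermark.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO transactions (amount) VALUES (?)",
                 [(10.0,), (20.0,), (30.0,)])

watermark = 0  # highest id processed so far

def next_microbatch(conn, watermark):
    """Return (rows, new_watermark) for rows inserted after the watermark."""
    rows = conn.execute(
        "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][0] if rows else watermark
    return rows, new_watermark

batch, watermark = next_microbatch(conn, watermark)
batch_total = sum(amount for _, amount in batch)
print(batch_total, watermark)  # → 60.0 3
```

Running `next_microbatch` again immediately would return an empty batch, since the watermark has advanced past all existing rows; each polling interval picks up only what arrived since the previous pass.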

0 Answers