Note for those who are confused:
- There is a difference between reading from a source (spark.read.format('...')) and streaming from a source (spark.readStream.format('...')).
- Batch reads and Structured Streaming are VERY different things.
- This question is about streaming, not batch reading.
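A minimal contrast, assuming an existing SparkSession named spark (the connection details are made up, and "rate" is just a built-in demo streaming source):

# Batch: a one-shot, bounded read; returns all matching rows immediately.
batch_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//host:1521/service")
    .option("dbtable", "SOME_SCHEMA.SOME_TABLE")
    .load()
)

# Streaming: an unbounded DataFrame; nothing runs until a writeStream query starts.
stream_df = spark.readStream.format("rate").load()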
I want to read data from an on-prem Oracle schema (multiple joined tables) as a Spark stream. In other words: how do I create a custom streaming data source?
My idea is to implement a new DataSourceV2 source that reads from Oracle and stores its own checkpoint information (so I can invoke it on a flexible schedule and it knows where to resume the stream from), making my code look clean, like this:
streaming_oracle_df = (
    spark.readStream
    .format("custom_oracle")  # <- my custom format
    .option("oracle_jdbc_str", "jdbc:...")
    .option("custom_option1", 123)
    .load()
)
(
    streaming_oracle_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoint/dim_customer")
    .start("s3://bucket/tables/dim_customer")
)
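For concreteness, here is a rough sketch of the kind of source class I imagine writing. I'm assuming something like PySpark's Python Data Source API (pyspark.sql.datasource, added in Spark 4.0); every name below (OracleStreamDataSource, the scn offset field, the placeholder row) is hypothetical, not a working connector:

from pyspark.sql.datasource import (
    DataSource,
    DataSourceStreamReader,
    InputPartition,
)

class OracleStreamReader(DataSourceStreamReader):
    """Hypothetical micro-batch reader that tracks an Oracle SCN as its offset."""

    def __init__(self, options):
        self.jdbc_url = options.get("oracle_jdbc_str")

    def initialOffset(self):
        # First-ever run: start from SCN 0 (offsets are plain dicts).
        return {"scn": 0}

    def latestOffset(self):
        # Would query Oracle for the current SCN; hard-coded for the sketch.
        return {"scn": 100}

    def partitions(self, start, end):
        # One partition covering the (start, end] SCN range.
        return [InputPartition((start["scn"], end["scn"]))]

    def read(self, partition):
        # Would run the multi-table join against Oracle for this SCN range
        # and yield rows as tuples matching the declared schema.
        lo, hi = partition.value
        yield (lo, hi)  # placeholder row

    def commit(self, end):
        # Called once a micro-batch is checkpointed; nothing extra to do here.
        pass

class OracleStreamDataSource(DataSource):
    @classmethod
    def name(cls):
        return "custom_oracle"

    def schema(self):
        return "start_scn INT, end_scn INT"

    def streamReader(self, schema):
        return OracleStreamReader(self.options)

# Registering the class is what would make format("custom_oracle") resolvable:
# spark.dataSource.register(OracleStreamDataSource)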
Is it possible to write this in Python, or does it have to be Java? Is Scala an option too?