
Note for those who are confused.

  • There is a difference between "reading from some source" (spark.read.format('...')) and "streaming from some source" (spark.readStream.format('...')); see the short sketch after this list.
  • Batch-read and structured-streaming are VERY different things.
  • This question is about streaming, not batch-reading.
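
For example, a minimal sketch with a file-based source that supports both modes (the Parquet format and the path below are just placeholders):

# Batch: a one-shot read that returns a static DataFrame
batch_df = spark.read.format("parquet").load("s3://bucket/some_path/")

# Streaming: returns a streaming DataFrame; file sources need an explicit schema,
# and the query only runs once it is started with writeStream
stream_df = spark.readStream \
    .format("parquet") \
    .schema(batch_df.schema) \
    .load("s3://bucket/some_path/")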

I want to read data from an on-prem Oracle schema (from multiple joined tables) as a Spark stream. In other words: how do I create a custom streaming data source?

I was thinking of implementing a new DataSourceV2 that reads from Oracle and stores checkpoint information (so I can invoke it on a flexible schedule and it knows where to resume the stream from), which would make my code look clean, like:

streaming_oracle_df = (spark.readStream
    .format("custom_oracle")                # <- my custom format
    .option("oracle_jdbc_str", "jdbc:...")
    .option("custom_option1", 123)
    .load()
)

streaming_oracle_df.writeStream \
    .trigger(availableNow=True) \
    .format('delta') \
    .option('checkpointLocation', 's3://bucket/checkpoint/dim_customer') \
    .start('s3://bucket/tables/dim_customer')

Is it possible to write this in Python? Or does it have to be Java? Is Scala an option too?

Kashyap
  • Related: [github: Notes on querying Oracle from Apache Spark](https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Oracle_JDBC_Howto.md). – Memristor May 07 '23 at 11:49
  • Thanks @Memristor, but info at that link is not related. See the updated question with note at top. – Kashyap May 08 '23 at 15:56

1 Answer


Spark Structured Streaming does not have a built-in JDBC source, so there is no standard streaming source for Oracle. To work around this, you can use the foreachBatch method, like below:

def foreach_batch_function(df, epoch_id):
    # On each trigger, do a plain batch read from Oracle over JDBC
    # (a batch "jdbc" source exists; a streaming one does not)
    oracle_df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:...")                      # Oracle JDBC URL
        .option("query", "select ...")                  # e.g. the multi-table join
        .option("driver", "oracle.jdbc.OracleDriver")
        .load()
    )
    # Batch write to Delta; checkpointLocation only applies to streaming writes
    (oracle_df.write
        .format('delta')
        .mode('append')
        .save('s3://bucket/tables/dim_customer')
    )

# Any built-in streaming source can drive the schedule; here "rate" is used
# purely as a trigger and its rows are ignored inside foreach_batch_function
df = spark.readStream.format("rate").load()

writing_sink = (df.writeStream
    .trigger(processingTime='10 seconds')
    .option('checkpointLocation', 's3://bucket/checkpoint/dim_customer')
    .foreachBatch(foreach_batch_function)
    .start()
    )

writing_sink.awaitTermination()
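
(Note that with foreachBatch the streaming checkpoint only tracks the driving stream's offsets; the Oracle read above is a full batch read on every trigger, so any "resume from where we left off" logic would have to be implemented inside foreach_batch_function.)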

Thanks.

Eren Sakarya
  • Yeah it's not built in, hence the custom format implementation. – Kashyap May 05 '23 at 18:18
  • Would this work for you? It is only for Inserts not Updates/Deletes. "IMPORTANT: updates and deletions will only be identified if they also advance the offset field." they said. https://github.com/AbsaOSS/Jdbc2S – Eren Sakarya May 05 '23 at 18:31
  • Thanks Eren, but info at that link is not related. See the updated question with note at top. Hope it clarifies the question. – Kashyap May 08 '23 at 15:57