
Using the example from https://github.com/sutugin/spark-streaming-jdbc-source I've attempted to connect to a Postgres database as a streaming source in AWS Databricks.

I have a cluster running: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)

This library is installed on my cluster: org.apache.spark:spark-streaming_2.12:3.3.2

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredJDBC")
  .getOrCreate()

import spark.implicits._

val jdbcOptions = Map(
  "user" -> "myusername",
  "password" -> "mypassword",
  "database" -> "testDB",
  "driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://dbhostname:5432:mem:myDb;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false"
)

// Create DataFrame representing the stream of input lines from jdbc
val stream = spark.readStream
  .format("jdbc-streaming")
  .options(jdbcOptions + ("dbtable" -> "dimensions_test_table") + ("offsetColumn" -> "loaded_timestamp"))
  .load()

// Start running the query that prints 'select result' to the console
val query = stream.writeStream
.outputMode("append")
.format("console")
.start()

query.awaitTermination()

But I'm plagued with this error:

NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport
Caused by: ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport

The only info I can find on this error doesn't appear to apply to my situation. What am I missing?

I've looked for other libraries, but this appears to be the only one that supports JDBC as a streaming source on Scala 2.12.

Michael Woods

1 Answer

There are a few problems here:

  • You don't need to install the org.apache.spark:spark-streaming_2.12:3.3.2 library on a Databricks cluster. The Databricks runtime already includes all necessary Spark libraries, and by installing the open-source version you will most probably break the Databricks-specific modifications.

  • To use this library you need to compile it yourself and install the resulting jar onto the cluster. But as far as I can see, it hasn't been updated for 4 years, and by default it's compiled for Spark 3.0 (which matches DBR 7.3), so it would need to be rebuilt against the Spark version your runtime ships.
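If you do try the rebuild route, the version bump would happen in the library's sbt build definition. A hypothetical sketch — the actual settings and file layout in the spark-streaming-jdbc-source repo may differ, and the streaming source APIs changed between Spark versions, so source-code changes may be needed on top of this:

```scala
// build.sbt — hypothetical sketch, not the repo's actual build file.
ThisBuild / scalaVersion := "2.12.15"

// Match the Spark version bundled with your Databricks runtime
// (DBR 11.3 LTS ships Spark 3.3.0).
val sparkVersion = "3.3.0"

libraryDependencies ++= Seq(
  // Provided scope: Databricks supplies Spark at runtime,
  // so the jar must not bundle its own Spark classes.
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "org.postgresql"    % "postgresql" % "42.5.0"
)
```

After `sbt package`, you would install the resulting jar on the cluster via the Libraries UI rather than pulling the open-source Spark artifacts from Maven.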

If you want to get changes from the database, you may look into Change Data Capture functionality, like CDC for RDS MySQL. The data could then be landed in S3 and picked up, for example, with Delta Live Tables implementing the CDC pattern.
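For the pickup side of that pattern, an incremental file stream over the landed CDC files is one option — for example with Databricks Auto Loader (the Databricks-only "cloudFiles" source). A minimal sketch, assuming hypothetical S3 paths, a hypothetical target table name, and that the CDC files arrive as JSON; this only runs on a Databricks cluster:

```scala
// Sketch: incrementally ingest CDC files landed in S3 using Auto Loader.
// All paths and the table name below are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

val cdcStream = spark.readStream
  .format("cloudFiles")                     // Auto Loader source
  .option("cloudFiles.format", "json")      // format of the landed CDC files
  .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/dimensions")
  .load("s3://my-bucket/cdc/dimensions_test_table/")

// Append the raw change records into a bronze Delta table;
// downstream DLT / MERGE logic can then apply the changes.
cdcStream.writeStream
  .option("checkpointLocation", "s3://my-bucket/_checkpoints/dimensions")
  .toTable("dimensions_bronze")
```

The design point is that Auto Loader tracks which files it has already processed, so each micro-batch only picks up newly landed CDC files — which is the behavior the jdbc-streaming library was approximating with its offsetColumn.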

Alex Ott