I have implemented my own structured streaming data source in Spark against a proprietary vendor messaging system. It uses V2 of the structured streaming API, implementing MicroBatchReadSupport and DataSourceRegister. I modeled it largely after some examples found here, and I also followed the advice given in this Stack Overflow post. At first everything seems to start up properly when I call load on the readStream. However, when I direct the query to a writeStream, Spark calls createMicroBatchReader again and tries to instantiate a second MicroBatchReader. This fails fast because I have a check in createMicroBatchReader that throws an exception if no schema was provided, and on this second call no schema is provided, even though the initial query did provide one. My code to start the stream (closely following the examples from the Spark documentation) looks like the following:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object Streamer {

  def main(args: Array[String]): Unit = {

    val schema = StructType(
      StructField("TradeId", LongType, nullable = true) ::
      StructField("Source", StringType, nullable = true) :: Nil
    )

    val spark = SparkSession
      .builder
      .getOrCreate()

    val ampsStream = spark.readStream
      .format("amps")
      .option("topic", "/test")
      .option("server", "SOME_URL")
      .schema(schema)
      .load()

    ampsStream.printSchema()

    val query = ampsStream.writeStream.format("console").start()
    query.awaitTermination()
  }
}
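For completeness, the "amps" short name is registered the standard DataSourceRegister way, through a service-loader file on the classpath (the package name below is just a placeholder for my real one):

src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

containing the single line

com.example.amps.AmpsStreamingSource

so resolving format("amps") to the source class is not the problem; shortName() is found and the first createMicroBatchReader call happens as expected.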
I've put breakpoints and debug statements in to test, and createMicroBatchReader gets called again right when I hit writeStream.start. As mentioned, the odd thing too is that the second time around the Optional schema passed into createMicroBatchReader is empty, whereas the first call properly has the schema. Any guidance would be greatly appreciated.
EDIT: I added some debugging statements and tested this out with the above-mentioned repo at https://github.com/hienluu/wikiedit-streaming, and I see the exact same issue when running WikiEditSourceV2Example.scala from that repo. Not sure if this is a bug, or if the author of the aforementioned repo and I are both missing something.
EDIT 2: Adding the code for the amps streaming source:
import java.util.Optional
import org.apache.spark.internal.Logging
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.types.StructType
class AmpsStreamingSource extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {

  override def shortName(): String = "amps"

  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader = {
    println("AmpsStreamingSource.createMicroBatchReader was called")
    if (schema.isPresent) new AmpsMicroBatchReader(schema.get, options)
    else throw new IllegalArgumentException("Must provide a schema for the amps stream source")
  }
}
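The only workaround I have come up with so far is to stop relying on the Optional entirely and pass the schema through the options as JSON, roughly like the sketch below (the schema.json option name is my own invention, not something Spark defines), though I would much rather understand why the second call loses the schema in the first place:

import java.util.Optional
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.types.{DataType, StructType}

class AmpsStreamingSourceWithFallback extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {

  override def shortName(): String = "amps"

  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader = {
    // Prefer the schema Spark passes in; otherwise fall back to a JSON-serialized
    // schema smuggled through the options, set on the read side with
    // .option("schema.json", schema.json).
    val resolvedSchema: StructType =
      if (schema.isPresent) {
        schema.get
      } else if (options.get("schema.json").isPresent) {
        DataType.fromJson(options.get("schema.json").get).asInstanceOf[StructType]
      } else {
        throw new IllegalArgumentException("Must provide a schema for the amps stream source")
      }
    new AmpsMicroBatchReader(resolvedSchema, options)
  }
}

On the driver side this would just mean adding .option("schema.json", schema.json) next to the other options, which feels like it defeats the purpose of the schema(...) call.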
Here is the signature of AmpsMicroBatchReader:
class AmpsMicroBatchReader(schema: StructType, options: DataSourceOptions)
extends MicroBatchReader with MessageHandler with Logging
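To make it clearer where the schema ends up, here is a trimmed, illustrative skeleton of the reader (assuming Spark 2.4's planInputPartitions; the vendor MessageHandler trait and all of the real offset and partition logic are stripped out):

import java.util.Optional
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.DataSourceOptions
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.types.StructType

class AmpsMicroBatchReaderSketch(schema: StructType, options: DataSourceOptions)
  extends MicroBatchReader with Logging {

  // Whatever schema reaches the constructor is the schema Spark sees at runtime,
  // which is why the empty Optional on the second createMicroBatchReader call matters.
  override def readSchema(): StructType = schema

  // Offset bookkeeping and partition planning omitted -- not relevant to the question.
  override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = ???
  override def getStartOffset(): Offset = ???
  override def getEndOffset(): Offset = ???
  override def deserializeOffset(json: String): Offset = ???
  override def commit(end: Offset): Unit = ???
  override def stop(): Unit = ???
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] = ???
}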