
I have implemented my own structured streaming data source in Spark against a proprietary vendor messaging system. It uses V2 of the structured streaming API, implementing MicroBatchReadSupport and DataSourceRegister. I modeled it closely on some examples found here, and I also followed the advice given at this Stack Overflow post. At first everything seems to start up properly when I call load on the readStream. However, when I direct the query to a writeStream, it tries to instantiate another MicroBatchReader. This fails fast because I have a check in the createMicroBatchReader method that throws an exception if no schema was provided, and on the second call to createMicroBatchReader a schema isn't provided even though the initial query did provide one. My code to start the stream (closely following the examples from the Spark documentation) looks like the following:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object Streamer {

  def main(args: Array[String]): Unit = {

    val schema = StructType(
      StructField("TradeId", LongType, nullable = true) ::
      StructField("Source", StringType, nullable = true) :: Nil
    )

    val spark = SparkSession
      .builder
      .getOrCreate()

    val ampsStream = spark.readStream
      .format("amps")
      .option("topic", "/test")
      .option("server", "SOME_URL")
      .schema(schema)
      .load()

    ampsStream.printSchema()

    val query = ampsStream.writeStream.format("console").start()

    query.awaitTermination()
  }
}

I've put breakpoints and debug statements in to test, and createMicroBatchReader gets called again right when I reach writeStream.start. As mentioned, the odd thing, too, is that the second time around the Optional that is passed into createMicroBatchReader is empty, whereas the first call properly had the schema. Any guidance would be greatly appreciated.
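For completeness, one obvious mitigation would be to stop requiring the schema at reader-creation time and fall back to a fixed schema when the Optional comes in empty, roughly like the sketch below (the hard-coded fallback simply mirrors the schema from the driver code above; this is purely illustrative and doesn't explain why the second, schema-less call happens in the first place):

// Sketch only: tolerate the empty schema Optional on the second call by
// falling back to a hard-coded schema (same fields as the driver code above).
// Assumes the usual imports (java.util.Optional, org.apache.spark.sql.types._).
override def createMicroBatchReader(schema: Optional[StructType],
                                    checkpointLocation: String,
                                    options: DataSourceOptions): MicroBatchReader = {
  val fallback = StructType(
    StructField("TradeId", LongType, nullable = true) ::
    StructField("Source", StringType, nullable = true) :: Nil
  )
  new AmpsMicroBatchReader(if (schema.isPresent) schema.get else fallback, options)
}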

EDIT: I added some debugging statements and tested this out with the above-mentioned repo at https://github.com/hienluu/wikiedit-streaming, and I see the exact same issue when running WikiEditSourceV2Example.scala from that repo. I'm not sure whether this is a bug, or whether the author of the aforementioned repo and I are both missing something.

EDIT 2: Adding the code for the amps streaming source:

import java.util.Optional

import org.apache.spark.internal.Logging
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.types.StructType


class AmpsStreamingSource extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {

  override def shortName(): String = "amps"

  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader = {

    println("AmpsStreamingSource.createMicroBatchReader was called")
    if (schema.isPresent) new AmpsMicroBatchReader(schema.get, options)
    else throw new IllegalArgumentException("Must provide a schema for amps stream source")
  }
}

and the signature of AmpsMicroBatchReader

class AmpsMicroBatchReader(schema: StructType, options: DataSourceOptions)
    extends MicroBatchReader with MessageHandler with Logging 
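For anyone not familiar with the 2.3 V2 API, the reader itself implements roughly the following methods (bodies stubbed with ??? here; this is only the shape of the interface as I understand it from the Spark 2.3 javadocs, not my actual implementation):

import java.util.Optional

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.DataSourceOptions
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.types.StructType

class SkeletonMicroBatchReader(schema: StructType, options: DataSourceOptions)
    extends MicroBatchReader {

  // Offsets bounding the current micro-batch; set by the engine on every trigger.
  override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = ???
  override def getStartOffset(): Offset = ???
  override def getEndOffset(): Offset = ???
  override def deserializeOffset(json: String): Offset = ???
  override def commit(end: Offset): Unit = ???
  override def stop(): Unit = ???

  // From DataSourceReader: the schema the engine uses for rows in this batch.
  override def readSchema(): StructType = schema

  // One factory per partition; each factory produces a DataReader[Row].
  override def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]] = ???
}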
  • Care to share the code of the `amps` source? – Jacek Laskowski Jul 19 '18 at 21:02
  • @JacekLaskowski I've posted the code for the amps source. It is difficult for me to post the implementation of the AmpsMicroBatchReader as it has a good bit of spaghetti'd proprietary code and is also a few hundred lines long. I can code up a smaller example though and put on github if it would help. Essentially though, when running the Streamer code I can see the print statement run twice, and on the second execution it throws the IllegalArgumentException because the Optional for the schema is no longer populated. Let me know if I can provide any more info that could be relevant. – Kevin Mooney Jul 19 '18 at 22:34
  • _"I can code up a smaller example though and put on github if it would help."_ That would be very helpful, as I'd have to do it anyway. – Jacek Laskowski Jul 20 '18 at 04:37
  • @JacekLaskowski I've created a [sample project here](https://github.com/moonkev/spark-streaming-example). I took out the exception when the schema optional is empty, but I do log both the call to createMicroBatchReader and the value of the schema argument. If you run the StreamingExample class, you will see that createMicroBatchReader is called twice for a single stream, and that the second time the schema optional is empty (I pass in a dummy schema just to test that). Hopefully my example is clear enough, but if not, let me know if you have any questions or what I can add. – Kevin Mooney Jul 21 '18 at 19:08
  • Did anyone figure this out? I'm facing the same issue. Any leads would be appreciated. – Yoyo May 05 '20 at 06:31
  • @Yoyo No, it's been a while since I worked on this, but we ended up sticking with the older streaming API. I've not tried it in newer versions of Spark, so this may no longer be an issue. I am not sure which version you tried with, but I was working with Spark 2.3. – Kevin Mooney May 15 '20 at 21:59
  • @KevinMooney Yes, I am working with Spark 2.3. For the time being I made the class a singleton so that it won't execute twice. By the way, what was the older streaming API that you were using? – Yoyo May 18 '20 at 06:13
  • I work with Spark 3.1.1, and it still loads the data source twice. – Venus Aug 01 '21 at 12:13

0 Answers