
I've had a go at implementing a structured stream like so:

myDataSet
  .map(r => StatementWrapper.Transform(r))
  .writeStream
  .foreach(MyWrapper.myWriter)
  .start()
  .awaitTermination()

This all seems to work, but the throughput of MyWrapper.myWriter is horrible. It's effectively trying to be a JDBC sink; it looks like this:

import java.sql.{Connection, SQLException, SQLSyntaxErrorException}

import scala.util.Try

import org.apache.spark.sql.ForeachWriter

val myWriter: ForeachWriter[Seq[String]] = new ForeachWriter[Seq[String]] {

  var connection: Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    Try (connection = getRemoteConnection).isSuccess
  }

  override def process(row: Seq[String]): Unit = {
    val statement = connection.createStatement()
    try {
      row.foreach( s => statement.execute(s) )
    } catch {
      case e: SQLSyntaxErrorException => println(e)
      case e: SQLException => println(e)
    } finally {
      statement.closeOnCompletion()
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    connection.close()
  }
}

So my question is: is a new ForeachWriter instantiated for each row, so that open() and close() are called for every row in the dataset?

Is there a better design to improve throughput?

How can I parse the SQL statement once and execute it many times, while keeping the database connection open?

Exie

1 Answer


Opening and closing of the underlying sink depends on your implementation of ForeachWriter.

The relevant class which invokes your ForeachWriter is ForeachSink, and this is the code which calls your writer:

data.queryExecution.toRdd.foreachPartition { iter =>
  if (writer.open(TaskContext.getPartitionId(), batchId)) {
    try {
      while (iter.hasNext) {
        writer.process(encoder.fromRow(iter.next()))
      }
    } catch {
      case e: Throwable =>
        writer.close(e)
        throw e
    }
    writer.close(null)
  } else {
    writer.close(null)
  }
}

Opening and closing of the writer is attempted once per partition for each batch generated from your source, not once per row. If you want open and close to literally open and close the sink connection each time, you'll need to do so in your implementation; otherwise the connection you open can be kept alive and reused for every row in that partition.
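For example, here is a minimal sketch of a writer that exploits this. It stays as close to the question's code as possible (same Seq[String] of raw SQL statements, same undefined getRemoteConnection helper), so treat it as an assumption-laden outline rather than a drop-in implementation: one connection per partition, statements queued with addBatch and flushed once in close().

import java.sql.{Connection, SQLException, Statement}

import scala.util.Try

import org.apache.spark.sql.ForeachWriter

val batchedWriter: ForeachWriter[Seq[String]] = new ForeachWriter[Seq[String]] {

  var connection: Connection = _
  var statement: Statement = _

  // Called once per partition of each micro-batch, not once per row.
  override def open(partitionId: Long, version: Long): Boolean = {
    Try {
      connection = getRemoteConnection   // same undefined helper as in the question
      connection.setAutoCommit(false)
      statement = connection.createStatement()
    }.isSuccess
  }

  // Queue the statements instead of paying a network round trip per execute().
  override def process(row: Seq[String]): Unit = {
    row.foreach(s => statement.addBatch(s))
  }

  // Also called once per partition: flush the batch and release the connection.
  override def close(errorOrNull: Throwable): Unit = {
    try {
      if (errorOrNull == null && statement != null) {
        statement.executeBatch()
        connection.commit()
      }
    } catch {
      case e: SQLException => println(e)
    } finally {
      if (statement != null) statement.close()
      if (connection != null) connection.close()
    }
  }
}

If your rows can be expressed as parameters to a single INSERT rather than arbitrary SQL text, switching to a PreparedStatement would additionally let the database parse the statement once and reuse it, which covers the "parse once, execute many times" part of the question.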

If you want more control over how the data is handled, you can implement the Sink trait, which gives you the batch id and the underlying DataFrame:

trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit
}
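
As a rough illustration only: the names JdbcBatchSink and writeToDatabase below are made up, and in Spark 2.x Sink lives in the internal org.apache.spark.sql.execution.streaming package, so it can change between releases. A skeleton could look something like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class JdbcBatchSink extends Sink {

  @volatile private var lastCommittedBatchId = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Batches may be replayed after a failure, so skip ones already committed.
    if (batchId > lastCommittedBatchId) {
      // The DataFrame handed to addBatch is bound to the streaming query's plan;
      // Spark's own ConsoleSink materialises the rows before reusing them, so do the same here.
      val batchDf = data.sparkSession.createDataFrame(
        data.sparkSession.sparkContext.parallelize(data.collect().toSeq), data.schema)

      writeToDatabase(batchDf)   // placeholder for the actual JDBC write
      lastCommittedBatchId = batchId
    }
  }

  private def writeToDatabase(df: DataFrame): Unit = {
    // e.g. df.write.mode("append").jdbc(url, table, connectionProperties)
  }
}

Note that collecting each batch to the driver, as ConsoleSink does, only makes sense for modest batch sizes; for a high-throughput JDBC sink the per-partition ForeachWriter approach above keeps the work on the executors. To plug a custom Sink into writeStream.format(...) you would also need to register it through a StreamSinkProvider.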
Yuval Itzchakov
  • Thanks a lot for this. Is there any way you can explain, or point to an article which describes, what to consider when implementing a Sink in all three modes (append, update and complete)? – user1870400 May 19 '18 at 09:11
  • My question really is: how do the output modes and the Sink interface interplay? – user1870400 May 19 '18 at 09:59
  • @user1870400 AFAIK, the Sink gets the relevant part of the data according to the output mode you've set. For example, if you only want updates, then the Sink would only ever get the elements updated per this job iteration. – Yuval Itzchakov May 20 '18 at 08:23
  • What does batch mean here? More details about batch would be great. It looks like open and close are called only once for each partition. – spats Oct 27 '18 at 00:27
  • OK, batch here is the micro-batch id in streaming. For more details check https://docs.azuredatabricks.net/spark/latest/structured-streaming/foreach.html#execution-semantics – spats Oct 27 '18 at 06:47