
I have a Databricks Kafka producer that needs to write 62M records to a Kafka topic. Will there be an issue if I write all 62M records at once, or do I need to iterate, say, 20 times and write 3M records per iteration?

Here is the code.

Cmd1 val srcDf = spark.read.format("delta").load("/mnt/data-lake/data/silver/geocodes").filter($"LastUpdateDt"===lastUpdateDt)

Cmd2 val strDf = srcDf
        .withColumn("key",...
        .withColumn("topLevelRecord",...

Cmd3 strDf
 .select(
 to_avro($"key", lit("topic-AVRO-key"), schemaRegistryAddr).as("key"),
 to_avro($"topLevelRecord", lit("topic-AVRO-value"), schemaRegistryAddr, avroSchema).as("value"))
 .write
 .format("kafka")
 .option("kafka.bootstrap.servers", bootstrapServers)
 .option("kafka.security.protocol", "SSL")
 .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
 .option("kafka.ssl.keystore.password", keystorePassword)
 .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
 .option("topic",topic)
 .save()

My question is: if strDf.count is 62M, can I write it to Kafka directly, or do I need to run Cmd3 iteratively?

  • I have added the code above. My question is: if strDf.count is 62M, can I write it to Kafka directly using Cmd3, or do I need to run Cmd3 several times, say 20 times, writing 3M records at a time? – Don Sam Aug 30 '20 at 18:56
  • Thanks for the insight, Mike. Kafka is installed on-prem, and when I tried writing to it from a Databricks notebook, the write timed out: out of 62M records, 4.9M were written. I have written a batch Kafka producer, not a Spark Structured Streaming job; I used batch because I write to the broker only once a day, whereas streaming would have handled this by sending the records in micro-batches. In my case I will try writing in several iterations using a loop; at least I know that with the current configuration I can write ~4.5M records at a time. Do you think that's the right approach? – Don Sam Sep 02 '20 at 01:18

1 Answer


There is no limit on the amount of data you can write to Kafka through Spark's Kafka integration. As you will see below, your query creates a (pool of) KafkaProducers that is used to iterate through the rows of your DataFrame. Kafka can handle this volume of messages; there is no hard limit.

It might be interesting to note that Kafka buffers messages into a batch before that batch is actually written to the brokers. This is controlled by the KafkaProducer configurations linger.ms, batch.size and max.request.size, so it may be useful to tune those settings for your overall set-up.
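
As a side note (this snippet is not part of the original question), the Spark Kafka sink forwards every option prefixed with kafka. to the underlying KafkaProducer, so those settings can be tuned directly in the batch write. The values below are purely illustrative assumptions, not recommendations:

  // Sketch: the batch write from Cmd3 with illustrative producer tuning.
  // "kafka."-prefixed options are passed through to the KafkaProducer.
  strDf
    .select(
      to_avro($"key", lit("topic-AVRO-key"), schemaRegistryAddr).as("key"),
      to_avro($"topLevelRecord", lit("topic-AVRO-value"), schemaRegistryAddr, avroSchema).as("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", topic)                         // SSL options from Cmd3 omitted for brevity
    .option("kafka.linger.ms", "50")                // wait up to 50 ms to fill a producer batch
    .option("kafka.batch.size", "262144")           // 256 KB producer batches
    .option("kafka.max.request.size", "1048576")    // 1 MB maximum request size (the Kafka default)
    .save()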

Here is the relevant code from the spark-sql-kafka library:

Internally, Spark will create a pool of KafkaProducers in InternalKafkaProducerPool.scala:

  private def createKafkaProducer(paramsSeq: Seq[(String, Object)]): Producer = {
    val kafkaProducer: Producer = new Producer(paramsSeq.toMap.asJava)
    if (log.isDebugEnabled()) {
      val redactedParamsSeq = KafkaRedactionUtil.redactParams(paramsSeq)
      logDebug(s"Created a new instance of KafkaProducer for $redactedParamsSeq.")
    }
    kafkaProducer
  }

Your query is then converted into an RDD, and for each partition the rows are iterated over in KafkaWriter.scala:

  queryExecution.toRdd.foreachPartition { iter =>
    val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
    Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
      finallyBlock = writeTask.close())
  }

The actual producing of the data will happen in KafkaWriteTask:

  def execute(iterator: Iterator[InternalRow]): Unit = {
    producer = Some(InternalKafkaProducerPool.acquire(producerConfiguration))
    val internalProducer = producer.get.producer
    while (iterator.hasNext && failedWrite == null) {
      val currentRow = iterator.next()
      sendRow(currentRow, internalProducer)
    }
  }
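
Regarding the follow-up in the comments about writing the 62M records in several iterations: a minimal sketch of such a chunked write is shown below. It reuses the variables from the question (strDf, bootstrapServers, topic, the Avro and SSL settings) and assumes an arbitrary chunk count of 20; randomSplit simply divides the DataFrame into roughly equal, non-overlapping parts.

  // Sketch only: split the DataFrame into ~20 chunks and run one batch write per chunk.
  // The chunk count and all reused variables are assumptions for illustration.
  val chunks = strDf.randomSplit(Array.fill(20)(1.0))

  chunks.foreach { chunkDf =>
    chunkDf
      .select(
        to_avro($"key", lit("topic-AVRO-key"), schemaRegistryAddr).as("key"),
        to_avro($"topLevelRecord", lit("topic-AVRO-value"), schemaRegistryAddr, avroSchema).as("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
      .option("kafka.ssl.keystore.password", keystorePassword)
      .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
      .option("topic", topic)
      .save()
  }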