I have a Databricks Kafka producer that needs to write 62M records to a Kafka topic. Will there be an issue if I write all 62M records in a single write, or do I need to iterate, say, 20 times and write roughly 3M records per iteration?
Here is the code.
Cmd1 val srcDf = spark.read.format("delta")
  .load("/mnt/data-lake/data/silver/geocodes")
  .filter($"LastUpdateDt" === lastUpdateDt)
Cmd2 val strDf = srcDf
  .withColumn("key", ...)
  .withColumn("topLevelRecord", ...)
Cmd3 strDf
  .select(
    to_avro($"key", lit("topic-AVRO-key"), schemaRegistryAddr).as("key"),
    to_avro($"topLevelRecord", lit("topic-AVRO-value"), schemaRegistryAddr, avroSchema).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
  .option("kafka.ssl.keystore.password", keystorePassword)
  .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
  .option("topic", topic)
  .save()
My question is: if strDf.count is 62M, can I write it to Kafka directly, or do I need to iterate Cmd3 in smaller batches?
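For reference, this is roughly what I imagine the iterated version of Cmd3 would look like: split strDf into hash buckets and write one bucket per pass. The bucket count of 20, the hash-based split, and the persist call are just assumptions to illustrate the idea, not code I currently have.

import org.apache.spark.sql.functions.{abs, hash, lit}
// to_avro with the Schema Registry arguments is the same Databricks Runtime
// function already used in Cmd3
import org.apache.spark.sql.avro.functions.to_avro

// Assumed bucket count: 62M / 20 is roughly 3M records per pass
val numBuckets = 20

// Derive a stable bucket id from the key column; persist so the Delta source
// is not re-read and re-transformed on every iteration
val bucketedDf = strDf
  .withColumn("bucket", abs(hash($"key")) % numBuckets)
  .persist()

(0 until numBuckets).foreach { b =>
  bucketedDf
    .filter($"bucket" === b)
    .select(
      to_avro($"key", lit("topic-AVRO-key"), schemaRegistryAddr).as("key"),
      to_avro($"topLevelRecord", lit("topic-AVRO-value"), schemaRegistryAddr, avroSchema).as("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
    .option("kafka.ssl.keystore.password", keystorePassword)
    .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
    .option("topic", topic)
    .save()
}

bucketedDf.unpersist()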