I am trying to produce a DataFrame to a Kafka topic using Spark's Kafka integration in Java.
I am able to produce the data by iterating over the rows of the DataFrame, extracting the key column and value column from each row, and sending them as below:
Map<String, Object> kafkaParameters = new HashMap<>();
kafkaParameters.put(<All Kafka Params>);

finalDataframe.foreach(row -> {
    Producer<String, String> producer = new KafkaProducer<String, String>(kafkaParameters);
    ProducerRecord<String, String> producerRec = new ProducerRecord<>("<TOPIC_NAME>",
            row.getAs("columnNameForMsgKey"), row.getAs("columnNameForMsgValue"));
    producer.send(producerRec);
});
I do not want to use this approach, because it creates a new Producer instance for every row, which will hurt performance since the dataset is huge.
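I know I could at least amortize the producer creation with foreachPartition (one producer per partition rather than per row). A rough, untested sketch of what I mean, reusing the same kafkaParameters map as above:

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.sql.Row;

// Untested sketch: one producer per partition instead of per row.
// kafkaParameters is the same map as above; since it is captured by the
// lambda, it must be serializable (a HashMap is).
finalDataframe.toJavaRDD().foreachPartition(rows -> {
    Producer<String, String> producer = new KafkaProducer<>(kafkaParameters);
    while (rows.hasNext()) {
        Row row = rows.next();
        ProducerRecord<String, String> rec = new ProducerRecord<>("<TOPIC_NAME>",
                row.<String>getAs("columnNameForMsgKey"),
                row.<String>getAs("columnNameForMsgValue"));
        producer.send(rec);
    }
    producer.flush();
    producer.close();
});

But I would prefer not to manage producers myself at all.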
Instead, I tried writing the entire DataFrame in one go with the built-in Kafka sink:
finalDataframe.selectExpr("CAST(columnNameForMsgKey AS STRING) as key",
        "CAST(columnNameForMsgValue AS STRING) as value")
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "<SERVER_NAMES>")
    .option("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    .option("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    .option("security.protocol", "SASL_PLAINTEXT")
    .option("sasl.kerberos.service.name", "kafka")
    .option("sasl.mechanism", "GSSAPI")
    .option("acks", "all")
    .option("topic", "<TOPIC_NAME>")
    .save();
But this method throws the exception below:
org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata
The full stack trace is:
20/02/01 23:04:30 INFO SparkContext: SparkContext already stopped.
20/02/01 23:04:30 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 131 in stage 266.0 failed 4 times, most recent failure: Lost task 131.3 in stage 266.0 (TID 4664, servername.com, executor 1): org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 131 in stage 266.0 failed 4 times, most recent failure: Lost task 131.3 in stage 266.0 (TID 4664, servername.com, executor 1): org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at org.apache.spark.sql.kafka010.KafkaWriter$.write(KafkaWriter.scala:87)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createRelation(KafkaSourceProvider.scala:206)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:264)
at CustomProducer.main(CustomProducer.java:508)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.
20/02/01 23:04:30 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 131 in stage 266.0 failed 4 times, most recent failure: Lost task 131.3 in stage 266.0 (TID 4664, servername.com, executor 1): org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.
Please help me find the issue, or suggest an alternative way to produce the entire DataFrame to the topic instead of producing it row by row.
N.B. The Kafka message key and value to be produced are present as two separate columns in finalDataframe.
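For context, a minimal, hypothetical stand-in for how finalDataframe is shaped (the real one comes from upstream transformations; spark is an existing SparkSession, and only the two-string-column layout matters here):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical stand-in: one string column for the message key,
// one string column for the message value.
StructType schema = new StructType()
        .add("columnNameForMsgKey", DataTypes.StringType)
        .add("columnNameForMsgValue", DataTypes.StringType);
List<Row> sampleRows = Arrays.asList(
        RowFactory.create("key1", "value1"),
        RowFactory.create("key2", "value2"));
Dataset<Row> finalDataframe = spark.createDataFrame(sampleRows, schema);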
Thanks