
I am trying to produce a dataframe to a Kafka topic using Spark's Kafka integration in Java.

I am able to produce the data by iterating over the rows of the dataframe, extracting the key and value columns, and sending each record as below:

Map<String, Object> kafkaParameters = new HashMap<>();
kafkaParameters.put(<All Kafka Params>);

finalDataframe.foreach(row -> {
    // A new producer is created for every single row
    Producer<String, String> producer = new KafkaProducer<>(kafkaParameters);
    ProducerRecord<String, String> producerRec = new ProducerRecord<>(
            "<TOPIC_NAME>",
            row.getAs("columnNameForMsgKey"),
            row.getAs("columnNameForMsgValue"));
    producer.send(producerRec);
    producer.close();
});

I do not want to use this approach, because it creates a new Producer instance for every row, which hurts performance since the dataset is huge.
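One alternative that comes up in the comments below is to create one producer per partition with foreachPartition instead of one per row. A rough, untested sketch of that idea, reusing the same kafkaParameters, placeholder topic name and column names as above:

// Sketch only, not tested: one KafkaProducer per partition instead of one per row.
// Extra imports needed: org.apache.spark.api.java.function.ForeachPartitionFunction,
// org.apache.spark.sql.Row
finalDataframe.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    // Created once per partition, on the executor that processes it
    Producer<String, String> producer = new KafkaProducer<>(kafkaParameters);
    while (rows.hasNext()) {
        Row row = rows.next();
        producer.send(new ProducerRecord<>(
                "<TOPIC_NAME>",
                row.<String>getAs("columnNameForMsgKey"),
                row.<String>getAs("columnNameForMsgValue")));
    }
    producer.flush();  // make sure buffered records are sent before the partition finishes
    producer.close();
});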

Rather than managing producers myself, I tried writing the entire dataframe in one go using the built-in Kafka data source, as below:

finalDataframe.selectExpr("CAST(columnNameForMsgKey AS STRING) as key", "CAST(columnNameForMsgValue AS STRING) as value")
        .write()
        .format("kafka")
        .option("kafka.bootstrap.servers", "<SERVER_NAMES>")
        .option("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        .option("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        .option("security.protocol", "SASL_PLAINTEXT")
        .option("sasl.kerberos.service.name", "kafka")
        .option("sasl.mechanism", "GSSAPI")
        .option("acks", "all")
        .option("topic", "<TOPIC_NAME>")
        .save();

But this method throws the exception below:

org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata

The entire stack trace is:
20/02/01 23:04:30 INFO SparkContext: SparkContext already stopped.
20/02/01 23:04:30 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 131 in stage 266.0 failed 4 times, most recent failure: Lost task 131.3 in stage 266.0 (TID 4664, servername.com, executor 1): org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 131 in stage 266.0 failed 4 times, most recent failure: Lost task 131.3 in stage 266.0 (TID 4664, servername.com, executor 1): org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
    at org.apache.spark.sql.kafka010.KafkaWriter$.write(KafkaWriter.scala:87)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider.createRelation(KafkaSourceProvider.scala:206)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:264)
    at CustomProducer.main(CustomProducer.java:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.
20/02/01 23:04:30 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 131 in stage 266.0 failed 4 times, most recent failure: Lost task 131.3 in stage 266.0 (TID 4664, servername.com, executor 1): org.apache.kafka.common.errors.TimeoutException: Topic <TOPIC_NAME> not present in metadata after 60000 ms.

Please help me find the issue, or suggest an alternative way to produce the entire dataframe to the topic instead of producing it row by row.

N.B. The Kafka message key and value to be produced are present as two separate columns in finalDataframe.
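In other words, the selectExpr above just casts and renames those two columns to the key and value column names the Kafka writer expects; I believe the same mapping could also be written with the Column API, e.g.:

// Equivalent column mapping with the Column API (sketch; same placeholder column names as above).
// Extra import needed: static org.apache.spark.sql.functions.col
Dataset<Row> kafkaReady = finalDataframe.select(
        col("columnNameForMsgKey").cast("string").alias("key"),
        col("columnNameForMsgValue").cast("string").alias("value"));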

Thanks

    Does topic `TOPIC_NAME` exist? `bin/kafka-topics.sh --list --zookeeper :2181` Also, can you share the full error trace and the content of `server.properties`? – Giorgos Myrianthous Feb 03 '20 at 22:24
  • You're supposed to use foreachPartition, create the producer, then for each RDD in the partition, send the record... And sending a whole dataframe isn't what you want either. It's better to do the rows – OneCricketeer Feb 03 '20 at 23:31
  • @GiorgosMyrianthous Yes the topic name exists, otherwise the first method(iterating over the rows) would not have produced data to the topic, even i am able to consume from the same topic. But producing the dataframe is giving the mentioned error. Also i have included the full stacktrace in my question for reference – user1326784 Feb 04 '20 at 01:17
  • @cricket_007 As apache spark dataframe already provides an inbuilt method, i want to use it and produce the dataframe directly to the topic, instead of converting to RDD and then doing foreach and extracting the key and value and producing it, which adds more steps. Also even in case i use foreachPartition, ultimately i will be iterating over the rows and producing record by record, not by partition – user1326784 Feb 04 '20 at 01:27
  • The dataframe writer also goes row by row. It's not clear what you're expecting otherwise. Kafka has a default max message size of 1MB, and if your data is smaller than that, you probably don't need Spark. Anyways, please show the command you use to verify topic existence – OneCricketeer Feb 04 '20 at 05:25
  • @cricket_007 Do you mean there will be one Producer instance created for each row/record/message irrespective of any method we use to produce the data and we cannot write a whole rdd or dataframe to the topic using Kafka producer? Because in my first method while looping through the RDD, a new Producer instance is getting created for each record.Please advise. Also this is the command i used to verify the topic existence and it exists: $ /bin/kafka-topics.sh --list --zookeeper :2181||grep -i – user1326784 Feb 05 '20 at 00:16
  • No, one Producer instance will be made per partition ([when you use `foreachPartition`](https://stackoverflow.com/a/40501663/2308683)). A Spark partition exists in a single JVM instance. Many partitions are spread over multiple machines and CPU cores. When you do `partition.foreachRDD`, then you iterate over the "rows" of the RDD and send them to Kafka. And if you are on a recent kafka version, can you use `--bootstrap-server` in that command? – OneCricketeer Feb 05 '20 at 00:18
  • @cricket_007 can you please share an example of using partition.foreachRDD, where i can extract the key and value from the dataframe, because my final result is a dataframe and it has two columns- one for kafka message key and other column for kafka message value – user1326784 Feb 05 '20 at 03:43
  • Well, if you do df.toRDD, how would you access the columns there? Shouldn't be any different. Either way, that doesn't solve the error in your post... – OneCricketeer Feb 05 '20 at 03:52
  • @cricket_007 That is what my question is. Either i need a solution for extracting the key and value from dataframe, loop through the partitions and produce each row OR resolve the issue to produce the dataframe directly.(As per your statement, i understand that kafka can write only one row to the producer at a time, not an entire rdd or partition) – user1326784 Feb 05 '20 at 20:39
  • Try `df.foreachPartition((partitions: Iterator[Row])`, then you can do `row.getAs[String]("key")`. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row – OneCricketeer Feb 05 '20 at 20:53

0 Answers