
I have installed Kafka locally (no cluster/Schema Registry for now) and am trying to produce to an Avro topic; below is the schema associated with that topic.

{
  "type" : "record",
  "name" : "Customer",
  "namespace" : "com.example.Customer",
  "doc" : "Class: Customer",
  "fields" : [ {
    "name" : "name",
    "type" : "string",
    "doc" : "Variable: Customer Name"
  }, {
    "name" : "salary",
    "type" : "double",
    "doc" : "Variable: Customer Salary"
  } ]
}

I would like to create a simple Spark producer app to create some data based on the above schema and publish it to Kafka. I'm thinking of creating sample data, converting it to a DataFrame, then converting that to Avro and publishing it.

val df = spark.createDataFrame(<<data>>)

And then, something like below:

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers","localhost:9092")
  .option("topic","customer_avro_topic")
  .save()

Attaching the schema to this Avro topic can be done manually for now.

Can this be done just by using Apache Spark APIs instead of the Java/Kafka APIs? This is for batch processing rather than streaming.

Leibnitz

1 Answer


I don't think this is directly possible, because the Kafka writer in Spark only looks at key and value columns, which must be strings or byte arrays (the value is required; the key is optional).
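
For reference, a minimal sketch of the column shape the Kafka batch writer wants, using the broker and topic from the question; the value here is just an illustrative JSON string rather than Avro bytes:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-shape-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The Kafka sink only reads the "key" and "value" columns; "value" is required,
// "key" is optional, and both must be strings or byte arrays.
val df = Seq(
  ("alice", """{"name":"alice","salary":1000.0}"""),
  ("bob",   """{"name":"bob","salary":2000.0}""")
).toDF("key", "value")

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "customer_avro_topic")
  .save()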

If you read an existing Avro file from disk, the Avro DataFrame reader will likely give you two columns, name and salary. You would therefore need to construct a single value column containing the whole Avro record, drop the other columns, and serialize that record to a byte array, for example with a library like Bijection, since you're not using the Schema Registry.
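
A rough sketch of that flow, assuming an existing SparkSession named spark, the spark-avro reader (format "avro" in 2.4+), and Twitter's bijection-avro on the classpath; the file path is made up, and the schema string is the one from the question with the doc fields dropped:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.spark.sql.functions.{col, udf}
import com.twitter.bijection.avro.GenericAvroCodecs

val schemaString =
  """{"type":"record","name":"Customer","namespace":"com.example.Customer",
    |"fields":[{"name":"name","type":"string"},{"name":"salary","type":"double"}]}""".stripMargin

// Reading the Avro file yields separate "name" and "salary" columns.
val customers = spark.read.format("avro").load("/tmp/customers.avro")

// Serialize each row into a single Avro-encoded byte array with Bijection.
// Schema and Injection are built inside the UDF so only the schema string is shipped to executors.
val toAvroBytes = udf { (name: String, salary: Double) =>
  val schema = new Schema.Parser().parse(schemaString)
  val injection = GenericAvroCodecs.toBinary[GenericRecord](schema)
  val record = new GenericData.Record(schema)
  record.put("name", name)
  record.put("salary", salary)
  injection(record)
}

customers
  .select(toAvroBytes(col("name"), col("salary")).as("value")) // keep only the "value" column
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "customer_avro_topic")
  .save()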

If you want to generate data and don't have a file, then you'd need to build a list of Tuple2 objects holding the Kafka message key and value as byte arrays, parallelize those into an RDD, and then convert that into a DataFrame. But at that point, just using the regular Kafka Producer API is much simpler.
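
A rough sketch of that variant (the sample rows and names are purely illustrative; the same Bijection encoding as above):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import com.twitter.bijection.avro.GenericAvroCodecs

val schemaString =
  """{"type":"record","name":"Customer","namespace":"com.example.Customer",
    |"fields":[{"name":"name","type":"string"},{"name":"salary","type":"double"}]}""".stripMargin
val schema = new Schema.Parser().parse(schemaString)
val injection = GenericAvroCodecs.toBinary[GenericRecord](schema)

// Build (key, value) tuples on the driver, with the value already Avro-encoded.
val sample: Seq[(Array[Byte], Array[Byte])] =
  Seq(("alice", 1000.0), ("bob", 2000.0)).map { case (name, salary) =>
    val record = new GenericData.Record(schema)
    record.put("name", name)
    record.put("salary", salary)
    (name.getBytes("UTF-8"), injection(record))
  }

import spark.implicits._
val kafkaReady = spark.sparkContext.parallelize(sample).toDF("key", "value")

kafkaReady.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "customer_avro_topic")
  .save()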

Plus, if you already know your schema, try the project mentioned in Ways to generate test data in Kafka.

OneCricketeer
  • Thanks @cricket_007. OK, well then, assuming I already have this `customer-avro-topic` with the associated schema `customer.avsc` on Kafka, would the reverse (consuming) at least be possible using the Spark consumer APIs? I also saw the functions `from_avro` and `to_avro` since version `2.4.0` (not sure if these are stable enough to support all Avro features). If you say even this is not directly possible, then that means we can't simply rely on Spark APIs for consuming/producing Avro topics from Kafka, if I'm not wrong. Do you think so? – Leibnitz Apr 21 '19 at 16:52
  • I believe those functions are meant for Avro files, not Kafka events; but yes, you'd need to deserialize the bytes to Avro objects – OneCricketeer Apr 21 '19 at 20:00
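
For what it's worth, a rough sketch of that batch consuming side, deserializing the value bytes back into records with Bijection (the same illustrative schema string as above, not `from_avro`; column and variable names are made up):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.functions.{col, udf}
import com.twitter.bijection.avro.GenericAvroCodecs

val schemaString =
  """{"type":"record","name":"Customer","namespace":"com.example.Customer",
    |"fields":[{"name":"name","type":"string"},{"name":"salary","type":"double"}]}""".stripMargin

// Decode the raw Kafka value bytes back into (name, salary) pairs.
val fromAvroBytes = udf { (value: Array[Byte]) =>
  val schema = new Schema.Parser().parse(schemaString)
  val injection = GenericAvroCodecs.toBinary[GenericRecord](schema)
  val record = injection.invert(value).get
  (record.get("name").toString, record.get("salary").asInstanceOf[Double])
}

val customers = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "customer_avro_topic")
  .load()
  .select(fromAvroBytes(col("value")).as("customer"))
  .selectExpr("customer._1 AS name", "customer._2 AS salary")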