
I am trying to read data from a Hive table through a SparkSession and publish it to a Kafka topic. I am using the following piece of code:

    import java.io.ByteArrayOutputStream
    import java.util.Properties

    import scala.io.Source

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.io._
    import org.apache.avro.specific.SpecificDatumWriter
    import org.apache.kafka.clients.producer._
    import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}
    import org.apache.spark.sql.SparkSession

    object Producer extends Serializable {

      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

        // The Avro schema is parsed on the driver and captured by the map closure below
        val schemaJson = Source.fromFile("file").mkString
        val schema = new Schema.Parser().parse(schemaJson)

        val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

        import spark.implicits._
        val df = spark.sql("select * from table")

        df.rdd.map { value =>
          // A new producer is created for every row
          val prod = new KafkaProducer[String, Array[Byte]](props)

          val records = new GenericData.Record(schema)
          records.put("col1", value.getString(1))
          records.put("col2", value.getString(2))
          records.put("col3", value.getString(3))
          records.put("col4", value.getString(4))

          // Serialize the record to Avro binary
          val writer = new SpecificDatumWriter[GenericRecord](schema)
          val out = new ByteArrayOutputStream()
          val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
          writer.write(records, encoder)
          encoder.flush()
          out.close()

          val serializedBytes: Array[Byte] = out.toByteArray()
          val record = new ProducerRecord("topic", value.getString(1), serializedBytes)
          val data = prod.send(record)

          prod.flush()
          prod.close()
        }

        spark.close()
      }
    }

And the following error is thrown when I execute it:

    Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
    Serialization stack:
        - object not serializable (class: org.apache.avro.Schema$RecordSchema, value: {"type":"record","name":"data","namespace":"com.data.record","fields":[{"name":"col1","type":"string"},{"name":"col2","type":"string"},{"name":"col3","type":"string"},{"name":"col4","type":"string"}]})
        - field (class: scala.runtime.ObjectRef, name: elem, type: class java.lang.Object)
        - object (class scala.runtime.ObjectRef, {"type":"record","name":"data","namespace":"com.data.record","fields":[{"name":"col1","type":"string"},{"name":"col2","type":"string"},{"name":"col3","type":"string"},{"name":"col4","type":"string"}]})
        - field (class: com.kafka.driver.KafkaProducer.Producer$$anonfun$main$1, name: schema$1, type: class scala.runtime.ObjectRef)

However, it runs fine when I pull the dataset back to the driver with df.rdd.collect.foreach. Instead, I need to publish the messages from the executors across the cluster, which is why I am using rdd.map. I am not sure what exactly I am missing here that causes this error. Any help towards resolving this would be highly appreciated, thanks!
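For context, the collect-based variant that does run looks roughly like the sketch below; since collect brings every row back to the driver, the schema and the producer are only ever used in the driver JVM and nothing has to be serialized, but all parallelism on the producing side is lost. The sketch reuses props, schema, and df from the code above and stands in for the df.rdd.map block:

    // Driver-side variant: everything stays in the driver JVM
    val prod = new KafkaProducer[String, Array[Byte]](props)
    df.rdd.collect.foreach { value =>
      val records = new GenericData.Record(schema)
      records.put("col1", value.getString(1))
      records.put("col2", value.getString(2))
      records.put("col3", value.getString(3))
      records.put("col4", value.getString(4))

      // Same Avro binary encoding as above
      val writer = new SpecificDatumWriter[GenericRecord](schema)
      val out = new ByteArrayOutputStream()
      val encoder = EncoderFactory.get().binaryEncoder(out, null)
      writer.write(records, encoder)
      encoder.flush()
      out.close()

      prod.send(new ProducerRecord("topic", value.getString(1), out.toByteArray))
    }
    prod.flush()
    prod.close()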

  • Why do you need to map the RDD? Are you required to use Avro? You're just calling send on the Producer, not changing the RDD content. Also, Hive tables are typically the destination of Kafka data, not the source – OneCricketeer Nov 18 '19 at 00:31
  • @cricket_007, if I don't change the Dataset to an RDD, it throws an **Unable to find encoder for type stored in a Dataset** exception. And yes, I am required to use the Avro format. Also, in my use case the source is a Hive table, from which I am reading data and pushing it to a Kafka topic. – Mohit Sudhera Nov 18 '19 at 02:12
  • I guess my point was that you could just use JDBC to read from Hive, then use a standard Kafka producer. Spark may have libraries included for that, but it's a bit overkill for such a simple use case, plus doesn't need to be distributed – OneCricketeer Nov 18 '19 at 07:35
  • Lunatech recently published a blog article which I believe would help solve your issue: https://www.lunatech.com/blog/Xc51ORQAACEAev0k/lessons-learned-using-spark-structured-streaming – Gaarv Nov 18 '19 at 13:22
  • @cricket_007, since the source is a Hive table with more than 1000 columns and a huge number of records (~2M a day need to be processed), I wanted to achieve as much parallelism as possible while transporting the data over the network, which is why I am using Spark. The issue has been resolved though. – Mohit Sudhera Nov 20 '19 at 15:19
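For completeness, here is a minimal sketch of the JDBC-plus-plain-producer approach suggested in the comments above, assuming a HiveServer2 endpoint is reachable over JDBC. The broker address, JDBC URL, and query are placeholders, the Hive JDBC driver must be on the classpath, and the Avro handling is the same as in the question:

    import java.io.ByteArrayOutputStream
    import java.sql.DriverManager
    import java.util.Properties

    import scala.io.Source

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.EncoderFactory
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
    import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

    object JdbcAvroProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")    // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)
        val producer = new KafkaProducer[String, Array[Byte]](props)

        val schema = new Schema.Parser().parse(Source.fromFile("file").mkString)
        val writer = new GenericDatumWriter[GenericRecord](schema)

        // Single JVM, no Spark: stream rows out of HiveServer2 over JDBC
        val conn = DriverManager.getConnection("jdbc:hive2://host:10000/db") // placeholder URL
        val rs = conn.createStatement().executeQuery("select col1, col2, col3, col4 from table")
        while (rs.next()) {
          val record = new GenericData.Record(schema)
          (1 to 4).foreach(i => record.put(s"col$i", rs.getString(i)))

          // Avro binary encoding, as in the question
          val out = new ByteArrayOutputStream()
          val encoder = EncoderFactory.get().binaryEncoder(out, null)
          writer.write(record, encoder)
          encoder.flush()

          producer.send(new ProducerRecord("topic", rs.getString(1), out.toByteArray))
        }

        producer.flush()
        producer.close()
        conn.close()
      }
    }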

1 Answer


Figured out that the Schema and KafkaProducer objects need to be created on the executors themselves, so that they are never captured in the task closure and therefore never have to be serialized. To do that, I modified the code above to parse the schema and build the producer inside foreachPartition:

    import java.io.ByteArrayOutputStream
    import java.util.Properties

    import scala.io.Source

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.io._
    import org.apache.avro.specific.SpecificDatumWriter
    import org.apache.kafka.clients.producer._
    import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}
    import org.apache.spark.sql.SparkSession

    object Producer extends Serializable {

      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

        val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

        import spark.implicits._
        val df = spark.sql("select * from table")

        df.foreachPartition { rows =>
          // Created once per partition, on the executor, so neither the producer
          // nor the schema ever has to be serialized with the task closure
          val prod = new KafkaProducer[String, Array[Byte]](props)
          val schema = new Schema.Parser().parse(Source.fromFile("file").mkString)

          rows.foreach { value =>
            val records = new GenericData.Record(schema)
            records.put("col1", value.getString(1))
            records.put("col2", value.getString(2))
            records.put("col3", value.getString(3))
            records.put("col4", value.getString(4))

            // Serialize the record to Avro binary
            val writer = new SpecificDatumWriter[GenericRecord](schema)
            val out = new ByteArrayOutputStream()
            val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
            writer.write(records, encoder)
            encoder.flush()
            out.close()

            val serializedBytes: Array[Byte] = out.toByteArray()
            val record = new ProducerRecord("topic", value.getString(1), serializedBytes)
            prod.send(record)
          }

          prod.flush()
          prod.close()
        }

        spark.close()
      }
    }
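For reference, the same Hive-to-Kafka pipeline can also be expressed with Spark's built-in Kafka sink plus the spark-avro helper, which is roughly the direction the Structured Streaming comment points at. This is only a sketch, assuming Spark 2.4+ with the spark-sql-kafka-0-10 and spark-avro packages on the classpath; the broker address is a placeholder, and note that to_avro derives the Avro schema from the DataFrame columns instead of reading it from the schema file:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.avro._          // to_avro, from spark-avro 2.4+
    import org.apache.spark.sql.functions._

    object AvroKafkaSink {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

        val df = spark.sql("select * from table")

        // Kafka sink expects a "key" (string/binary) and a "value" (string/binary) column;
        // here the value is the Avro-encoded struct of the selected columns
        df.select(
            col("col1").cast("string").as("key"),
            to_avro(struct(col("col1"), col("col2"), col("col3"), col("col4"))).as("value"))
          .write
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
          .option("topic", "topic")
          .save()

        spark.close()
      }
    }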