
I have a Spark Streaming job which reads data from Kafka partitions (one executor per partition).
I need to save the transformed values to HDFS, but I have to avoid creating empty files.
I tried using isEmpty, but it doesn't help when only some of the partitions are empty.

P.S. repartition is not an acceptable solution due to performance degradation.
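
A simplified sketch of the isEmpty attempt (`stream` and `basePath` are placeholder names, not my actual code):

  // Driver-side guard: rdd.isEmpty only looks at the whole batch RDD, so when
  // just some Kafka partitions are empty, the batch is not empty and the empty
  // partitions still produce empty part files on HDFS.
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.map(_.value).saveAsTextFile(s"$basePath/${System.currentTimeMillis}")
    }
  }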

Ruslan Ostafiichuk
  • You could use Kafka Connect instead... Then you wouldn't need to write code, and you wouldn't have empty files – OneCricketeer Nov 29 '18 at 15:56
  • @cricket_007 This could work for text data, but not for my Avro pipeline, which requires processing and multiple outputs. It now works fine with LazyOutputFormat – Ruslan Ostafiichuk Dec 01 '18 at 12:33
  • Kafka connect works fine with Avro https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html – OneCricketeer Dec 01 '18 at 16:58
  • You don't have to use the Schema Registry, either https://github.com/farmdawgnation/registryless-avro-converter – OneCricketeer Dec 01 '18 at 16:59
  • @cricket_007 I have JSON, not Avro, in Kafka. I build three outputs with different content in Avro for each message. I read the page on confluent.io after your first comment, but I still don't think it could solve my problem. – Ruslan Ostafiichuk Dec 01 '18 at 17:53
  • The HDFS Connector is also capable of accepting JSON messages with an embedded schema+payload and writing out to Avro – OneCricketeer Dec 01 '18 at 20:29
  • @cricket_007 So if I have input JSON like {"a": 1}, can I have output Avro like {"headers":{"avro_event_time":135135,"processing_hostname":"host1"},"body":{"a":1}}? In my case it's three different formats for different consumers, so I don't think it can be handled with the Kafka HDFS Connector only. – Ruslan Ostafiichuk Dec 01 '18 at 23:04
  • The input JSON would actually need to look like `{"schema": {"type": "struct", "fields": [{"name": "a", "type": "int32", "optional": false}], "name": "root", "optional": false}, "payload": {"a":1}}`, and then the Avro (or Parquet) file in HDFS would only have `{"a": 1}`. The schema is required to convert to valid Avro types during deserialization. https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas-enable-requires-schema-and-payload-fields/ You can include the Kafka EventTime with a Connect Simple Message Transform... (Or you can continue using Spark code ;) ) – OneCricketeer Dec 01 '18 at 23:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/184580/discussion-between-ruslan-ostafiichuk-and-cricket-007). – Ruslan Ostafiichuk Dec 02 '18 at 09:37

1 Answer


This works because LazyOutputFormat only creates the underlying output file once the first record is actually written, so empty partitions produce no files. Note that saveAsNewAPIHadoopFile is available for pair RDDs only.

Code for text:

  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapreduce.OutputFormat
  import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}

  val conf = ssc.sparkContext.hadoopConfiguration
  // Point LazyOutputFormat at the real output format it should delegate to.
  conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
    classOf[TextOutputFormat[Text, NullWritable]],
    classOf[OutputFormat[Text, NullWritable]])

  kafkaRdd.map(_.value -> NullWritable.get)
    .saveAsNewAPIHadoopFile(basePath,
      classOf[Text],
      classOf[NullWritable],
      classOf[LazyOutputFormat[Text, NullWritable]],
      conf)
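
In the streaming job this sits inside foreachRDD; a minimal sketch, where `stream` (a direct Kafka DStream of ConsumerRecord[String, String]) and the per-batch path layout are assumptions, not part of the answer:

  // Sketch only: `stream` is assumed to come from KafkaUtils.createDirectStream,
  // and each batch is written under a time-stamped subdirectory of basePath.
  stream.foreachRDD { (kafkaRdd, batchTime) =>
    kafkaRdd.map(_.value -> NullWritable.get)
      .saveAsNewAPIHadoopFile(s"$basePath/${batchTime.milliseconds}",
        classOf[Text],
        classOf[NullWritable],
        classOf[LazyOutputFormat[Text, NullWritable]],
        conf)
  }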

Code for avro:

  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyOutputFormat
  import org.apache.spark.rdd.RDD

  // MyEvent is the Avro-generated class; its static SCHEMA$ field is used below.
  val avro: RDD[(AvroKey[MyEvent], NullWritable)] = ....
  val conf = ssc.sparkContext.hadoopConfiguration

  conf.set("avro.schema.output.key", MyEvent.SCHEMA$.toString)
  conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
    classOf[AvroKeyOutputFormat[MyEvent]],
    classOf[OutputFormat[AvroKey[MyEvent], NullWritable]])

  avro.saveAsNewAPIHadoopFile(basePath,
    classOf[AvroKey[MyEvent]],
    classOf[NullWritable],
    classOf[LazyOutputFormat[AvroKey[MyEvent], NullWritable]],
    conf)
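
The pair RDD itself can be built from the per-batch Kafka RDD in the same foreachRDD as the text case; a sketch, where `toMyEvent` (mapping the JSON payload to the Avro class) is a hypothetical helper, not part of the answer:

  // `toMyEvent` is a hypothetical JSON -> MyEvent conversion; `kafkaRdd` is the
  // per-batch RDD of ConsumerRecord[String, String] from foreachRDD.
  val avro: RDD[(AvroKey[MyEvent], NullWritable)] =
    kafkaRdd.map(record => new AvroKey(toMyEvent(record.value)) -> NullWritable.get)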

Ruslan Ostafiichuk