
I have a Spark Streaming job which reads data from Kafka partitions (one executor per partition).
I need to save the transformed values to HDFS, but I have to avoid creating empty files.
I tried using isEmpty, but it doesn't help when only some of the partitions are empty.

P.S. repartition is not an acceptable solution due to performance degradation.
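
A simplified sketch of the isEmpty attempt (`stream` and `basePath` are placeholder names, not my actual code):

  // Driver-side guard: rdd.isEmpty only looks at the whole batch RDD, so when
  // just some Kafka partitions are empty, the batch is not empty and the empty
  // partitions still produce empty part files on HDFS.
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.map(_.value).saveAsTextFile(s"$basePath/${System.currentTimeMillis}")
    }
  }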

Ruslan Ostafiichuk
  • You could use Kafka Connect instead... Then you wouldn't need to write code, and you wouldn't have empty files – OneCricketeer Nov 29 '18 at 15:56
  • @cricket_007 This could work for text data, but not for my Avro pipeline, which requires processing and multiple outputs. It now works fine with LazyOutputFormat – Ruslan Ostafiichuk Dec 01 '18 at 12:33
  • Kafka connect works fine with Avro https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html – OneCricketeer Dec 01 '18 at 16:58
  • You don't have to use the Schema Registry, either https://github.com/farmdawgnation/registryless-avro-converter – OneCricketeer Dec 01 '18 at 16:59
  • @cricket_007 I have JSON, not Avro, in Kafka. I build three outputs with different content in Avro for each message. I read the page on confluent.io after your first comment, but I still don't think it could solve my problem. – Ruslan Ostafiichuk Dec 01 '18 at 17:53
  • The HDFS Connector is also capable of accepting JSON messages with an embedded schema+payload and writing out to Avro – OneCricketeer Dec 01 '18 at 20:29
  • @cricket_007 So if I have input JSON like {"a": 1}, can I have output Avro like {"headers":{"avro_event_time":135135,"processing_hostname":"host1"},"body":{"a":1}}? In my case it's three different formats for different consumers, so I don't think it can be handled with the Kafka HDFS Connector only. – Ruslan Ostafiichuk Dec 01 '18 at 23:04
  • The input JSON would actually need to look like `{"schema": {"type": "struct", "fields": [{"name": "a", "type": "int32", "optional": false}], "name": "root", "optional": false}, "payload": {"a":1}}`, and then the Avro (or Parquet) file in HDFS would only have `{"a": 1}`. The schema is required to convert to valid Avro types during deserialization. https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas-enable-requires-schema-and-payload-fields/ You can include the Kafka EventTime with a Connect Simple Message Transform... (Or you can continue using Spark code ;) ) – OneCricketeer Dec 01 '18 at 23:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/184580/discussion-between-ruslan-ostafiichuk-and-cricket-007). – Ruslan Ostafiichuk Dec 02 '18 at 09:37

1 Answer


This works because LazyOutputFormat only creates the underlying output file once the first record is actually written, so empty partitions produce no files. Note that saveAsNewAPIHadoopFile is available for pair RDDs only.

Code for text:

  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapreduce.OutputFormat
  import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}

  val conf = ssc.sparkContext.hadoopConfiguration
  // Point LazyOutputFormat at the real output format it should delegate to.
  conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
    classOf[TextOutputFormat[Text, NullWritable]],
    classOf[OutputFormat[Text, NullWritable]])

  kafkaRdd.map(_.value -> NullWritable.get)
    .saveAsNewAPIHadoopFile(basePath,
      classOf[Text],
      classOf[NullWritable],
      classOf[LazyOutputFormat[Text, NullWritable]],
      conf)
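
In the streaming job this sits inside foreachRDD; a minimal sketch, where `stream` (a direct Kafka DStream of ConsumerRecord[String, String]) and the per-batch path layout are assumptions, not part of the answer:

  // Sketch only: `stream` is assumed to come from KafkaUtils.createDirectStream,
  // and each batch is written under a time-stamped subdirectory of basePath.
  stream.foreachRDD { (kafkaRdd, batchTime) =>
    kafkaRdd.map(_.value -> NullWritable.get)
      .saveAsNewAPIHadoopFile(s"$basePath/${batchTime.milliseconds}",
        classOf[Text],
        classOf[NullWritable],
        classOf[LazyOutputFormat[Text, NullWritable]],
        conf)
  }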

Code for avro:

  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyOutputFormat
  import org.apache.spark.rdd.RDD

  // MyEvent is the Avro-generated class; its static SCHEMA$ field is used below.
  val avro: RDD[(AvroKey[MyEvent], NullWritable)] = ....
  val conf = ssc.sparkContext.hadoopConfiguration

  conf.set("avro.schema.output.key", MyEvent.SCHEMA$.toString)
  conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
    classOf[AvroKeyOutputFormat[MyEvent]],
    classOf[OutputFormat[AvroKey[MyEvent], NullWritable]])

  avro.saveAsNewAPIHadoopFile(basePath,
    classOf[AvroKey[MyEvent]],
    classOf[NullWritable],
    classOf[LazyOutputFormat[AvroKey[MyEvent], NullWritable]],
    conf)
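
The pair RDD itself can be built from the per-batch Kafka RDD in the same foreachRDD as the text case; a sketch, where `toMyEvent` (mapping the JSON payload to the Avro class) is a hypothetical helper, not part of the answer:

  // `toMyEvent` is a hypothetical JSON -> MyEvent conversion; `kafkaRdd` is the
  // per-batch RDD of ConsumerRecord[String, String] from foreachRDD.
  val avro: RDD[(AvroKey[MyEvent], NullWritable)] =
    kafkaRdd.map(record => new AvroKey(toMyEvent(record.value)) -> NullWritable.get)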

Ruslan Ostafiichuk