I am trying to build a very simple pipeline that reads a stream of events from Kafka (KafkaIO.read) and writes the very same events to HDFS, bucketing them by hour (the hour is taken from a timestamp field of each event, not from processing time).
No assumption can be made about the event timestamps (they can span multiple days, even though 99% of the time they arrive in near real-time), and there is no information at all about the ordering of the events. My first attempt is a pipeline running in processing time.
My pipeline looks like this:
val kafkaReader = KafkaIO.read[String, String]()
  .withBootstrapServers(options.getKafkaBootstrapServers)
  .withTopic(options.getKafkaInputTopic)
  .withKeyDeserializer(classOf[StringDeserializer])
  .withValueDeserializer(classOf[StringDeserializer])
  .updateConsumerProperties(
    ImmutableMap.of("receive.buffer.bytes", Integer.valueOf(16 * 1024 * 1024))
  )
  .commitOffsetsInFinalize()
  .withoutMetadata()
val keyed = p.apply(kafkaReader)
  .apply(Values.create[String]())
  .apply(new WindowedByWatermark(options.getBatchSize))
  .apply(ParDo.of[String, CustomEvent](new CustomEvent))
val outfolder = FileSystems.matchNewResource(options.getHdfsOutputPath, true)
keyed.apply(
  "write to HDFS",
  FileIO.writeDynamic[Integer, CustomEvent]()
    .by(new SerializableFunction[CustomEvent, Integer] {
      override def apply(input: CustomEvent): Integer = {
        // truncate the event's own timestamp (epoch seconds) to the hour
        val eventZeroHoured = new Instant(input.eventTime * 1000L).toDateTime
          .withMinuteOfHour(0)
          .withSecondOfMinute(0)
        (eventZeroHoured.getMillis / 1000).toInt
      }
    })
    .via(Contextful.fn(new SerializableFunction[CustomEvent, String] {
      override def apply(input: CustomEvent): String = {
        convertEventToStr(input)
      }
    }), TextIO.sink())
    .withNaming(new SerializableFunction[Integer, FileNaming] {
      override def apply(bucket: Integer): FileNaming = {
        new BucketedFileNaming(outfolder, bucket, withTiming = true)
      }
    })
    .withDestinationCoder(VarIntCoder.of()) // destinations are Integers, so an Integer coder is needed here
    .to(options.getHdfsOutputPath)
    .withTempDirectory("hdfs://tlap/tmp/gulptmp")
    .withNumShards(1)
    .withCompression(Compression.GZIP)
)
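For context, BucketedFileNaming is a small FileIO.Write.FileNaming of mine that is not shown here; it is not the interesting part. A simplified sketch (not the exact class, the naming scheme is illustrative) would look like this:

import org.apache.beam.sdk.io.{Compression, FileIO}
import org.apache.beam.sdk.io.fs.ResourceId
import org.apache.beam.sdk.transforms.windowing.{BoundedWindow, PaneInfo}

// Simplified sketch of BucketedFileNaming: write every file under a directory
// named after its hourly bucket (epoch seconds), optionally tagging the pane timing.
// outfolder is kept only for parity with the call site; the base path is already
// provided to FileIO via .to(...).
class BucketedFileNaming(outfolder: ResourceId, bucket: Integer, withTiming: Boolean = false)
    extends FileIO.Write.FileNaming {

  override def getFilename(window: BoundedWindow,
                           pane: PaneInfo,
                           numShards: Int,
                           shardIndex: Int,
                           compression: Compression): String = {
    val timingSuffix = if (withTiming) s".${pane.getTiming.toString.toLowerCase}" else ""
    s"$bucket/part-$shardIndex-of-$numShards$timingSuffix${compression.getSuggestedSuffix}"
  }
}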
And this is my WindowedByWatermark:
class WindowedByWatermark(bucketSize: Int = 5000000) extends PTransform[PCollection[String], PCollection[String]] {

  val window: Window[String] = Window
    .into[String](FixedWindows.of(Duration.standardMinutes(10)))
    .triggering(
      AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterPane.elementCountAtLeast(bucketSize))
    )
    .withAllowedLateness(Duration.standardMinutes(30))
    .discardingFiredPanes()

  override def expand(input: PCollection[String]): PCollection[String] = {
    input.apply("window", window)
  }
}
The pipeline runs flawlessly, but it suffers from very high backpressure during the write phase (the GroupByKey introduced by writeDynamic). Most of the events arrive in real-time, so they all fall into the same hourly bucket. I also tried bucketing the data by hour and minute, without much improvement.
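The finer-grained (hour and minute) destination function was essentially the same computation with the truncation changed; roughly (the val name here is only for illustration):

import org.apache.beam.sdk.transforms.SerializableFunction
import org.joda.time.Instant

// Rough sketch of the minute-level variant: truncate the event timestamp
// (epoch seconds) to the minute instead of the hour.
val byMinute = new SerializableFunction[CustomEvent, Integer] {
  override def apply(input: CustomEvent): Integer = {
    val truncatedToMinute = new Instant(input.eventTime * 1000L).toDateTime.withSecondOfMinute(0)
    (truncatedToMinute.getMillis / 1000).toInt
  }
}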
After days of pain, I decided to replicate the same pipeline in Flink using a bucketingSink, and the performance is excellent.
val stream = env
  .addSource(new FlinkKafkaConsumer011[String](options.kafkaInputTopic, new SimpleStringSchema(), properties))
  .addSink(bucketingSink(options.hdfsOutputPath, options.batchSize))
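Here bucketingSink is just a thin helper around Flink's BucketingSink. A rough sketch of it, under the assumption that it wraps a time-based bucketer (note that DateTimeBucketer buckets by processing time, so bucketing on the event's own timestamp would need a custom Bucketer):

import org.apache.flink.streaming.connectors.fs.StringWriter
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

// Rough sketch of the bucketingSink helper. DateTimeBucketer buckets by wall-clock
// time; bucketing on the event's own timestamp field would need a custom Bucketer.
// setBatchSize is the part-file roll-over threshold in bytes.
def bucketingSink(outputPath: String, batchSize: Long): BucketingSink[String] = {
  new BucketingSink[String](outputPath)
    .setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HH"))
    .setWriter(new StringWriter[String]())
    .setBatchSize(batchSize)
}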
According to my analysis (also confirmed via JMX), the Beam threads spend their time waiting during the write phase to HDFS, which causes the pipeline to pause the retrieval of data from Kafka.
I therefore have the following questions:

- Is it possible to push the bucketing down in Beam the way the bucketingSink does?
- Is there a smarter way to achieve the same result in Beam?