I am trying to build a very simple pipeline that reads a stream of events from Kafka (KafkaIO.read) and writes the very same events to HDFS, bucketing them by hour (the hour is taken from a timestamp field of each event, not from processing time).
No assumption can be made about the event timestamps (they can span multiple days, even though 99% of the time they arrive in near real-time), and there is no information at all about the ordering of the events. My first attempt is a pipeline running in processing time.
My pipeline looks like this:
val kafkaReader = KafkaIO.read[String, String]()
  .withBootstrapServers(options.getKafkaBootstrapServers)
  .withTopic(options.getKafkaInputTopic)
  .withKeyDeserializer(classOf[StringDeserializer])
  .withValueDeserializer(classOf[StringDeserializer])
  .updateConsumerProperties(
    ImmutableMap.of("receive.buffer.bytes", Integer.valueOf(16 * 1024 * 1024))
  )
  .commitOffsetsInFinalize()
  .withoutMetadata()
val keyed = p.apply(kafkaReader)
  .apply(Values.create[String]())
  .apply(new WindowedByWatermark(options.getBatchSize))
  .apply(ParDo.of[String, CustomEvent](new CustomEvent))
val outfolder = FileSystems.matchNewResource(options.getHdfsOutputPath, true)
keyed.apply(
  "write to HDFS",
  FileIO.writeDynamic[Integer, CustomEvent]()
    .by(new SerializableFunction[CustomEvent, Integer] {
      override def apply(input: CustomEvent): Integer = {
        // truncate the event's own timestamp (epoch seconds) to the hour
        val eventZeroHoured = new Instant(input.eventTime * 1000L).toDateTime
          .withMinuteOfHour(0)
          .withSecondOfMinute(0)
        (eventZeroHoured.getMillis / 1000).toInt
      }
    })
    .via(Contextful.fn(new SerializableFunction[CustomEvent, String] {
      override def apply(input: CustomEvent): String = {
        convertEventToStr(input)
      }
    }), TextIO.sink())
    .withNaming(new SerializableFunction[Integer, FileNaming] {
      override def apply(bucket: Integer): FileNaming = {
        new BucketedFileNaming(outfolder, bucket, withTiming = true)
      }
    })
    .withDestinationCoder(VarIntCoder.of()) // destinations are Integers, so an Integer coder is needed here
    .to(options.getHdfsOutputPath)
    .withTempDirectory("hdfs://tlap/tmp/gulptmp")
    .withNumShards(1)
    .withCompression(Compression.GZIP)
)
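For context, BucketedFileNaming is a small FileIO.Write.FileNaming of mine that is not shown here; it is not the interesting part. A simplified sketch (not the exact class, the naming scheme is illustrative) would look like this:

import org.apache.beam.sdk.io.{Compression, FileIO}
import org.apache.beam.sdk.io.fs.ResourceId
import org.apache.beam.sdk.transforms.windowing.{BoundedWindow, PaneInfo}

// Simplified sketch of BucketedFileNaming: write every file under a directory
// named after its hourly bucket (epoch seconds), optionally tagging the pane timing.
// outfolder is kept only for parity with the call site; the base path is already
// provided to FileIO via .to(...).
class BucketedFileNaming(outfolder: ResourceId, bucket: Integer, withTiming: Boolean = false)
    extends FileIO.Write.FileNaming {

  override def getFilename(window: BoundedWindow,
                           pane: PaneInfo,
                           numShards: Int,
                           shardIndex: Int,
                           compression: Compression): String = {
    val timingSuffix = if (withTiming) s".${pane.getTiming.toString.toLowerCase}" else ""
    s"$bucket/part-$shardIndex-of-$numShards$timingSuffix${compression.getSuggestedSuffix}"
  }
}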
And this is my WindowedByWatermark:
class WindowedByWatermark(bucketSize: Int = 5000000) extends PTransform[PCollection[String], PCollection[String]] {

  val window: Window[String] = Window
    .into[String](FixedWindows.of(Duration.standardMinutes(10)))
    .triggering(
      AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterPane.elementCountAtLeast(bucketSize))
    )
    .withAllowedLateness(Duration.standardMinutes(30))
    .discardingFiredPanes()

  override def expand(input: PCollection[String]): PCollection[String] = {
    input.apply("window", window)
  }
}
The pipeline runs flawlessly, but it suffers from very high backpressure during the write phase (the GroupByKey introduced by writeDynamic). Most of the events arrive in real-time, so they all fall into the same hourly bucket. I also tried bucketing the data by hour and minute, without much improvement.
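The finer-grained (hour and minute) destination function was essentially the same computation with the truncation changed; roughly (the val name here is only for illustration):

import org.apache.beam.sdk.transforms.SerializableFunction
import org.joda.time.Instant

// Rough sketch of the minute-level variant: truncate the event timestamp
// (epoch seconds) to the minute instead of the hour.
val byMinute = new SerializableFunction[CustomEvent, Integer] {
  override def apply(input: CustomEvent): Integer = {
    val truncatedToMinute = new Instant(input.eventTime * 1000L).toDateTime.withSecondOfMinute(0)
    (truncatedToMinute.getMillis / 1000).toInt
  }
}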
After days of pain, I decided to replicate the same pipeline in Flink using a bucketingSink, and the performance is excellent.
val stream = env
  .addSource(new FlinkKafkaConsumer011[String](options.kafkaInputTopic, new SimpleStringSchema(), properties))
  .addSink(bucketingSink(options.hdfsOutputPath, options.batchSize))
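Here bucketingSink is just a thin helper around Flink's BucketingSink. A rough sketch of it, under the assumption that it wraps a time-based bucketer (note that DateTimeBucketer buckets by processing time, so bucketing on the event's own timestamp would need a custom Bucketer):

import org.apache.flink.streaming.connectors.fs.StringWriter
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

// Rough sketch of the bucketingSink helper. DateTimeBucketer buckets by wall-clock
// time; bucketing on the event's own timestamp field would need a custom Bucketer.
// setBatchSize is the part-file roll-over threshold in bytes.
def bucketingSink(outputPath: String, batchSize: Long): BucketingSink[String] = {
  new BucketingSink[String](outputPath)
    .setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HH"))
    .setWriter(new StringWriter[String]())
    .setBatchSize(batchSize)
}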
According to my analysis (also confirmed via JMX), the Beam threads spend their time waiting during the write phase to HDFS, which causes the pipeline to pause the retrieval of data from Kafka.
I therefore have the following questions:

- Is it possible to push the bucketing down in Beam the way the bucketingSink does?
- Is there a smarter way to achieve the same result in Beam?