I'm using Kinesis Data Analytics for Apache Flink to do stream processing.
My use case is to read records from a single Kinesis stream and, after some transformations, write them to multiple S3 buckets. Each source record contains a lot of information that needs to be split up, so one source record may end up in several buckets.
I tried to achieve this with multiple sinks, one per bucket:
    private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
        OutputFileConfig config = OutputFileConfig
                .builder()
                .withPartSuffix(".snappy.parquet")
                .build();
        final StreamingFileSink<T> sink = StreamingFileSink
                .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
                .withBucketAssigner(new S3BucketAssigner<T>())
                .withOutputFileConfig(config)
                .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
                .build();
        return sink;
    }
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Single Kinesis source, parsed once and shared by every branch below.
        DataStream<PIData> input = createSourceFromStaticConfig(env)
                .map(new JsonToSourceDataMap())
                .name("jsonToInputDataTransformation");

        // Branch 1: raw records, written unchanged.
        input.map(value -> value)
                .name("rawData")
                .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
                .name("s3Sink");

        // Remaining branches: one converter and one sink per output.
        input.map(FirstConverter::convertInputData)
                .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
        input.map(SecondConverter::convertInputData)
                .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
        input.map(ThirdConverter::convertInputData)
                .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
        // and so on; there are around 10 buckets.

        env.execute();
    }
However, this had a large performance impact: CPU usage spiked compared to the same job with just one sink. The scale I'm targeting is around 100k records per second.
Other notes: I'm using the bulk format writer because I want to write files in Parquet format. I also tried increasing the checkpointing interval from 1 minute to 3 minutes, on the assumption that writing files to S3 every minute might be causing the problem, but this didn't help much.
As I'm new to Flink and stream processing, I'm not sure whether this much of a performance impact is expected, or whether there is something I can do better. Would using a flatMap operator followed by a single sink be better?
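To make the flatMap question concrete, this is roughly the shape I have in mind, written as plain Java with no Flink dependency so it can be run on its own. Everything here is a hypothetical stand-in rather than code from my job: the `Routed` wrapper, the `fanOut` method, and the string "conversions" that take the place of `FirstConverter::convertInputData` and friends. In the real job the body of `fanOut` would sit inside a flatMap function, and a single sink would have to route on `target`. I have not checked whether one bulk-format sink can actually handle outputs of different types.

```java
import java.util.ArrayList;
import java.util.List;

final class FanOutSketch {

    /** Hypothetical wrapper: a converted payload plus the S3 sub-path it is destined for. */
    static final class Routed {
        final String target;
        final Object payload;

        Routed(String target, Object payload) {
            this.target = target;
            this.payload = payload;
        }
    }

    /** One source record in, one tagged record per output out (what a flatMap body would emit). */
    static List<Routed> fanOut(String source) {
        List<Routed> out = new ArrayList<>();
        out.add(new Routed("raw_data", source));
        // Stand-ins for the real converters, which return Output1, Output2, ...
        out.add(new Routed("firstOutput", "first:" + source));
        out.add(new Routed("secondOutput", "second:" + source));
        return out;
    }

    public static void main(String[] args) {
        for (Routed r : fanOut("record-1")) {
            System.out.println(r.target + " <- " + r.payload);
        }
    }
}
```

With this shape the per-record conversion work is the same as in the multi-sink version. The only thing that changes is that the records travel through one operator chain instead of ten, so I don't know whether it would reduce the CPU cost.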