Single source, multiple sinks vs. flatMap

Question

I'm using Kinesis Data Analytics on Flink to do stream processing.
My use case is to read records from a single Kinesis stream and, after some transformations, write them to multiple S3 buckets. A single source record might end up in several buckets, because each record contains a lot of information that needs to be split across them.

I tried achieving this using multiple sinks.

private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
    OutputFileConfig config = OutputFileConfig
        .builder()
        .withPartSuffix(".snappy.parquet")
        .build();

    final StreamingFileSink<T> sink = StreamingFileSink
        .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
        .withBucketAssigner(new S3BucketAssigner<T>())
        .withOutputFileConfig(config)
        .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
        .build();
    return sink;
}

public static void main(String[] args) throws Exception {
    // Execution environment used by the source below (not shown in the original excerpt).
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<PIData> input = createSourceFromStaticConfig(env)
        .map(new JsonToSourceDataMap())
        .name("jsonToInputDataTransformation");

    // Branch 1: write the raw records unchanged.
    input.map(value -> value)
        .name("rawData")
        .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
        .name("s3Sink");

    // Remaining branches: convert each record, then write to a dedicated sink.
    input.map(FirstConverter::convertInputData)
        .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));

    input.map(SecondConverter::convertInputData)
        .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));

    input.map(ThirdConverter::convertInputData)
        .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));

    // and so on; there are around 10 buckets.

    env.execute();
}

However, this had a big performance impact: CPU usage spiked compared to the same job with just one sink. The scale I'm looking at is around 100k records per second.

Other notes: I'm using the bulk format writer because I want to write files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, on the assumption that writing files to S3 every minute might be causing the problem, but this didn't help much.

As I'm new to Flink and stream processing, I'm not sure whether this much performance impact is expected or whether there is something I can do better. Would using a flatMap operator followed by a single sink be better?

score 0 · Answer 1 · answered Jan 29 '21 at 13:21

When you had a very simple pipeline with a single source and a single sink, something like this:

source -> map -> sink

then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task -- with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies -- perhaps including the one you have now with multiple sinks -- but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
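To make the difference concrete, here is a toy sketch in plain Java. It is not Flink code: the class ChainingSketch, the Event type, and both helper methods are hypothetical names made up for illustration, and Java's built-in serialization is only a stand-in for whatever serializer Flink would pick for your types. The chained path hands the record to the next step as an ordinary method call; the unchained path copies it through a byte buffer first, which is roughly the extra work each record incurs when two operators are not chained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

/** Toy illustration only; this is plain Java, not Flink code. */
public class ChainingSketch {

    /** Hypothetical stand-in for a source record. */
    public static class Event implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String id;
        public final long value;

        public Event(String id, long value) {
            this.id = id;
            this.value = value;
        }
    }

    /** Chained: the record reaches the next step through a plain method call. */
    public static <I, O> O chained(I record, Function<I, O> next) {
        return next.apply(record);
    }

    /** Unchained: the record is serialized and deserialized before the next step sees it. */
    public static <I extends Serializable, O> O unchained(I record, Function<I, O> next)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(record);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            I copy = (I) in.readObject();
            return next.apply(copy);
        }
    }

    public static void main(String[] args) throws Exception {
        Event event = new Event("a", 42L);
        Function<Event, String> convert = ev -> ev.id + ":" + ev.value;

        // Both paths produce the same result; only the per-record work differs.
        System.out.println(chained(event, convert));
        System.out.println(unchained(event, convert));
    }
}
```

Both calls return the same value, so the job's output is unaffected; the unchained path simply does more work per record. If the branches to your sinks are not being chained, that extra work would be paid once per branch for every record.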

I don't see how using a flatmap would make any difference.

You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.

Thanks for the suggestion. Is there a way to measure the serialization/deserialization time with KDA? — Vipul, Jan 30 '21 at 20:04
