I am writing archive data files to HDFS with Apache Flink. The generated file names follow the pattern part-{parallel-task}-{count}, but I need them to end with a ".gz" suffix so that Apache Spark can load them directly.

I can't find any API to add a suffix to the final, completed files generated by BucketingSink in Apache Flink. Suffixes can only be set for the in-progress, pending and valid-length states. Can anyone help? I am using the HDFS connector and the Java API.

Casel Chen

1 Answer

As far as I can see, the default BucketingSink has no option for adding a suffix to finished part files.

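One option would be to disable checkpointing and set the pending suffix to the desired suffix. Without checkpointing, pending files are never moved to the finished state, so the pending suffix effectively becomes the final one. But since checkpointing is desirable in most cases, this isn't optimal.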

My solution was to create a BucketingSinkWithSuffix implementation which is almost an exact copy of the default BucketingSink. Only two things need to change: add a member variable for the suffix, which is set in the constructor, and adjust the way the part path is created.

Here's my implementation for the constructor:

    public BucketingSinkWithSuffix(String basePath, String suffix) {
        this.basePath = basePath;
        this.bucketer = new DateTimeBucketer<>();
        this.writerTemplate = new StringWriter<>();
        // New member: suffix appended to every part file name, e.g. ".gz"
        this.partSuffix = suffix;
    }

And for generating the part path (lines 523 and 528):

    partPath = new Path(bucketPath, partPrefix + "-" + subtaskIndex + "-" + bucketState.partCounter + partSuffix);
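
To check what names this produces without running a Flink job, the naming rule can be pulled out into a small plain-Java helper. This is only a sketch: PartFileNames and partFileName are hypothetical names that are not part of Flink, and treating a null suffix as empty is my own assumption rather than something the sink does.

```java
// Hypothetical helper, not part of Flink: it only mirrors the string
// concatenation used for the part path in the modified sink.
final class PartFileNames {

    private PartFileNames() {
        // static utility, no instances
    }

    // Builds "<partPrefix>-<subtaskIndex>-<partCounter><partSuffix>".
    // Assumption: a null suffix is treated as "no suffix".
    static String partFileName(String partPrefix, int subtaskIndex,
                               long partCounter, String partSuffix) {
        String suffix = (partSuffix == null) ? "" : partSuffix;
        return partPrefix + "-" + subtaskIndex + "-" + partCounter + suffix;
    }
}
```

With the default "part" prefix, calling PartFileNames.partFileName("part", 3, 7, ".gz") gives a name ending in ".gz". Note that the suffix only changes the file name, so the writer still has to produce gzip-compressed data for Spark to read the file correctly.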
RemiM