
I have a Flink batch job which reads from Kafka and writes to S3. The job's strategy is to read a bounded range:

From: timestamp To: timestamp.

So I basically have my Kafka consumer as follows:

    KafkaSource.<T>builder()
            .setBootstrapServers(resolvedBootstrapBroker)
            .setTopics(List.of("TOPIC_0"))
            .setGroupId(consumerGroupId)
            .setStartingOffsets(OffsetsInitializer.timestamp(startTimeStamp))
            .setValueOnlyDeserializer(deserializationSchema)
            .setBounded(OffsetsInitializer.timestamp(endTimeStamp))
            .setProperties(additionalProperties)
            .build();

The start and end timestamps are calculated as follows (from 10 days ago to 10 hours ago):

        long startTimeStamp = Instant.now().minus(10, ChronoUnit.DAYS).toEpochMilli();
        long endTimeStamp = Instant.now().minus(10, ChronoUnit.HOURS).toEpochMilli();
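As a side note, calling Instant.now() twice derives the two bounds from slightly different instants. A minimal, self-contained sketch that captures "now" once and sanity-checks the window (class and variable names are mine):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class OffsetWindow {
    public static void main(String[] args) {
        // Capture "now" once so both bounds are derived from the same instant.
        Instant now = Instant.now();
        long startTimeStamp = now.minus(10, ChronoUnit.DAYS).toEpochMilli();
        long endTimeStamp = now.minus(10, ChronoUnit.HOURS).toEpochMilli();

        // The bounded window must be non-empty (start < end), otherwise the
        // timestamp initializers would select an empty offset range.
        System.out.println(startTimeStamp < endTimeStamp); // prints "true"
    }
}
```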

However, no records are written to S3. If I just switch the bounded parameter to:

                .setBounded(OffsetsInitializer.latest())

it works and writes to S3. Any idea what I might be doing wrong?

EDIT:

I learned that it is writing the partial (in-progress) file, but it never finalizes the partial file into a completed file. Any idea why that might be happening?

Vinod Mohanan

1 Answer


Flink's FileSink only commits results during checkpointing (or at the end of the batch when operating on bounded inputs). See this note in the documentation:

IMPORTANT: Checkpointing needs to be enabled when using the FileSink in STREAMING mode. Part files can only be finalized on successful checkpoints. If checkpointing is disabled, part files will forever stay in the in-progress or the pending state, and cannot be safely read by downstream systems.
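In STREAMING mode this means checkpointing must be enabled before the job runs, either on the environment (env.enableCheckpointing(60_000)) or via configuration; a minimal config fragment (the 60 s interval is illustrative, not a recommendation):

```
# flink-conf.yaml — enable periodic checkpoints so the FileSink
# can finalize in-progress part files on each successful checkpoint
execution.checkpointing.interval: 60 s
```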

David Anderson
  • Hey David, I am running in batch mode. And looks like in batch mode checkpoints are disregarded: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/#checkpointing – Vinod Mohanan Aug 30 '22 at 10:28
  • I also ran some experiments in streaming mode. With a timestamp end offset the job never completes, but with the latest end offset it does. In other words, with a timestamp end offset: in batch mode the intermediate files are never finalized, and in streaming mode the job never completes. Given this is a bounded source, neither of these should happen, correct? – Vinod Mohanan Aug 30 '22 at 10:36