
I am a newbie to Flink and facing some challenges solving the use case below.

Use Case description:

I will receive a CSV file with a timestamp every day in some folder, say input. The file name format would be file_name_dd-mm-yy-hh-mm-ss.csv.

My Flink pipeline will read this CSV file row by row, and each row will be written to my Kafka topic.

Immediately after the data has been read, the file needs to be moved to another folder, a historic folder.

Why I need this: suppose the Ververica server stops, either abruptly or manually. If all the processed files are still lying in the same location, then after the Ververica restart Flink will re-read all the files it had already processed. To prevent this scenario, the already-read files need to be moved to another location immediately.

I googled a lot but did not find anything, so can you guide me on how to achieve this?

Let me know if anything else is required.

MiniSu

2 Answers


Out of the box, Flink provides the facility to monitor a directory for new files and read them via StreamExecutionEnvironment#readFile (see similar Stack Overflow threads for examples: "How to read newly added file in a directory in Flink", "Monitoring directory for new files with Flink for data streams", etc.).
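
For reference, a minimal sketch of that out-of-the-box usage (the input path and the 10-second poll interval are placeholder values):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// TextInputFormat emits the file contents line by line
TextInputFormat format = new TextInputFormat(new Path("/path/to/input"));

// re-scan the input directory every 10 seconds and read any newly appeared files
DataStream<String> lines = env.readFile(
        format,
        "/path/to/input",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        10_000L);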

Looking into the source code of the readFile function, it calls the createFileInput() method, which simply instantiates a ContinuousFileMonitoringFunction and a ContinuousFileReaderOperatorFactory and configures the source:

addSource(monitoringFunction, sourceName, null, boundedness)
                        .transform("Split Reader: " + sourceName, typeInfo, factory);

ContinuousFileMonitoringFunction is where most of the logic happens.

So, if I were to implement your requirement, I would extend the functionality of ContinuousFileMonitoringFunction with my own logic for moving the processed files into the history folder and construct the source from this function.

Given that the run method performs the read and forwarding inside the checkpointLock:

synchronized (checkpointLock) {
    monitorDirAndForwardSplits(fileSystem, context);
}

I would say it is safe, on checkpoint completion, to move to the historic folder those files whose modification time is older than globalModificationTime, which is updated in monitorDirAndForwardSplits when splits are collected.

With that in mind, I would extend the ContinuousFileMonitoringFunction class, implement the CheckpointListener interface, and in notifyCheckpointComplete move the already processed files to the historic folder:

public class ArchivingContinuousFileMonitoringFunction<OUT>
        extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {
    ...

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
        // do move logic
    }

    /**
     * Returns the paths of the files already processed.
     *
     * @param fileSystem The filesystem where the monitored directory resides.
     * @param path The monitored directory.
     */
    private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {

        final FileStatus[] statuses;
        try {
            statuses = fileSystem.listStatus(path);
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }

        if (statuses == null) {
            LOG.warn("Path does not exist: {}", path);
            return Collections.emptyMap();
        } else {
            Map<Path, FileStatus> files = new HashMap<>();
            // collect the files the monitor would now ignore, i.e. the already processed ones
            for (FileStatus status : statuses) {
                if (!status.isDir()) {
                    Path filePath = status.getPath();
                    long modificationTime = status.getModificationTime();
                    if (shouldIgnore(filePath, modificationTime)) {
                        files.put(filePath, status);
                    }
                } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                    files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
                }
            }
            return files;
        }
    }
}
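
The "do move logic" part is left open above. A minimal sketch of it, assuming the class keeps its own fs and path references (as in the snippet) plus a hypothetical historicPath field pointing at the archive directory, could use Flink's FileSystem#rename:

for (Map.Entry<Path, FileStatus> entry : eligibleFiles.entrySet()) {
    Path source = entry.getKey();
    // historicPath is an assumed field holding the archive directory
    Path target = new Path(historicPath, source.getName());
    try {
        // rename() moves the file within the same file system;
        // it returns false if the move could not be performed
        if (!fs.rename(source, target)) {
            LOG.warn("Could not move {} to {}", source, target);
        }
    } catch (IOException e) {
        // the file may have been moved concurrently;
        // it will be retried on the next completed checkpoint
        LOG.warn("Failed to move {} to {}", source, target, e);
    }
}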

and then define the data stream manually with the custom function:

ContinuousFileMonitoringFunction<OUT> monitoringFunction =
        new ArchivingContinuousFileMonitoringFunction<>(
                inputFormat, monitoringMode, env.getParallelism(), interval);

ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory =
        new ContinuousFileReaderOperatorFactory<>(inputFormat);

final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;

// this mirrors the wiring done internally by createFileInput()
env.addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
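
To connect this back to the use case in the question, the stream produced by the transform above can then be written to Kafka. A minimal sketch, assuming the records are Strings (one CSV row each) and the stream is assigned to a variable named lines; the broker address and topic name are placeholders:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");

// write every CSV row to the Kafka topic
lines.addSink(new FlinkKafkaProducer<>("csv-rows", new SimpleStringSchema(), props));
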
Mikalai Lushchytski
  • Note that ContinuousFileMonitoringFunction is an internal class that might be changed at any time. This part of Flink is undergoing some rework at the moment, with https://ci.apache.org/projects/flink/flink-docs-stable/api/java/org/apache/flink/connector/file/src/FileSource.html being the new file source. – David Anderson Aug 13 '21 at 13:06
  • @Mikalai Lushchytski : I understand your point, but as David Anderson suggested, ContinuousFileMonitoringFunction may change, so is there any other way to achieve this? This is really needed for my project, so asking. – MiniSu Aug 19 '21 at 13:47
  • @SamirVasani, are you going to frequently upgrade the Flink version in your project? For example, I also extended from an internal Flink class a couple of years ago and the job is still running quite stably, so I am not sure whether this is really a big issue. You can simply copy the source of this file and extend it with your functionality, making a new function that extends RichSourceFunction. This way you won't depend on any API changes across versions. – Mikalai Lushchytski Aug 19 '21 at 13:55
  • @Mikalai Lushchytski : No, I am not going to update it. Let me try this. – MiniSu Aug 19 '21 at 14:24
  • @Mikalai Lushchytski : I tried the above example, which works if I want to move the file at any time after it arrives. But my use case is that the file must be moved to another folder as soon as data reading completes. Also, in my case it is a bounded read: once the file arrives at the input folder, no new data will be appended to it. Could you slightly modify the above example? – MiniSu Jan 10 '22 at 11:16
  • any updates regarding this question? I have the same requirement – dodo Jan 22 '22 at 01:00
  • @dodo: you need to write your own logic to implement the file movement. You can refer to Mikalai's answer. – MiniSu Jan 31 '22 at 10:27
  • @MikalaiLushchytski shouldn't it be files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath())); instead of files.putAll(listEligibleFiles(fileSystem, status.getPath())); – davyjones Jul 15 '22 at 14:55
  • @davyjones, right, thank you! Applied the change. – Mikalai Lushchytski Jul 17 '22 at 06:48

Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle this.

You can ask about this on the flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.

David Anderson
  • If you are aware of one, can you share the git URL? – MiniSu Aug 12 '21 at 17:27
  • I remember a related question about 4 years ago on the mailing list, but they didn't share any code (if I recall correctly). – David Anderson Aug 12 '21 at 19:23
  • I tried searching there but could not find any result. Flink's mail search is not as advanced as Stack Overflow's. I have two questions here: 1) Mikalai Lushchytski suggested one solution, but as you said, ContinuousFileMonitoringFunction may change at any time, so is there any other way to achieve this programmatically? 2) What kind of scripts are you suggesting here (as per your above comment)? Thanks – MiniSu Aug 19 '21 at 13:44