How to avoid duplicate file processing on SFTP

Question

We have one java application that polls files from client FTPs at every 30 minutes interval and then do a scan of all the files and see what all files are matching with the patterns configured inside the application and based on that process the files accordingly. The problem here is that we have to do linear scan at every 30 minutes and this is taking too much time. Since we do not want to process duplicate files so we maintain hashcode of file at our end and then we check if the hashcode is matching with the existing hashcodes. Deletion of processed file is not possible because of permissions. Need help here on how to optimize this.

We are using SSHJ library for SFTP communications.

So cannot you just remember names of processed files, instead of hashing their contents? — Martin Prikryl, May 04 '18 at 11:44
@MartinPrikryl yes because sometimes two different files are present with the same name. So that is why we are storing hash of content. — anurag garg, May 04 '18 at 17:08
To calculate a hash you always need to read the contents, so it will always take a lot of time — Hiery Nomus, May 07 '18 at 06:23
I'm not sure how that answers my question. What is "two different days file"? Are files removed each day? Or what? — Martin Prikryl, May 08 '18 at 10:35
How about tracking the changetime of the file. If you keep the knowledge of the last scan timestamp, just read the files newer than that. — Hiery Nomus, May 08 '18 at 12:59
Would it be sufficient to only save the filename and the timestamp of each file? if the scan finds a file that has no match for BOTH values (filename and timestamp) it would be downloaded. It is really consuming to first download the file just to figure out that we already have this file. Delete was not possible for original files but can you rename/move the original files after download? — Jokkeri, Jul 17 '18 at 10:42

score 0 · Answer 1 · edited Jul 03 '18 at 14:09

0

You can't compare hashcode from client side to hashcode of a file with server side file (because you have to download the file).

What you could do is: Execute ls command on server, get fileinfo (date, size, name, isDir) information. use this as hashcode to compare. Skip those whose hashcode already exists.

edited Jul 03 '18 at 14:09

Banana

2,435
7
34
60

answered Jul 03 '18 at 11:14

Akash Srivastava

49
2

score 0 · Answer 2 · answered Jul 04 '18 at 18:50

Why not to use an SFTP Inbound Channel Adapter and combine it with SftpPersistentAcceptOnceFileListFilter? That will keep track of the already downloaded files, and not download them twice.

<int-sftp:inbound-channel-adapter ...
    filter="remoteFilter">...</...

<bean class="org.springframework.integration.sftp.filters.SftpPersistentAcceptOnceFileListFilter">
    <constructor-arg name="store" ref="metadataStore"/>
    <constructor-arg value="myapp"/>
</bean>

<bean name="metadataStore" class="org.springframework.integration.metadata.PropertiesPersistingMetadataStore">
    <property name="baseDirectory" value="./metadata"/>
</bean>

How to avoid duplicate file processing on SFTP

2 Answers2