I am looking for ways to avoid transferring duplicate files over HTTP and SFTP. Each time a transfer is performed, my system stores the state of that transfer in an external cache.

Before each transfer, I look up the external cache, and if there is an entry for the current file with the status SUCCESS, the file is skipped. This works well as long as my system is able to store the status in the cache every time a transfer happens. But if the service dies after the transfer completes and before the status is written, the service has no record of the transfer, and the next time the same file comes in, it is transferred again.

One way to improve this is to update the cache both before and after the transfer, so that I at least have a record that a transfer was attempted. But is there any other way to avoid this? Once the file has been transferred to the external system, there is no way to undo it if writing the status fails. Any thoughts?
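To make the "update before and after" idea concrete, here is a rough sketch of what I have in mind. Everything in it is a placeholder: `cache` is a plain dict standing in for the external cache, and `send` and `remote_has` are hypothetical callables for the actual HTTP/SFTP upload and for some way of asking the destination whether the file already arrived. Whether such a check is even possible depends on the external system.

```python
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    SUCCESS = "SUCCESS"


def transfer_once(cache, file_id, send, remote_has):
    """Transfer file_id unless the cache or the destination shows it was delivered."""
    status = cache.get(file_id)
    if status == Status.SUCCESS:
        return "skipped"

    # An IN_PROGRESS entry means a previous run may have died between
    # finishing the upload and writing SUCCESS. Ask the destination
    # before sending again.
    if status == Status.IN_PROGRESS and remote_has(file_id):
        cache[file_id] = Status.SUCCESS
        return "reconciled"

    cache[file_id] = Status.IN_PROGRESS  # written before the transfer
    send(file_id)
    cache[file_id] = Status.SUCCESS      # written after the transfer
    return "sent"
```

This narrows the window but does not close it: if the destination cannot be queried, an IN_PROGRESS entry only tells me that a duplicate is possible, not that one exists.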

Vijay Muvva

1 Answer

I routinely synchronize external data and have written enough mastering processes to speak on the subject. You are asking for a logistics solution without mentioning the context of the data or the purpose of delivering it to another location.

Are you trying to mirror a master copy of the file to another location? If so, you simply need to deliver the file with a unique delivery number attached, so that the recipient can independently synchronize the two data sets and handle any differences it detects between the files. If you forcibly do this work on behalf of the recipient, you may be destroying data. I consistently recommend having the recipient pull the data as needed and synchronize/master it themselves, rather than having it pushed to them. That way these business rules live where they belong. Push processes are bad.
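As a minimal sketch of the recipient side of that idea: the receiver below remembers each delivery number together with a hash of the content, so a re-sent file becomes a harmless no-op and a changed file under the same number is flagged instead of silently overwritten. The `Receiver` class, its method names, and the in-memory dicts are all hypothetical; a real recipient would persist this state and apply its own business rules to conflicts.

```python
import hashlib


class Receiver:
    """Hypothetical recipient that deduplicates by delivery number."""

    def __init__(self):
        self.seen = {}   # delivery_id -> SHA-256 hex digest of the content
        self.files = {}  # file name -> stored bytes

    def receive(self, delivery_id, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if delivery_id in self.seen:
            # Same number, same bytes: the sender retried, nothing to do.
            if self.seen[delivery_id] == digest:
                return "duplicate"
            # Same number, different bytes: leave the stored copy alone
            # and let the recipient's own rules decide what to do.
            return "conflict"
        self.seen[delivery_id] = digest
        self.files[name] = data
        return "accepted"
```

With this in place the deliverer can safely retry after a crash, because the decision about duplicates sits with the party that owns the data.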

Are you trying to allow users to overwrite a master file with their own copies, and asking how to coordinate their uploads so that the file isn't overwritten? If so, you need to take away their direct control over that file. You need to synchronize each file separately according to a user-defined process, because each can have its own business rules.

When you say "look up the external cache and if there is an entry for the current file with the status SUCCESS, the file will be skipped", you have given far too much responsibility to the deliverer. How is the deliverer supposed to know? In manufacturing, no deliverer would be expected to do more than carry the load. Consumers are responsible for allocating that space. If the consumer truly needs the file, let it make the decision to order it and handle receiving it, rather than having the deliverer juggle such decisions.

RBJ