I am trying to find a horizontally scalable solution to the problem described in the title.
In more detail: a worker reads a message from a message-queue web service, where each message contains a URL to a file uploaded somewhere; the worker downloads that file, parses it, and appends its contents to a destination file whose location depends on the parsed contents.
Because of the high volume of messages arriving on the queue (assume a continuous 100 messages per second), processing has to be done concurrently by multiple workers, and without controlled access to the destination files there is a real possibility of losing data.
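For concreteness, here is a minimal sketch of the naive worker loop I have in mind; `queue_client`, `storage_client`, `parse`, and `derive_destination` are hypothetical stand-ins for the real services, and I'm assuming the storage web service only offers whole-file read/write, so an append is a read-modify-write:

```python
import urllib.request

def process_one(queue_client, storage_client):
    """One iteration of the naive worker loop (clients are hypothetical)."""
    msg = queue_client.receive()                    # read a message from the queue
    with urllib.request.urlopen(msg.url) as resp:   # msg.url points at the uploaded file
        payload = resp.read()                       # download the file
    content = parse(payload)                        # parse the downloaded file
    dest = derive_destination(content)              # destination depends on the contents

    # Race window: if two workers reach this point for the same dest,
    # both read the same existing bytes, both write back, and one
    # worker's append is silently overwritten.
    existing = storage_client.read(dest)            # whole-file read
    storage_client.write(dest, existing + content)  # whole-file write-back

def parse(payload: bytes) -> bytes:
    """Hypothetical parser; the real format is application-specific."""
    raise NotImplementedError

def derive_destination(content: bytes) -> str:
    """Hypothetical mapping from parsed contents to a destination path."""
    raise NotImplementedError
```

The question is essentially how to close that race window without serializing all messages through a single writer.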
One relevant detail: within a batch of messages it is unlikely that two messages target the same destination file (assume this happens for about 1% of messages, evenly distributed), and a message and its file can be processed slightly faster than the next message can be read from the queue, which lowers the probability of a collision considerably.
Losing some data may be acceptable if the probability is really low, but I don't have an exact number (see the rough estimate after the specifics below).
What are the available algorithms or design patterns for this?
Some specifics:
- 10 million distinct output files
- 5 million messages a day
- file storage is provided by a third-party web service with unlimited concurrent reads/writes
- message order has no importance
- a message contains only a URL to a file (whose name is a GUID)
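To put a rough number on the expected loss, here is a back-of-the-envelope estimate using the figures above; `OVERLAP_FRACTION` is a hypothetical guess that I would need to measure:

```python
# Back-of-the-envelope estimate of lost appends per day.
# MESSAGES_PER_DAY and SHARED_DEST_FRACTION come from the assumptions
# above; OVERLAP_FRACTION is a hypothetical guess.
MESSAGES_PER_DAY = 5_000_000
SHARED_DEST_FRACTION = 0.01   # ~1% of messages share a destination within a batch
OVERLAP_FRACTION = 0.01       # guessed fraction of those actually in flight at once

at_risk = MESSAGES_PER_DAY * SHARED_DEST_FRACTION
lost = at_risk * OVERLAP_FRACTION
print(f"messages sharing a destination per day: {at_risk:,.0f}")  # 50,000
print(f"rough lost appends per day:             {lost:,.0f}")     # ~500
```

Whether something like 500 lost appends per day counts as "really low" is exactly the judgment I can't make yet, which is why I'm interested in patterns that reduce or eliminate the loss rather than just tolerate it.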