
I am trying to find a horizontally scalable solution to the problem described in the title.

A more detailed explanation of the problem: from a message queue web service, read a message containing a URL to a file uploaded somewhere, download the file, parse it, and append its contents to a file whose location depends on those contents.

Because of the high volume of messages arriving in the queue (assume 100 messages per second, continuously), concurrent processing by multiple workers could lose data if access to the files is not controlled.

One relevant detail: within a batch of messages, it is unlikely that two messages target the same destination file (assume this happens for 1% of messages, evenly distributed), and a message and its file can be processed slightly faster than the message can be read from the queue, which lowers the probability of a collision quite a bit.

Losing some data may be acceptable if the probability is really low, but I don't have an exact figure.

What are the available algorithms or design patterns for this?

Some specifics:

  • 10 million distinct output files
  • 5 million messages a day
  • file storage is provided by a third-party webservice with unlimited concurrent read/writes
  • message order has no importance
  • a message only contains a URL to a file (with a GUID as its name)
JRL
  • How many output files are there in total? Can output messages be grouped together in any way? Is the file storage on the same server? Must the message order be retained? – Stefan Hanke Jul 10 '14 at 03:38
  • What is the disk concurrency / striping / whatever for the file storage, i.e. how many files can be written concurrently? Are they SSDs or magnetic disks? – Zim-Zam O'Pootertoot Jul 10 '14 at 05:14

2 Answers


Since you can scale the basic work of downloading and appending arbitrarily across any number of workers, the key issue here appears to be how to guarantee that only one update happens to a given file at a time. Some ways to achieve that:

Option 1: Split the downloading from the appending. Multiple 'download' workers fetch the content, calculate the destination location, hash that location, and put the content onto a writer queue chosen by the hash. Multiple 'writer' workers each consume a single queue and process it in sequence, with the guarantee that no other writer will attempt to update the same location (a minimal sketch of this routing follows below). You may need to implement some form of consistent hashing to allow the system to survive arbitrary failures gracefully.
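A minimal sketch of that routing, assuming Python workers and using the standard library `queue`/`threading` modules in place of a real distributed queue; `append_to_storage` is a hypothetical placeholder for the actual get/put calls against the file storage:

```python
import hashlib
import queue
import threading

NUM_WRITERS = 8
writer_queues = [queue.Queue() for _ in range(NUM_WRITERS)]

def append_to_storage(destination, content):
    # Placeholder for the real append against the third-party storage;
    # swap in the actual get/put calls here.
    print(f"appending {len(content)} bytes to {destination}")

def route(destination, content):
    """Pick a writer queue deterministically by hashing the destination path."""
    digest = hashlib.sha1(destination.encode("utf-8")).hexdigest()
    writer_queues[int(digest, 16) % NUM_WRITERS].put((destination, content))

def writer_loop(q):
    """Each writer owns one queue, so appends to any given file are serialized."""
    while True:
        destination, content = q.get()
        append_to_storage(destination, content)
        q.task_done()

for q in writer_queues:
    threading.Thread(target=writer_loop, args=(q,), daemon=True).start()
```

The same hash-and-route step works across machines if each writer consumes its own named queue in the message broker instead of an in-process queue.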

Option 2: Create a separate system for locking. Multiple workers: each one downloads the content, calculates the location, acquires a lock on that location in a secondary system (database, file system, in-memory distributed cache), performs the append operation, and releases the lock. Essentially this becomes a distributed lock problem (see the sketch after this paragraph).
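A sketch of the locking approach, assuming a Redis instance and the redis-py client as the secondary lock store (any system with an atomic set-if-absent operation would do); `append_fn` stands in for the worker's own download-and-append logic:

```python
import time
import uuid

import redis  # assumes a reachable Redis server and the redis-py package

r = redis.Redis()

def with_file_lock(destination, append_fn, ttl=30):
    """Serialize appends per destination file via a lock key in Redis."""
    token = str(uuid.uuid4())
    lock_key = "lock:" + destination
    # SET NX is atomic; the expiry guards against a crashed worker holding the lock forever.
    while not r.set(lock_key, token, nx=True, ex=ttl):
        time.sleep(0.05)
    try:
        append_fn(destination)
    finally:
        # Release only if we still hold the lock; this check-then-delete is not
        # strictly atomic, so a production version would use a small Lua script.
        if r.get(lock_key) == token.encode():
            r.delete(lock_key)
```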

Ian Mercer

I don't see the catch; you probably forgot to mention it. For the problem as you described it there is a very simple solution: just distribute messages across a pool of worker nodes in a round-robin or load-balanced manner. Each worker loads the file, parses it, and stores the result in the third-party storage. That's all.

Look for a (distributed) message queue solution such as RabbitMQ, for example.

Edit: So it turns out to be a dumb-storage problem. There has to be a real storage layer in front of the dumb third-party storage that provides "atomic" append and transparent compression/decompression. There are well-known techniques for building scalable storage; look at the famous Dynamo paper. Because your feature requirements are very narrow, you can easily write your own solution around an open-source ring implementation such as Riak Core (from Riak) and use the third-party storage as the backend.

In short, the basic principle is: divide the destination space into buckets by (consistent) hashing, then keep a serializer for each bucket that provides the atomic operations you need, in your case append plus transparent (de)compression. The serializer keeps state and also acts as a cache, so from the outside it looks lock-free. A sketch of such a serializer follows below.
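A rough sketch of one such serializer, assuming Python and an in-memory stand-in for the third-party get/put storage; the class names and the `InMemoryStorage` helper are illustrative, not part of any existing library:

```python
import gzip
import queue
import threading

class InMemoryStorage:
    """Stand-in for the third-party webservice that only offers get and put."""
    def __init__(self):
        self._blobs = {}
    def get(self, path):
        return self._blobs.get(path)
    def put(self, path, data):
        self._blobs[path] = data

class BucketSerializer:
    """Sole owner of all files hashed into its bucket, so read-modify-write
    cycles against the dumb storage never interleave."""
    def __init__(self, storage):
        self.storage = storage
        self.inbox = queue.Queue()
        self.cache = {}  # path -> uncompressed text; doubles as a write-back cache
        threading.Thread(target=self._run, daemon=True).start()

    def append(self, path, text):
        self.inbox.put((path, text))

    def _run(self):
        while True:
            path, text = self.inbox.get()
            current = self.cache.get(path)
            if current is None:
                blob = self.storage.get(path)
                current = gzip.decompress(blob).decode("utf-8") if blob else ""
            current += text
            self.cache[path] = current
            # Write back compressed; a real version would batch or flush lazily.
            self.storage.put(path, gzip.compress(current.encode("utf-8")))
            self.inbox.task_done()
```

Routing a destination to its bucket's serializer uses the same consistent-hash step as in the other answer, so callers just enqueue appends and never see a lock.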

Hynek -Pichi- Vychodil
  • The "catch" is that there are n messages and m files, with m < n. If two consecutive messages that should go to one destination file get processed at the same time by two workers, how do you ensure no data is lost? Based on the low probability, is there a way to avoid or minimize a locking mechanism? – JRL Jul 10 '14 at 06:41
  • @JRL Does the storage provide an append operation? You didn't mention it. – Hynek -Pichi- Vychodil Jul 10 '14 at 06:43
  • No, just get and put operations. Further, the data is text data that must be compressed/uncompressed to reduce storage costs. – JRL Jul 10 '14 at 06:45
  • Oh, I see now. It's a storage problem: the storage is wrong, so you have to make your own. – Hynek -Pichi- Vychodil Jul 10 '14 at 06:48