I would like to ask if anyone has an idea about the best (fastest) algorithm for the following scenario:
- X processes generate a list of very large files. Each process generates one file at a time
- Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
- At any given time, one X process notifies one Y process through a Load Balancer that uses a Round Robin algorithm
- Each file has a size and, naturally, bigger files keep both X and Y busy for longer
Limitations
- Once a file is assigned to a Y process, it is impractical to remove it and move it to another Y process.
I can't think of other limitations at the moment.
Disadvantages of this approach
- Sometimes X falls behind (files are no longer pushed). This is not really affected by the queueing system; no matter what I change, it will still have slow and fast periods.
- Sometimes Y falls behind (a lot of files gather in the queues). Again, the same as above.
- One Y process can be busy with a very large file while several small files sit in its queue that could be picked up by other, idle Y processes.
- The notification itself goes over HTTP and sometimes seems unreliable: notifications fail and debugging has not revealed why.
Some more details that help to see the picture more clearly:
- Y processes are DB threads/jobs
- X processes are web apps
- Once files reach the X processes, these also burn DB resources by querying it, which has an impact on the producing side
Now I'm considering the following approach:
- X will produce files as before but will not notify Y. Instead, it will hold a buffer (a table) and populate it with the list of ready files
- Y will continuously poll the buffer, retrieve files itself, and store them in its own queue (a sketch of this pull model is shown after this list)
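To make the pull model concrete, here is a minimal sketch of what each Y process could run. It assumes a hypothetical buffer table `files(id, path, size, claimed_by)` and uses SQLite purely for illustration; on the real DB with several competing Y processes you would want proper row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED` on PostgreSQL), and `process_file` is just a stand-in for the actual Y work.

```python
import sqlite3
import time

# Assumed buffer table, filled by X as files become ready:
#   CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT, size INTEGER, claimed_by TEXT);

def process_file(path: str, size: int) -> None:
    """Placeholder for the real Y work (importing/querying the file)."""
    print(f"processing {path} ({size} bytes)")

def claim_next_file(conn: sqlite3.Connection, worker_id: str):
    """Atomically claim one unclaimed file for this worker; None if the buffer is empty."""
    conn.execute("BEGIN IMMEDIATE")                 # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, path, size FROM files "
            "WHERE claimed_by IS NULL ORDER BY id LIMIT 1"   # FIFO; ORDER BY size = greedy
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE files SET claimed_by = ? WHERE id = ?",
                         (worker_id, row[0]))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise

def worker_loop(db_path: str, worker_id: str) -> None:
    """What a Y process would run instead of waiting for HTTP notifications."""
    conn = sqlite3.connect(db_path, isolation_level=None)   # autocommit; transactions are explicit
    while True:
        claimed = claim_next_file(conn, worker_id)
        if claimed is None:
            time.sleep(1)                           # buffer is empty; back off briefly
            continue
        file_id, path, size = claimed
        process_file(path, size)
```

Switching the `ORDER BY` from `id` to `size` would turn this into the greedy "smallest file first" policy discussed below.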
Now, would this change be practical? As I said, each Y process has its own queue, and it no longer seems efficient to keep it. If the change does make sense, I'm still undecided on the next bit:
How to decide which files to fetch
I've read about the knapsack problem, and I think it would apply if I had the entire list of files up front, which I don't. Actually, I do have the list and the size of each file, but I don't know when each file will be ready to be taken.
I've gone through the producer-consumer problem, but that centers on a fixed-size buffer and optimising its use, whereas in this scenario the buffer is unbounded and I don't really care whether it is large or small.
The next best option would be a greedy approach where each Y process locks the smallest available file and takes it. At first glance it does appear to be the fastest approach, and I'm currently building a simulation to verify that (a sketch of one is below), but a second opinion would be fantastic.
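For what it's worth, here is a rough simulation sketch along those lines. It assumes files become ready at known times and that every Y processes data at the same fixed rate; both are simplifying assumptions, and the file sizes and arrival pattern are made up. It compares "take the first ready file" against "take the smallest ready file".

```python
import random

def simulate(files, n_workers, pick_smallest, rate=1.0):
    """files: list of (ready_time, size); processing a file takes size / rate.
    Returns (makespan, average completion time) for the given claiming policy."""
    arrivals = sorted(files)                  # (ready_time, size) in arrival order
    free_at = [0.0] * n_workers               # when each worker is next idle
    ready = []                                # sizes of files waiting to be claimed
    finish_times = []
    i = 0

    while i < len(arrivals) or ready:
        w = min(range(n_workers), key=lambda k: free_at[k])   # earliest idle worker
        now = free_at[w]
        while i < len(arrivals) and arrivals[i][0] <= now:    # files already ready by now
            ready.append(arrivals[i][1])
            i += 1
        if not ready:                         # nothing ready: idle until the next arrival
            now = arrivals[i][0]
            ready.append(arrivals[i][1])
            i += 1
        size = min(ready) if pick_smallest else ready[0]      # greedy vs. first-ready
        ready.remove(size)
        free_at[w] = now + size / rate
        finish_times.append(free_at[w])

    return max(finish_times), sum(finish_times) / len(finish_times)

if __name__ == "__main__":
    random.seed(1)
    # mostly small files with an occasional huge one, one becoming ready every 0.5 time units
    files = [(t * 0.5, random.choice([1, 1, 2, 3, 50])) for t in range(40)]
    for greedy in (False, True):
        label = "smallest-first" if greedy else "first-ready"
        makespan, avg = simulate(files, n_workers=3, pick_smallest=greedy)
        print(f"{label:15s} makespan={makespan:7.1f}  avg completion={avg:7.1f}")
```

One thing worth measuring is the metric itself: smallest-first (the shortest-job-first idea) tends to improve average completion time because small files stop waiting behind big ones, but it also postpones the large files, so it isn't guaranteed to improve the time at which everything is finished. Tracking both in the simulation should make the trade-off visible.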
Update: Just to be sure that everyone gets the big picture, I'm linking a quickly-drawn diagram here.
- Jobs are independent of the Processes. They run at their own speed and process as many files as possible.
- When a Job finishes a file, it sends an HTTP request to the LB
- Each process queues requests (files) coming from the LB
- The LB works on a round-robin rule (a toy sketch of this dispatch follows)
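To illustrate the imbalance this can cause, here is a toy model (not the real LB) of pure round-robin dispatch; the queue count and file sizes are made up.

```python
from collections import deque
from itertools import cycle

def round_robin_dispatch(file_sizes, n_queues):
    """Hand each notification to the next Y queue in turn, ignoring queued work."""
    queues = [deque() for _ in range(n_queues)]
    targets = cycle(range(n_queues))
    for size in file_sizes:
        queues[next(targets)].append(size)
    return queues

# One 100-unit file lands on queue 0 and traps the small files assigned after it,
# even though queue 1 will drain almost immediately.
print(round_robin_dispatch([100, 1, 3, 1, 2, 1], n_queues=2))
```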