I would like to ask if anyone has an idea about the best (fastest) algorithm for the following scenario:
- X processes generate a list of very large files. Each process generates one file at a time
- Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
- At any given time, one X process notifies one Y process through a Load Balancer that uses a Round Robin algorithm
- Each file has a size and, naturally, bigger files keep both X and Y busy for longer
Limitations
- Once a file is assigned to a Y process, it is impractical to remove it and move it to another Y process.
I can't think of other limitations at the moment.
Disadvantages of this approach
- Sometimes X falls behind (files are no longer pushed). This is not really affected by the queueing system; no matter what I change, it will still have slow and fast periods.
- Sometimes Y falls behind (a lot of files gather in the queues). Again, the same as above.
- One Y process can be busy with a very large file while several small files sit in its queue that could be picked up by other, idle Y processes.
- The notification itself goes over HTTP and sometimes seems unreliable: notifications fail and debugging has not revealed why.
Some more details that help to see the picture more clearly:
- Y processes are DB threads/jobs
- X processes are web apps
- Once files reach the X processes, these also burn DB resources by querying it, which has an impact on the producing side
Now I'm considering the following approach:
- X will produce files as before but will not notify Y. Instead, it will hold a buffer (a table) and populate it with the list of ready files
- Y will continuously poll the buffer, retrieve files itself, and store them in its own queue (a sketch of this pull model is shown after this list)
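To make the pull model concrete, here is a minimal sketch of what each Y process could run. It assumes a hypothetical buffer table `files(id, path, size, claimed_by)` and uses SQLite purely for illustration; on the real DB with several competing Y processes you would want proper row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED` on PostgreSQL), and `process_file` is just a stand-in for the actual Y work.

```python
import sqlite3
import time

# Assumed buffer table, filled by X as files become ready:
#   CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT, size INTEGER, claimed_by TEXT);

def process_file(path: str, size: int) -> None:
    """Placeholder for the real Y work (importing/querying the file)."""
    print(f"processing {path} ({size} bytes)")

def claim_next_file(conn: sqlite3.Connection, worker_id: str):
    """Atomically claim one unclaimed file for this worker; None if the buffer is empty."""
    conn.execute("BEGIN IMMEDIATE")                 # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, path, size FROM files "
            "WHERE claimed_by IS NULL ORDER BY id LIMIT 1"   # FIFO; ORDER BY size = greedy
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE files SET claimed_by = ? WHERE id = ?",
                         (worker_id, row[0]))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise

def worker_loop(db_path: str, worker_id: str) -> None:
    """What a Y process would run instead of waiting for HTTP notifications."""
    conn = sqlite3.connect(db_path, isolation_level=None)   # autocommit; transactions are explicit
    while True:
        claimed = claim_next_file(conn, worker_id)
        if claimed is None:
            time.sleep(1)                           # buffer is empty; back off briefly
            continue
        file_id, path, size = claimed
        process_file(path, size)
```

Switching the `ORDER BY` from `id` to `size` would turn this into the greedy "smallest file first" policy discussed below.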
Now, would this change be practical? As I said, each Y process has its own queue, and it no longer seems efficient to keep it. If the change does make sense, I'm still undecided on the next bit:
How to decide which files to fetch
I've read about the knapsack problem, and I think it would apply if I had the entire list of files up front, which I don't. Actually, I do have the list and the size of each file, but I don't know when each file will be ready to be taken.
I've gone through the producer-consumer problem, but that centers on a fixed-size buffer and optimising its use, whereas in this scenario the buffer is unbounded and I don't really care whether it is large or small.
The next best option would be a greedy approach where each Y process locks the smallest available file and takes it. At first glance it does appear to be the fastest approach, and I'm currently building a simulation to verify that (a sketch of one is below), but a second opinion would be fantastic.
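For what it's worth, here is a rough simulation sketch along those lines. It assumes files become ready at known times and that every Y processes data at the same fixed rate; both are simplifying assumptions, and the file sizes and arrival pattern are made up. It compares "take the first ready file" against "take the smallest ready file".

```python
import random

def simulate(files, n_workers, pick_smallest, rate=1.0):
    """files: list of (ready_time, size); processing a file takes size / rate.
    Returns (makespan, average completion time) for the given claiming policy."""
    arrivals = sorted(files)                  # (ready_time, size) in arrival order
    free_at = [0.0] * n_workers               # when each worker is next idle
    ready = []                                # sizes of files waiting to be claimed
    finish_times = []
    i = 0

    while i < len(arrivals) or ready:
        w = min(range(n_workers), key=lambda k: free_at[k])   # earliest idle worker
        now = free_at[w]
        while i < len(arrivals) and arrivals[i][0] <= now:    # files already ready by now
            ready.append(arrivals[i][1])
            i += 1
        if not ready:                         # nothing ready: idle until the next arrival
            now = arrivals[i][0]
            ready.append(arrivals[i][1])
            i += 1
        size = min(ready) if pick_smallest else ready[0]      # greedy vs. first-ready
        ready.remove(size)
        free_at[w] = now + size / rate
        finish_times.append(free_at[w])

    return max(finish_times), sum(finish_times) / len(finish_times)

if __name__ == "__main__":
    random.seed(1)
    # mostly small files with an occasional huge one, one becoming ready every 0.5 time units
    files = [(t * 0.5, random.choice([1, 1, 2, 3, 50])) for t in range(40)]
    for greedy in (False, True):
        label = "smallest-first" if greedy else "first-ready"
        makespan, avg = simulate(files, n_workers=3, pick_smallest=greedy)
        print(f"{label:15s} makespan={makespan:7.1f}  avg completion={avg:7.1f}")
```

One thing worth measuring is the metric itself: smallest-first (the shortest-job-first idea) tends to improve average completion time because small files stop waiting behind big ones, but it also postpones the large files, so it isn't guaranteed to improve the time at which everything is finished. Tracking both in the simulation should make the trade-off visible.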
Update: Just to be sure that everyone gets the big picture, I'm linking a quickly-drawn diagram here.
- Jobs are independent of the Processes. They run at their own speed and process as many files as possible.
- When a Job finishes a file, it sends an HTTP request to the LB
- Each process queues requests (files) coming from the LB
- The LB works on a round-robin rule (a toy sketch of this dispatch follows)
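To illustrate the imbalance this can cause, here is a toy model (not the real LB) of pure round-robin dispatch; the queue count and file sizes are made up.

```python
from collections import deque
from itertools import cycle

def round_robin_dispatch(file_sizes, n_queues):
    """Hand each notification to the next Y queue in turn, ignoring queued work."""
    queues = [deque() for _ in range(n_queues)]
    targets = cycle(range(n_queues))
    for size in file_sizes:
        queues[next(targets)].append(size)
    return queues

# One 100-unit file lands on queue 0 and traps the small files assigned after it,
# even though queue 1 will drain almost immediately.
print(round_robin_dispatch([100, 1, 3, 1, 2, 1], n_queues=2))
```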