4

I have N workers that need to process incoming batches of data. Each worker is configured so that it knows that it is "worker X of N".

Each incoming batch of data has a random unique ID (being random, it is uniformly distributed), and it has a different size; processing time is proportional to the size. Size can vary wildly.

When a new batch of data is available, it is immediately visible to all N workers, but I want only one to actually process it, without coordination among them. Right now, each worker calculates ID % N == X, and if it's true, the worker self-assigns the batch, while the others skip it. This works correctly and makes sure that, on average, each worker processes the same number of batches. Unfortunately, it doesn't take the batch size into account, so some workers can finish processing much later than others, because they might happen to self-assign very large jobs.

How can I change the algorithm so that each worker self-assigns batches in a way that also takes into account the size of the batch, so that on average, each worker will self-assign the same total size of work (from different batches)?
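For reference, a minimal sketch of the current scheme (the `worker_index` / `num_workers` names are placeholders for illustration):

```python
def should_process(batch_id: int, worker_index: int, num_workers: int) -> bool:
    """Current scheme: worker X self-assigns a batch iff ID % N == X.

    Because IDs are uniformly random, this balances the *count* of
    batches per worker, but it is blind to batch size.
    """
    return batch_id % num_workers == worker_index
```

Exactly one of the N workers returns True for any given ID, which is what gives the no-coordination property.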

Giovanni Bajo
  • 1,337
  • 9
  • 15
  • Is `N` big (20 or more) or you cannot make any assumptions about it? – Sergey Kalinichenko Sep 26 '16 at 16:27
  • Good question. In my case it's something like 32 or 64, not 100000. – Giovanni Bajo Sep 26 '16 at 16:29
  • Do you know the distribution of job sizes? Are they uniformly distributed, too? – Sergey Kalinichenko Sep 26 '16 at 16:33
  • Do your batches need to be processed as a whole by one worker, or could workers process selected items? If so, you could let the workers cherry-pick items by doing a modulo on the index of the item within a batch. So, all workers process all batches but cherry-pick which items they process. You could even combine this with your previous idea. Also, have you considered consistent hashing? – Jilles van Gurp Sep 26 '16 at 16:33
  • If you don't mind wasting some CPU you can do any deterministic algorithm to assign work and run it on all nodes at the same time. Each node can then take the work that's assigned to it. – mrmcgreg Sep 26 '16 at 16:38
  • @dasblinkenlight I have no information on job sizes. It varies wildly, and it's not uniformly distributed. – Giovanni Bajo Sep 26 '16 at 19:51
  • @JillesvanGurp I could investigate that, but I think it's going to be too expensive, because of the overhead of downloading and processing a batch without really running it. I can't see how to use consistent hashing for this... any pointers? – Giovanni Bajo Sep 26 '16 at 19:53
  • The idea is similar to what you are doing with the modulo. Instead of a modulo, you hash and then define your workers in terms of ranges on those hashes. This makes it easier to add workers since you can simply resize the ranges. https://en.wikipedia.org/wiki/Consistent_hashing. – Jilles van Gurp Sep 27 '16 at 08:00
  • Can you read the job size, while the job becomes visible? If your work assigning algorithm is deterministic then each worker can know the workload of other workers (Because he knows which worker will get the job). Same can be done using list of jobs completed, in a different way. If you do not have any access to job sizes or list of completed jobs, I cannot see a solution to your problem. – Nuri Tasdemir Sep 27 '16 at 14:46

3 Answers

0
// Workers kept in a linked list used as a queue: the worker at the
// front is next in line, and a worker that accepts a job moves to
// the back. (A plain Queue<T> can't be scanned and reordered, so a
// LinkedList<Worker> is used instead.)
var _queue = new LinkedList<Worker>();

void Setup() {
    for (int i = 0; i < numOfWorkers; i++) {
        _queue.AddLast(new Worker());
    }
}

// Assigns the job to the next worker in line and moves it to the back.
void AcceptJob(Job j) {
    var node = FindNextAvailableWorker();
    node.Value.AssignNewJob(j);
    _queue.Remove(node);
    _queue.AddLast(node);
}

// Returns the first idle worker, or the front of the list if all are busy.
LinkedListNode<Worker> FindNextAvailableWorker() {
    for (var node = _queue.First; node != null; node = node.Next) {
        if (!node.Value.IsWorking) {
            return node;
        }
    }
    return _queue.First;
}
Dasith Wijes
  • 1,328
  • 12
  • 22
  • When you said "without coordination among them" I am assuming you mean without workers talking to each other. There can be an actor which coordinates all workers like above. – Dasith Wijes Sep 26 '16 at 16:51
0

The general idea: every node keeps track, for each node, of the work it has done so far, and this influences the work that node will get. The bookkeeping is deterministic, so all nodes compute the same result and never need to communicate. We still take a modulus, but a node with less accumulated work gets a bigger range of numbers.

Algorithm:

All workers do the same calculation. Each node holds an array with one element per node, containing that node's ID and the percentage of the total work done so far by that node, compared to the total work of all nodes together (5% of the total, 35%, ...). Call this nodeProportion.

The array is sorted by (100 - nodeProportion) + 0.001 * Node_ID. When a batch arrives, we take its HASH modulo 100 and get a number in the range 0-99; call this number K.

We walk the sorted array, subtracting each node's (100 - nodeProportion) from K until the result is zero or less. That node gets the work.

All nodes do the same calculation so they do not need to talk.
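The walk described above can be sketched as follows. Note one assumption the answer leaves implicit: the slices (100 - nodeProportion) sum to more than 100 when there are several nodes, so this sketch normalizes the hash value to the total weight to make the subtraction walk cover all slices:

```python
import hashlib

def pick_node(batch_id: int, work_done: list[float]) -> int:
    """Deterministic weighted pick: nodes with a smaller share of the
    work done so far get a bigger slice of the hash range.

    work_done[i] is node i's nodeProportion, in percent of total work.
    Every node runs this with identical inputs, so no communication
    is needed to agree on the winner.
    """
    # Each node's slice: the less work done, the bigger the slice.
    weights = [100.0 - p for p in work_done]
    # Deterministic order, tie-broken by node ID, same on every node.
    order = sorted(range(len(weights)), key=lambda i: weights[i] + 0.001 * i)
    total = sum(weights)
    # Hash the batch ID into [0, total).
    h = int(hashlib.sha256(str(batch_id).encode()).hexdigest(), 16)
    k = (h % 10000) / 10000.0 * total
    # Walk the sorted array, subtracting slices until we drop below zero.
    for i in order:
        k -= weights[i]
        if k < 0:
            return i
    return order[-1]  # guard against floating-point rounding
```

After assigning the batch, every node would update work_done identically (adding the batch size to the winner's tally), keeping the arrays in sync without any messages.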

O_Z
  • 1,515
  • 9
  • 11
  • How come "they do not need to talk" if "Each node holds an array ... (100-nodeProportion)+0.001*Node_ID ... We give the work to that node" ? You HAVE to update this array for each node, so they would be talking – Severin Pappadeux Sep 28 '16 at 16:08
  • @SeverinPappadeux All nodes do the same calculation so they have the same data without talking. They all choose the same node and all update the nodeProportion the same way. – O_Z Sep 30 '16 at 07:47
  • Isn't node proportion to be updated after calculation is done? Basically, only node with the job know if its free or not. And if it is free, it has to tell that to others – Severin Pappadeux Oct 02 '16 at 19:24
0

Ok, some considerations:

  • You don't want any metadata holder, node communication, etc. So the only good way would be some function X = distributor( arguments )
  • You already have a function of a very simple kind, X = ID % N, but size, apparently, matters
  • The function cannot depend on the size S alone, because then equal (big) sizes would be assigned to the same worker. We're looking for something like X = F(S, ID) % N
  • The function should produce a uniform result, so that the final modulo op provides a uniform load

The simplest function to try would be

X = hash( ID * S ) % N

With some good hash function, the product ID*S serves as a typical byte-array input for the hash, and jobs of the same size still get distributed across workers. Try it...
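A minimal sketch of this idea, using SHA-256 as the "good hash function" (an arbitrary choice; any hash with uniform output works, and sizes are assumed to be positive):

```python
import hashlib

def assign_worker(batch_id: int, size: int, n_workers: int) -> int:
    """X = hash(ID * S) % N: mix the size into the key before hashing,
    so that equal-size batches do not all land on the same worker."""
    digest = hashlib.sha256(str(batch_id * size).encode()).digest()
    return int.from_bytes(digest, "big") % n_workers
```

Every worker evaluates this for each incoming batch and self-assigns it only when the result equals its own index, preserving the no-coordination property of the original modulo scheme.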

Severin Pappadeux
  • 18,636
  • 3
  • 38
  • 64