
When writing parallel programs I often run into the problem of doing something with every unique pair of values. In other words, it is the handshaking problem: each man has to shake hands with everyone else. After the handshaking there is a dinner waiting for all participants.

There are two approaches I'm aware of for doing this:

  1. Number all the men and parallelize over them, with each man shaking hands with every man that has a lower number. Dinner will be cold before the man with the last number has asked everyone to shake his hand.

  2. Tell everyone how many participants there are and derive a condition from that value, so that everyone shakes approximately the same number of hands in parallel. Dinner will be cold because the participants are bad at math. (In this case people can act as someone else.)

We can also swap men for numbers and handshaking for comparing, multiplying, etc. The problem is that we want to eat that dinner before it gets cold, and waiting for one thread or evaluating a lot of conditions slows the process down.
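To make the two approaches concrete, here is a minimal sketch in plain C (the names `N`, `shake()` and `pair_from_index()` are placeholders I've made up for illustration): approach 1 is the triangular loop with its skewed load, and approach 2 flattens the N*(N-1)/2 unique pairs into a single index range that can be split evenly across threads or work-items.

```c
#include <math.h>
#include <stdio.h>

#define N 8                          /* number of "men" (assumed) */

static void shake(int a, int b)      /* stand-in for the real pairwise work */
{
    printf("%d shakes hands with %d\n", a, b);
}

/* Approach 1: man i greets everyone with a lower number.
   Easy to write, but man N-1 does N-1 handshakes while man 0 does none. */
static void approach1(int i)
{
    for (int j = 0; j < i; ++j)
        shake(i, j);
}

/* Approach 2: decode a flat pair index k in [0, N*(N-1)/2) back to (i, j)
   with j < i, so the k-range can be split evenly across threads. */
static void pair_from_index(long k, int *i, int *j)
{
    int a = (int)((1.0 + sqrt(1.0 + 8.0 * (double)k)) / 2.0);
    while ((long)a * (a - 1) / 2 > k)  --a;   /* guard against FP rounding */
    while ((long)(a + 1) * a / 2 <= k) ++a;
    *i = a;
    *j = (int)(k - (long)a * (a - 1) / 2);
}

int main(void)
{
    puts("approach 1: man i greets all lower-numbered men");
    for (int i = 0; i < N; ++i)
        approach1(i);

    puts("approach 2: evenly splittable pair indices");
    long pairs = (long)N * (N - 1) / 2;
    for (long k = 0; k < pairs; ++k) {   /* in practice: one k-range per thread */
        int i, j;
        pair_from_index(k, &i, &j);
        shake(i, j);
    }
    return 0;
}
```

The decoding in approach 2 is the "offset computation" I complain about below: every thread pays a little arithmetic up front in exchange for an even split of the pairs.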

Is there another, more efficient way to do this?

Raven
  • Why do you assume that participants are bad at math? – Simon Forsberg Jul 04 '12 at 17:30
  • Everyone has to create his personal condition by computing an offset, and that is already some overhead. And as I mentioned, people can act as someone else, but first they need to know who they are, and that also takes some time to determine. – Raven Jul 04 '12 at 17:40
  • You want to recast the `MPI_ALLTOALL` scaling problem to dinner handshaking? :) – Hristo Iliev Jul 04 '12 at 18:44
  • @HristoIliev Had to google that one. "Sends data from all to all processes" isn't exactly what I'm trying to do, but now that you mention it I believe it uses the same principle (make pairs and communicate the data). Not that I mind looking at new libs, but an explanation would still be preferred :) It might also be interesting to note that I will be doing this on a GPU with OpenCL, so e.g. thread spawning won't be very helpful. – Raven Jul 04 '12 at 18:57
  • You didn't mention in the beginning what kind of parallel programming you are doing. The most efficient all-to-all implementation on high-latency networks uses hierarchical communication (known as Bruck's algorithm), while with a large number of processes and nodes ring-like p2p works best, e.g. `i -> mod(i+1, N)`, then `i -> mod(i+2, N)`, ..., `i -> mod(i+N-1, N)` for each `i` in `[0 .. N-1]` (`A -> B` means `A` sends to `B`; communications are duplex). – Hristo Iliev Jul 04 '12 at 19:12
  • @HristoIliev Sorry for not mentioning it; I was hoping for a more general solution (hence it was just a note). Two comments on that algorithm: while GPU device memory really is high-latency, I'm running this in fast local (shared) memory. And I tried it on paper: it generates N*(N-1) pairs, while with duplex communication only half of that would be needed (when A sends to B, there's no need for B to send to A, or am I missing something? EDIT: Ah, ring-like, that's it). I now see that this is going to be conceptually too specific for GPU hardware. – Raven Jul 04 '12 at 19:38
  • Will we discuss Philosophy at this dinner? – RBarryYoung Jul 04 '12 at 19:44

1 Answer


I'm not sure how much you are constrained by "bad at math", but you might have a look at How to automatically generate a sports league schedule, which has a good answer that refers to http://en.wikipedia.org/wiki/Round-robin_tournament.
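For concreteness, here is a minimal sketch in plain C of the "circle" scheduling method described on that Wikipedia page (assuming `N` is even; if it is odd, add a dummy participant whose partner sits the round out). Every participant has exactly one partner per round, so each of the N-1 rounds can run fully in parallel, with a synchronisation point between rounds.

```c
#include <stdio.h>

#define N 6   /* number of participants, assumed even */

int main(void)
{
    /* N-1 rounds; in each round every participant appears in exactly one pair */
    for (int r = 0; r < N - 1; ++r) {
        printf("round %d:", r);
        printf("  (%d,%d)", N - 1, r);          /* fixed player N-1 meets player r */
        for (int k = 1; k < N / 2; ++k) {
            int a = (r + k) % (N - 1);          /* the remaining players rotate */
            int b = (r - k + (N - 1)) % (N - 1);
            printf("  (%d,%d)", a, b);
        }
        printf("\n");
    }
    return 0;
}
```

Each pair occurs exactly once over the N-1 rounds, and the per-round pairing needs only the round number and the participant's own index, which keeps the per-participant "math" very small.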

In working out how efficient or not a synchronisation primitive is, you need to keep track separately of time spent waiting for lagging processes to catch up, and time spent in the synchronisation itself. If your time is actually being spent waiting for laggards to catch up, you need to speed them up (e.g. by spreading work around more evenly) or avoid the necessity of waiting for them - speeding up what happens after everybody has waited for everybody else may not help much.
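As a rough illustration of that bookkeeping, here is a hedged sketch using OpenMP (`do_work()` is a placeholder for your real per-thread work): each thread times its own work and then its wait at the barrier. The slowest thread's wait is roughly the pure synchronisation cost; everything beyond that on the other threads is time spent waiting for laggards.

```c
#include <omp.h>
#include <stdio.h>

static void do_work(int tid) { (void)tid; /* placeholder for real per-thread work */ }

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        double t0 = omp_get_wtime();
        do_work(tid);
        double t1 = omp_get_wtime();   /* work done, start waiting */

        #pragma omp barrier            /* everyone waits for the slowest thread */
        double t2 = omp_get_wtime();

        printf("thread %d: work %.6f s, waited %.6f s\n",
               tid, t1 - t0, t2 - t1);
    }
    return 0;
}
```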

mcdowella