
Imagine I have M independent jobs, each with N steps. The jobs are independent of each other, but the steps within a job must run serially: J(i,j) may start only after J(i,j-1) has finished (i is the job index and j is the step). This is isomorphic to building a wall M blocks wide and N blocks high.

Each block of work must be executed exactly once. The time it takes one CPU to do one block (all roughly of the same order of magnitude) varies from block to block and is not known in advance.

The simple way of doing this with MPI is to assign blocks of work to processors and wait until all of them have finished their blocks before making the next assignment. This way we can ensure that the ordering constraints are enforced, but there will be a lot of waiting time.
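
To make that baseline concrete, here is a minimal sketch of the lock-step scheme in C with MPI; the sizes, the strided job split, and the `do_step` work function are illustrative assumptions, not anything from the original setup:

```c
/* Minimal sketch of the lock-step scheme in C with MPI. The sizes,
 * the strided job split, and do_step() are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

enum { M = 64, N = 100 };   /* example wall: M jobs wide, N steps high */

static void do_step(int job, int step) {
    (void)job; (void)step;
    usleep(1000 + rand() % 5000);   /* fake variable, unknown cost */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int step = 0; step < N; ++step) {
        /* Each rank takes a strided slice of the M jobs at this step. */
        for (int job = rank; job < M; job += nprocs)
            do_step(job, step);
        /* Everyone waits here, so the slowest block gates the whole
           row: this barrier is where the waiting time comes from. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```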

Is there a more efficient way of doing this? I mean, when a processor finishes its block, could it decide on its own which block to do next, using some kind of environment variable or shared memory, without waiting for the other processors to finish their blocks and making a collective decision through communication?

  • This sounds similar to what h.265 (the video codec) does with [Wavefront Parallel Processing](http://x265.readthedocs.io/en/default/threading.html#wavefront-parallel-processing), where each block of video has a dependency on the block above it and to the left. Restricting dependencies to that pattern allows a lot more parallelism than with arbitrary dependencies. You might want to look at how that system is designed to get ideas for yours, except that you apparently don't have any dependencies between jobs, but you still want them to wait for each other. – Peter Cordes Aug 30 '16 at 11:59
  • Why is it important that all M jobs are at a similar stage of progress? Would it be ok if some of the M jobs didn't start until others had finished (i.e. put all N steps into a single job-scheduler job)? Or would that lead to a situation where you only had a couple serial jobs left, so you can't take advantage of all your CPUs? Depending on the size of each step, cache might be important, so doing multiple steps on the same data on the same machine (or even the same CPU of the same cluster node) could be important. – Peter Cordes Aug 30 '16 at 12:05
  • Another simple way may be to assign one CPU (for example p0) as Scheduler/Arbiter such that every CPU needs to register when it's free. p0 could adaptively order jobs/blocks and assign them to CPUs. This will introduce some communication overhead though. Something similar could be done with shared memory where whoever is free takes the next block, while the task of creating the "order" for easy picking is shared around – makadev Aug 30 '16 at 12:07
  • @PeterCordes It does not matter if the jobs aren't at a similar stage; if M = integer * num_cpu, that would work (even if not optimal). – Amir Hajibabaei Sep 01 '16 at 09:42
  • @makadev The solution with shared memory is interesting. If I can have a few shared variables which are updated instantly in all processors (when they are altered by a specific CPU), I think it is possible to find a good solution. Not sure if it's possible to have such variables with Intel MPI. – Amir Hajibabaei Sep 01 '16 at 09:50
  • @AmirHajibabaei I believe so. At least MPI 3 [has sufficient](http://www.mpich.org/static/docs/v3.2/www3/MPI_Win_allocate_shared.html) functionality which should be supported by recent Intel MPI implementations. I didn't use it myself though, so I can't tell exactly how to use it, but there is a [stack overflow entry+comments](http://stackoverflow.com/a/17112315/3828957) which has a few specifics. – makadev Sep 01 '16 at 12:54

2 Answers


You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.

If W is close to M, the best you can do is simply assign the jobs 1:1. If one worker finishes early, that's fine.

If W is much smaller than M, and N is also fairly large, here is an idea:

  1. Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
  2. Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current job is allowed to ensure forward progress.
  3. Have the workers communicate to each other which steps they completed.
  4. Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).

The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.

It's up to you to find a reasonable K. Maybe try 10.
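
A hedged sketch of this loop in C with MPI follows. The sizes, the initial T, K = 10, the even split, and `do_step` are all illustrative assumptions; progress is exchanged here with `MPI_Allreduce`, though the answer only requires that workers communicate somehow:

```c
/* Hedged sketch of the timeout-and-rebalance loop. The sizes, the
 * initial T, K = 10, and do_step() are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

enum { M = 64, N = 100, K = 10 };

static void do_step(int job, int step) {
    (void)job; (void)step;
    usleep(1000 + rand() % 5000);   /* fake variable, unknown cost */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, W;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &W);

    int done[M] = {0};      /* globally known steps completed per job */
    double T = 0.003;       /* step 1: estimate of one step's time */
    int remaining = M * N;

    while (remaining > 0) {
        /* Step 2: naive even split; weighting jobs by their remaining
           steps N - done[j] would implement step 4 more faithfully. */
        int lo = rank * M / W, hi = (rank + 1) * M / W;

        int my_done[M] = {0};
        double t0 = MPI_Wtime();
        int j = lo;
        while (j < hi && MPI_Wtime() - t0 < T * N / K) {
            if (done[j] + my_done[j] < N) {
                do_step(j, done[j] + my_done[j]);
                my_done[j]++;
            } else {
                j++;        /* this job is finished; take the next one */
            }
        }

        /* Step 3: everyone learns what was completed this round. */
        int round[M];
        MPI_Allreduce(my_done, round, M, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < M; ++i) {
            done[i] += round[i];
            remaining -= round[i];
        }
        /* T could be re-estimated here from measured step times. */
    }

    MPI_Finalize();
    return 0;
}
```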

John Zwinck
  • Similar in some ways to my answer, where I suggested a don't-get-too-far-ahead-of-the-least-complete-task step-counting idea. I'm sure there's something interesting to say about how these will differ in behaviour, but I'm drawing a blank right now. – Peter Cordes Aug 30 '16 at 12:22
  • @JohnZwinck I like the idea. If there isn't a solution using some kind of shared variables in MPI, statistical minimization is really the only solution. – Amir Hajibabaei Sep 01 '16 at 10:18
  • @AmirHajibabaei: MPI is for Message Passing, not Shared Memory. If you want the latter you can read up on RDMA for example. – John Zwinck Sep 01 '16 at 12:09
  • @AmirHajibabaei, MPI provides both portable access to shared memory and remote memory access. The former works only for processes that share a single node while the latter is not very efficient except with a handful of HPC vendor MPI implementations + the corresponding hardware. Dedicating one rank to act as work dispatcher is usually the simplest solution. That rank could also perform work in a separate thread or use a non-blocking mechanism like `MPI_Iprobe`. – Hristo Iliev Sep 01 '16 at 12:24

Here's an idea, IDK if it's good:

Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.

When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.

I picked 4 to balance too much switching against having too much serial work left in only a couple of tasks. Adjust as necessary, and maybe make it adaptive as you near the end.

Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
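
For the single shared-memory machine case, here is a rough sketch in C with POSIX threads standing in for MPI ranks. The mutex-protected progress table plays the role of the shared variable n; the sizes, the window of 4, and `do_step` are illustrative assumptions:

```c
/* Rough sketch of the progress-window idea, using POSIX threads in
 * place of MPI ranks. The sizes, the window of 4, and do_step() are
 * illustrative assumptions. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

enum { M = 8, N = 20, W = 4, WINDOW = 4 };

static int progress[M];         /* steps completed per job */
static int claimed[M];          /* 1 while some worker owns the job */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void do_step(int job, int step) {
    (void)step;
    usleep(1000u * (1u + (unsigned)job % 3u));  /* fake variable cost */
}

static int min_progress(void) { /* n: the farthest-behind job's step */
    int n = N;
    for (int j = 0; j < M; ++j)
        if (progress[j] < n) n = progress[j];
    return n;
}

/* Claim an unfinished, unclaimed job no more than WINDOW steps ahead
 * of the farthest-behind one. Returns -1 if nothing is eligible. */
static int claim_job(void) {
    int pick = -1;
    pthread_mutex_lock(&lock);
    int n = min_progress();
    for (int j = 0; j < M; ++j)
        if (!claimed[j] && progress[j] < N && progress[j] < n + WINDOW) {
            claimed[j] = 1;
            pick = j;
            break;
        }
    pthread_mutex_unlock(&lock);
    return pick;
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int j = claim_job();
        if (j < 0) {
            pthread_mutex_lock(&lock);
            int all_done = (min_progress() == N);
            pthread_mutex_unlock(&lock);
            if (all_done) break;
            usleep(100);        /* eligible jobs all claimed; retry */
            continue;
        }
        do_step(j, progress[j]);    /* safe: we own this job's claim */
        pthread_mutex_lock(&lock);
        progress[j]++;
        claimed[j] = 0;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[W];
    for (int i = 0; i < W; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < W; ++i) pthread_join(t[i], NULL);
    printf("all %d jobs completed %d steps each\n", M, N);
    return 0;
}
```

Compile with `-pthread`. On a cluster, the table would have to be replaced with an `MPI_Win_allocate_shared` window (single node only) or with broadcast progress messages, as discussed in the comments.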

Peter Cordes
  • This idea is appealing but I'm not sure how you'd implement it using MPI. – John Zwinck Aug 30 '16 at 12:27
  • @JohnZwinck: It's been years since I looked at MPI, and I never really did much of anything with it. If each worker maintains its own table of the progress of each job, they could broadcast on completion of each step so other workers could update their tables. You still have a problem of switching tasks without having two workers claim the same task, though. That's a solved problem for shared memory, and I'd be shocked if there wasn't some way to do it with MPI. – Peter Cordes Aug 30 '16 at 12:31
  • @AmirHajibabaei: Only in a cache-coherent shared-memory system, like multiple threads on a single SMP machine. Then yes, you can have atomic lock-free shared variables. MPI is designed for message-passing, not shared memory, so AFAIK it won't really help you do this even if all your MPI workers are on the same machine. – Peter Cordes Sep 01 '16 at 10:25
  • @PeterCordes is it possible to have shared variables in MPI which are updated instantly in all processors (if they are altered by a specific cpu)? That would be a game changer. In the case of scheduler, communication would lead to synchronization of jobs, unless we assign a master to do only scheduling, which might be efficient if we have lots of processors. – Amir Hajibabaei Sep 01 '16 at 10:29
  • @AmirHajibabaei: I literally just answered that question. Are you running your MPI jobs on a single multi-processor machine? Then yes. Or are you running on a cluster? Then no. If you assign one thread to do scheduling, it will probably sleep most of the time, so you should choose the size of your MPI job accordingly (number of available cores + 1). – Peter Cordes Sep 01 '16 at 10:40