Optimized algorithm to schedule tasks with dependency?

Question

There are tasks that each read from a file, do some processing, and write to a file. These tasks must be scheduled according to their dependencies. Tasks can also run in parallel, so the algorithm should run dependent tasks in serial and everything else in parallel as far as possible.

E.g.:

1. A -> B
2. A -> C
3. B -> D
4. E -> F

So one way to run this would be to run 1, 2 & 4 in parallel, followed by 3.

Another way could be to run 1 and then run 2, 3 & 4 in parallel.

Another could be to run 1 and 3 in serial, with 2 and 4 in parallel.

Any ideas?

What is `A,B,...`? Does running `1` & `2` in parallel imply that `A` is run twice? Is that a bad thing? — Jacob, Aug 19 '13 at 13:05
my understanding is that A,B ... are tasks, and 1,2,3 are dependence declaration.I would say that typically D depends on B that depends on A, and so on. — njzk2, Aug 19 '13 at 13:17
@njzk2 1,2,3 and 4 are tasks. A, B etc. are files. So task 1 reads from A and writes to B. So Task 3 cannot start unless task 1 finishes. — user2186138, Aug 20 '13 at 08:03
@user2186138 : ok, then you need to start by modeling this as '1->3' (which is then the only dependency) — njzk2, Aug 20 '13 at 08:27
Does this answer your question? [Execution of Directed Acyclic Graph of tasks in parallel](https://stackoverflow.com/questions/63354899/execution-of-directed-acyclic-graph-of-tasks-in-parallel) — Anmol Singh Jaggi, May 08 '21 at 15:35

score 16 · Accepted Answer · edited Jun 20 '20 at 09:12

Let each task (e.g. A, B, ...) be a node in a directed acyclic graph, and define the arcs between the nodes based on your lines 1, 2, ....

You can then topologically order your graph (or use a search-based method like BFS). In your example, C<-A->B->D and E->F, so A & E have a depth of 0 and need to be run first. Then you can run F, B and C in parallel, followed by D.

Also, take a look at PERT.

Update:

How do you know whether B has a higher priority than F?

This is the intuition behind the topological sort used to find the ordering.

It first finds the root nodes (those with no incoming edges), at least one of which must exist in a DAG. In your case, that's A & E. This settles the first round of jobs that need to be completed. Next, the children of the root nodes (B, C and F) need to be finished; these are easily obtained by querying your graph. The process is then repeated until there are no more nodes (jobs) to be found (finished).
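The round-by-round idea above can be sketched in Python as follows. This is only an illustration under the answer's reading of the example (letters as nodes, each line as an arc); the function name `parallel_levels` and the edge-list input format are assumptions, not something given in the question.

```python
from collections import defaultdict

def parallel_levels(edges):
    """Group the nodes of a DAG into rounds; a round depends only on earlier rounds."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Roots are the nodes with no incoming edges.
    current = sorted(n for n in nodes if indegree[n] == 0)
    levels = []
    seen = 0
    while current:
        levels.append(current)
        seen += len(current)
        next_level = []
        for node in current:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all of its parents are finished
                    next_level.append(child)
        current = sorted(next_level)
    if seen != len(nodes):
        raise ValueError("graph contains a cycle")
    return levels

edges = [("A", "B"), ("A", "C"), ("B", "D"), ("E", "F")]
print(parallel_levels(edges))  # [['A', 'E'], ['B', 'C', 'F'], ['D']]
```

Each inner list is one round whose members can run in parallel. A node is placed in the earliest round where all of its parents are done, which is why F lands next to B and C.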

answered Aug 19 '13 at 13:25 · Jacob

How does this algorithm know that B should have a higher priority than F? – njzk2 Aug 19 '13 at 13:30
That is accomplished by your graph search algorithm. You can construct the graph easily by going down your list of arcs (e.g.`1.A->B`). Read the algorithms for topological ordering. – Jacob Aug 19 '13 at 13:33
I think topological ordering is not sufficient, as B and F having no relationship to one another, they cannot be ordered this way. A priority system must be added, I think using the number of dependency for a given node. – njzk2 Aug 19 '13 at 13:41
Why shouldn't `B` & `F` be processed at the same time? – Jacob Aug 19 '13 at 13:44
I am not saying they shouldn't. I'm saying that if one of (B, F) must be chosen to be processed, it should be B, because processing B doesn't reduce the pool of tasks available to be processed (if this pool is smaller than your pool of executors, you are wasting execution time) – njzk2 Aug 19 '13 at 13:50
example in this case, 2 executors, equal execution time for all tasks : round 1: A+E, 2nd round: C+F, 3rd round: only B is available, 4th round: D. If B is seen as more important => 1: A+E, 2:B+C, 3:D+F – njzk2 Aug 19 '13 at 13:53
Sure, there are several ways to parallelize based on the number of workers. We don't know if that's a limitation yet ; the OP is concerned with dependencies which is solved by the ordering. – Jacob Aug 19 '13 at 13:57
Agreed. Topological sorting is totally the proper way to solve the dependence resolution. – njzk2 Aug 19 '13 at 14:09
Sorry for the confusion. The numbered lines are the tasks. A, B, C, D are files. So task 1 reads from A and loads into B. But to load D you need to read B. So it has a dependency on task 1. – user2186138 Aug 19 '13 at 19:27
@Jacob A, B, C, D are not tasks but files. – user2186138 Aug 19 '13 at 19:29
What's the complexity of this algorithm? – Rajan May 24 '17 at 00:39

score 10 · Answer 2 · answered Aug 20 '13 at 05:19

Given a mapping between items, and items they depend on, a topological sort orders items so that no item precedes an item it depends upon.

This Rosetta Code task has a solution in Python that can tell you which items are available to be processed in parallel.

Given your input the code becomes:

try:
    from functools import reduce
except ImportError:  # very old Python 2, where reduce is only a builtin
    pass

data = { # From: http://stackoverflow.com/questions/18314250/optimized-algorithm-to-schedule-tasks-with-dependency
    # This   <-   This  (Reverse of how shown in question)
    'B':         set(['A']),
    'C':         set(['A']),
    'D':         set(['B']),
    'F':         set(['E']),
    }

def toposort2(data):
    for k, v in data.items():
        v.discard(k) # Ignore self dependencies
    extra_items_in_deps = reduce(set.union, data.values()) - set(data.keys())
    data.update({item:set() for item in extra_items_in_deps})
    while True:
        ordered = set(item for item,dep in data.items() if not dep)
        if not ordered:
            break
        yield ' '.join(sorted(ordered))
        data = {item: (dep - ordered) for item,dep in data.items()
                if item not in ordered}
    assert not data, "A cyclic dependency exists amongst %r" % data

print ('\n'.join( toposort2(data) ))

Which then generates this output:

A E
B C F
D

Items on one line of the output can be processed in any order or, indeed, in parallel, as long as all items on earlier lines are processed before items on later lines, which preserves the dependencies.
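If the numbered lines are the tasks and the letters are files (as the asker clarifies in the comments), the input mapping has to be derived from who writes what. A minimal sketch, assuming each task reads exactly one file and each file has at most one writer; the name `task_dependencies` and the `(input_file, output_file)` tuple layout are made up for illustration:

```python
def task_dependencies(tasks):
    """tasks maps a task id to (input_file, output_file).

    Returns {task: set of tasks whose output file it reads}.
    """
    # Assumption: at most one task writes any given file.
    writer_of = {out: task for task, (_, out) in tasks.items()}
    return {
        task: ({writer_of[inp]} if inp in writer_of else set())
        for task, (inp, _) in tasks.items()
    }

# String ids so the result can be joined and sorted like the data above.
tasks = {"1": ("A", "B"), "2": ("A", "C"), "3": ("B", "D"), "4": ("E", "F")}
deps = task_dependencies(tasks)
print(deps)
```

Here only task 3 ends up with a dependency (on task 1, which writes file B). The resulting dictionary has the same shape as `data`, so it could be passed to `toposort2` in its place.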

A, B, C etc. are just file names. The tasks are numbered 1-4. So task 1 reads from A and loads file B and so on. — user2186138, Aug 20 '13 at 07:28
so basically I'll have to take these tasks, build a dependency relationship between them like 3 --> 1 and then make use of topological sort. — user2186138, Aug 20 '13 at 07:29
Yep. That's right. Would that cause a problem @user2186138? (P.S. You have a large proportion of questions where you have not accepted any answer). — Paddy3118, Aug 21 '13 at 04:29

score 2 · Answer 3 · answered Aug 19 '13 at 13:28

Your tasks form a directed graph with (hopefully) no cycles.

It contains sources and wells: sources are tasks that depend on nothing (no inbound edge), and wells are tasks that unlock no other task (no outbound edge).

A simple solution would be to give priority to your tasks based on their usefulness (let's call that U).

Start with the wells: they have a usefulness U = 1, because we want them to finish.

Put all the wells' predecessors in a list L of nodes currently being assessed.

Then, for each node in L, its U value is the sum of the U values of the nodes that depend on it, plus 1. Put all parents of the current node in the list L.

Loop until all nodes have been treated.

Then start the task that can be started and has the biggest U value, because it is the one that will unlock the largest number of tasks.

In your example,

U(C) = U(D) = U(F) = 1
U(B) = U(E) = 2
U(A) = 4

This means you'll start A first, together with E if possible, then B and C (if possible), then D and F.
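As a rough sketch of this priority scheme, the snippet below computes U recursively and then simulates rounds with a fixed number of workers. It assumes an acyclic graph and equal execution time for all tasks; `usefulness`, `schedule`, and the `children` mapping are illustrative names, not part of the question.

```python
from functools import lru_cache

def usefulness(children):
    """U(node) = 1 + sum of U over the nodes that directly depend on it."""
    @lru_cache(maxsize=None)
    def u(node):
        return 1 + sum(u(child) for child in children.get(node, ()))
    nodes = set(children) | {c for cs in children.values() for c in cs}
    return {node: u(node) for node in nodes}

def schedule(children, workers):
    """Simulate rounds of equal-length tasks, always picking the highest-U ready tasks."""
    u = usefulness(children)
    pending = {node: 0 for node in u}  # number of unfinished dependencies
    for cs in children.values():
        for c in cs:
            pending[c] += 1
    ready = [n for n, count in pending.items() if count == 0]
    rounds = []
    while ready:
        ready.sort(key=lambda n: (-u[n], n))  # biggest U first, name breaks ties
        batch, ready = ready[:workers], ready[workers:]
        rounds.append(batch)
        for node in batch:
            for c in children.get(node, ()):
                pending[c] -= 1
                if pending[c] == 0:
                    ready.append(c)
    return rounds

children = {"A": ["B", "C"], "B": ["D"], "E": ["F"]}
print(usefulness(children))
print(schedule(children, workers=2))  # [['A', 'E'], ['B', 'C'], ['D', 'F']]
```

With two workers the simulation finishes in three rounds, because B is preferred over F in the second round and D is therefore unlocked in time for the third.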

score 1 · Answer 4 · answered Aug 19 '13 at 13:39

First generate a topological ordering of your tasks, and check for cycles at this stage. After that, you can exploit parallelism by looking at maximal antichains. Roughly speaking, these are sets of tasks with no dependencies between their elements.

For a theoretical perspective, this paper covers the topic.
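One way to try this in practice, assuming Python 3.9 or newer, is the standard library's `graphlib.TopologicalSorter`, which detects cycles and hands out batches of nodes that are ready to run. The wrapper `ready_batches` below is a hypothetical helper. Each batch it yields is an antichain, though not necessarily a maximal one.

```python
from graphlib import TopologicalSorter, CycleError

def ready_batches(predecessors):
    """Yield tuples of nodes whose dependencies have all completed."""
    sorter = TopologicalSorter(predecessors)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    while sorter.is_active():
        batch = sorter.get_ready()
        yield batch
        sorter.done(*batch)  # mark the whole batch as finished

# Maps each node to the set of nodes it depends on.
graph = {"B": {"A"}, "C": {"A"}, "D": {"B"}, "F": {"E"}}
for batch in ready_batches(graph):
    print(sorted(batch))
```

For the example this prints `['A', 'E']`, then `['B', 'C', 'F']`, then `['D']`.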

score 0 · Answer 5 · answered Dec 22 '14 at 23:00

Without considering the serial/parallel aspect of the problem, this code can at least determine the overall serial solution:

def order_tasks(num_tasks, task_pair_list):
    # task_deps[t] holds the set of tasks that t still waits on
    task_deps = {task: set() for task in range(num_tasks)}

    # store the dependencies
    for pair in task_pair_list:
        task_deps[pair.task].add(pair.dependency)

    # loop until every task has been ordered
    while task_deps:
        delete_task = None

        # find a task with no dependencies
        for task, deps in task_deps.items():
            if not deps:
                delete_task = task
                print(task)
                break

        if delete_task is None:
            return -1  # the remaining tasks depend on each other (cycle)

        del task_deps[delete_task]

        # remove delete_task from each remaining task's dependencies
        for deps in task_deps.values():
            deps.discard(delete_task)

    return 0

To take advantage of parallelism, change the loop that looks for a task with no dependencies so that it scans the entire list and executes and removes, all at the same time, every task whose dependencies have been fully satisfied.
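A sketch of that parallel variant, using a thread pool to run each round of unblocked tasks. The function `run_in_rounds`, its `run_task` callback, and the dictionary input format are assumptions made for this example; it also assumes every dependency appears as a key in `task_deps`.

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_rounds(task_deps, run_task, max_workers=4):
    """task_deps maps each task to the set of tasks it waits on.

    Runs every currently unblocked task concurrently, round by round.
    """
    remaining = {task: set(deps) for task, deps in task_deps.items()}
    rounds = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            ready = [t for t, deps in remaining.items() if not deps]
            if not ready:
                raise ValueError("cyclic dependency among %r" % list(remaining))
            list(pool.map(run_task, ready))  # blocks until the whole round is done
            rounds.append(ready)
            for t in ready:
                del remaining[t]
            for deps in remaining.values():
                deps.difference_update(ready)
    return rounds

finished = []
rounds = run_in_rounds({1: set(), 2: set(), 3: {1}, 4: set()}, finished.append)
print(rounds)  # [[1, 2, 4], [3]]
```

Waiting for a whole round keeps the logic simple, but a slow task holds back the next round. Submitting each task as soon as its own dependencies finish would avoid that, at the cost of more bookkeeping.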
