Assume I have activities or tasks that

  1. should all be executed.
  2. have no predetermined duration, but some activities take longer than others.
  3. are not CPU bound and are subject to network/IO latency and transient errors.
  4. have dependencies on others; in the example below, C can only execute once A and B have completed.

What is the most appropriate algorithm to use to schedule activities to minimize the total time to complete all tasks? My current approach is less than optimal, because (in the example below) the way G is scheduled adds an additional delay of 20s to execution. The answer to this question got me down the path where I am.

Here's an example (if it was a DSL)

Task A
{
    Estimation: 10s;
}

Task B
{
    Estimation: 10s;
}

Task C
{
    Estimation: 10s;
    DependsOn A, B;
}

Task D
{
    Estimation: 10s;
    DependsOn C;
}

Task E
{
    Estimation: 10s;
    DependsOn C;
}

Task F
{
    Estimation: 10s;
    DependsOn E, D;
}

Task G
{
    Estimation: 30s;
    DependsOn A, B;
}

Here's what I did (in C#)

Created a graph (Directed acyclic graph) of activities.

The following code snippet is from a TaskManager class.

private static Graph<ITask> CreateGraph(IEnumerable<ITask> tasks)
{
    if (tasks == null)
        throw new ArgumentNullException(nameof(tasks));

    var nameMap = tasks.ToDictionary(task => task.Id);
    var graph = new Graph<ITask>(nameMap.Values);
    foreach (var task in nameMap.Values)
    {
        // Each entry in DependsOn is the Id of a task that must finish first.
        foreach (var dependencyId in task.DependsOn)
        {
            var from = nameMap[dependencyId];
            var to = task;
            graph.AddDependency(from, to);
        }
    }
    return graph;
}

Perform a Topological Sort

public static Node<T>[] Sort<T>(this Graph<T> graph) where T : IComparable
{
    var stack = new Stack<Node<T>>();
    var visited = new HashSet<Node<T>>();

    foreach (var node in graph)
    {
        if (!visited.Contains(node))
        {
            visited.Add(node);
            InternalSort(node, stack, visited);
        }
    }
    return stack.ToArray();
}

private static void InternalSort<T>(Node<T> node, Stack<Node<T>> stack, ISet<Node<T>> visited)
    where T : IComparable
{
    var dependants = node.Dependants;
    foreach (var dependant in dependants)
    {
        if (!visited.Contains(dependant))
        {
            visited.Add(dependant);
            InternalSort(dependant, stack, visited);
        }
    }
    stack.Push(node);
}

This gave me something like [F,E,D,C,G,B,A]. If I used dependencies instead of dependents, it would have been [A,B,C,G,D,E,F].

Assign a Level to Each Node

Now that I have an array of sorted nodes, the next step is to update the Level property of each node.

public static void Level<T>(this IEnumerable<Node<T>> nodes) where T : IComparable
{
    foreach (var sortedTask in nodes)
    {
        sortedTask.Level = CalculateLevel(sortedTask.Dependencies);
    }
}

public static int CalculateLevel<T>(ICollection<Node<T>> nodes) where T : IComparable
{
    if (nodes.Count <= 0) return 1;
    return nodes.Max(n => n.Level) + 1;
}

This gave me something like [F:1,G:1,E:2,D:2,C:3,B:4,A:4] where the letter is the activity name and the number is the level. If I did this in the reverse, it would have looked something like [F:4,E:3,D:3,G:2,C:2,B:1,A:1].

Group tasks

public static SortedDictionary<int, ISet<T>> Group<T>(this IEnumerable<Node<T>> nodes) where T : IComparable
{
    var taskGroups = new SortedDictionary<int, ISet<T>>();
    foreach (var sortedNode in nodes)
    {
        var key = sortedNode.Level;
        if (!taskGroups.ContainsKey(key))
        {
            taskGroups[key] = new SortedSet<T>();
        }
        taskGroups[key].Add(sortedNode.Value);
    }
    return taskGroups;
}

Execute Tasks

The following goes through each "level" and executes the tasks.

private async Task ExecuteAsync(IDictionary<int, ISet<ITask>> groups, ITaskContext context,
    CancellationToken cancellationToken)
{
    var keys = groups.Keys.OrderByDescending(i => i);
    foreach (var key in keys)
    {
        var tasks = groups[key];
        await Task.WhenAll(tasks.Select(task => task.ExecuteAsync(context, cancellationToken)));   
    }
}

The OrderByDescending is necessary when tasks are sorted from the most dependent to the least dependent node (F first, A or B last).

Problem

While this approach still executes faster than a sequential approach, no matter how I approach it, something is always waiting on G to complete. If G is grouped with C, then D and E are delayed by 20s even though they do not depend on G.

If I reverse the sorting (and adjust the code), then G only starts executing when F starts executing.
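
To make the 20s figure concrete, here is a minimal sketch that compares the two strategies using only the estimates from the DSL example. `MakespanEstimate`, `LevelByLevel` and `DependencyDriven` are hypothetical names introduced for illustration, not part of the TaskManager above, and the sketch assumes the graph is acyclic and the estimates are exact.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the question's TaskManager.
public static class MakespanEstimate
{
    // Total time when levels run one after another and each level waits for its slowest task.
    public static int LevelByLevel(IReadOnlyDictionary<string, (int Seconds, string[] DependsOn)> tasks)
    {
        var levels = new Dictionary<string, int>();
        int Level(string id)
        {
            if (levels.TryGetValue(id, out var known)) return known;
            var deps = tasks[id].DependsOn;
            var level = deps.Length == 0 ? 1 : deps.Max(d => Level(d)) + 1;
            levels[id] = level;
            return level;
        }
        return tasks.Keys
            .GroupBy(id => Level(id))
            .Sum(g => g.Max(id => tasks[id].Seconds));
    }

    // Total time when every task starts the moment its last dependency finishes.
    public static int DependencyDriven(IReadOnlyDictionary<string, (int Seconds, string[] DependsOn)> tasks)
    {
        var finish = new Dictionary<string, int>();
        int Finish(string id)
        {
            if (finish.TryGetValue(id, out var known)) return known;
            var deps = tasks[id].DependsOn;
            var start = deps.Length == 0 ? 0 : deps.Max(d => Finish(d));
            var end = start + tasks[id].Seconds;
            finish[id] = end;
            return end;
        }
        return tasks.Keys.Max(id => Finish(id));
    }
}
```

For the seven tasks in the example, `LevelByLevel` gives 60s (10 + 30 + 10 + 10, because the level containing C and G lasts as long as G), while `DependencyDriven` gives 40s (the longer of A→C→D→F and A→G). The difference is the 20s delay described above.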

bloudraak
  • How many tasks can execute in parallel? 1? Arbitrarily many? – j_random_hacker Jul 05 '16 at 15:51
  • An arbitrary number. For the workflow I'm working on, it varies from 1 to several thousand, but not tens of thousands. Since they are IO bound (mostly network), there are other optimizations that don't block threads, etc. It's also fine if execution is managed using a thread pool/queue. – bloudraak Jul 05 '16 at 17:48

1 Answer

Since you say (in a comment) that there is no limit on the number of tasks that can execute simultaneously, there's an easy solution:

  1. Set taskState[i] = UNSTARTED for every task i.
  2. For each task i that has no remaining dependencies (i.e., empty DependsOn list) and has not yet been started (i.e., taskState[i] == UNSTARTED) (note that sometimes there might be no such tasks):
    • Start the task.
    • Set taskState[i] = RUNNING.
  3. If there are no tasks currently running then stop -- either you have completed all tasks, or there is a cyclic dependency. (You can tell which by checking whether there is any task i such that taskState[i] == UNSTARTED.)
  4. Wait for any running task to complete. Let this be task i.
  5. Set taskState[i] = FINISHED.
  6. Loop through all tasks that have not yet been started, removing task i from each such task's DependsOn list if it exists.
  7. Goto 2.
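
The steps above can be sketched in C# roughly as follows. This is an illustrative sketch, not a definitive implementation: `DependencyScheduler`, `RunAllAsync`, and the `dependsOn` and `run` parameters are hypothetical names, and tasks are identified by plain strings rather than the question's ITask. `Task.WhenAny` is the standard TPL call for "wait for any running task". Passing a `run` delegate means no work is started until the scheduler asks for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper for illustration; names and signatures are assumptions.
public static class DependencyScheduler
{
    // dependsOn: task id -> ids it depends on. run: starts the work for one task id.
    public static async Task RunAllAsync(
        IReadOnlyDictionary<string, string[]> dependsOn,
        Func<string, Task> run)
    {
        // Step 1: every task is unstarted and gets a mutable copy of its DependsOn list.
        var unstarted = dependsOn.ToDictionary(kv => kv.Key, kv => new HashSet<string>(kv.Value));
        var running = new Dictionary<string, Task>();

        while (true)
        {
            // Step 2: start every unstarted task that has no remaining dependencies.
            var ready = unstarted.Where(kv => kv.Value.Count == 0).Select(kv => kv.Key).ToList();
            foreach (var id in ready)
            {
                unstarted.Remove(id);
                running.Add(id, run(id));
            }

            // Step 3: nothing running means we are done, or the rest can never start.
            if (running.Count == 0)
            {
                if (unstarted.Count > 0)
                    throw new InvalidOperationException(
                        "Cyclic or unknown dependency among: " + string.Join(", ", unstarted.Keys));
                return;
            }

            // Step 4: wait for any running task to complete.
            var finished = await Task.WhenAny(running.Values);
            var finishedId = running.First(kv => kv.Value == finished).Key;

            // Step 5: mark it finished; awaiting it rethrows if the task failed.
            running.Remove(finishedId);
            await finished;

            // Step 6: remove the finished task from every unstarted task's dependencies.
            foreach (var remainingDeps in unstarted.Values)
                remainingDeps.Remove(finishedId);
        }
    }
}
```

With the graph from the question and a `run` delegate that awaits the estimated duration, G would be started in the same pass as C, as soon as A and B have both finished. Note that this sketch stops at the first failed task and leaves the other running tasks unobserved; retrying transient errors would need to happen inside `run`.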
j_random_hacker
  • The challenge is step 4, especially if you're running hundreds of concurrent tasks. How do you effectively wait for that many tasks without running into the same issue described in the OP? – bloudraak Jul 07 '16 at 08:07
  • Googling "wait for any task C#" got me the answer on the first hit: https://msdn.microsoft.com/en-us/library/dd537610(v=vs.100).aspx – j_random_hacker Jul 07 '16 at 10:10
  • If this solved your problem, how about a tick/upvote? If it didn't, what problems are left to solve? – j_random_hacker Jul 13 '16 at 10:59
  • The algorithm worked conceptually. But I don't think I'll understand this a year from now, as I inferred a lot of context. Using unsorted arrays is also inefficient, so I stuck with the graph. – bloudraak Jul 30 '16 at 06:44
  • "Using unsorted arrays is also inefficient" -- are you referring to choosing a task to run in step 2? You can just keep all tasks in a priority queue (heap) ordered by `length(DependsOn)` to make this step O(log n) instead of O(n). – j_random_hacker Aug 01 '16 at 14:45
  • I used a topological sort to sort tasks. What you also don't account for is that the TPL immediately queues a task to be executed when created. So a TPL task has to be wrapped to allow for the algorithm to work. TPL tasks also have no notion of dependencies. It would be best to update the answer with the missing pieces. – bloudraak Aug 01 '16 at 14:59
  • I don't know what "the TPL" is. If you want, you can update the answer with it yourself, though if it's something C#-specific it would be better left out, IMO. – j_random_hacker Aug 01 '16 at 15:09