Topological Sort with Grouping

Question

OK, so in topological sorting, depending on the input data, there are usually multiple correct orders in which the graph can be "processed" so that all dependencies come before the nodes that depend on them. However, I'm looking for a slightly different answer:

Suppose the following data: a -> b and c -> d (a must come before b and c must come before d).
With just these two constraints we have multiple candidate solutions: (a b c d, a c d b, c a b d, etc). However, I'm looking to create a method of "grouping" these nodes so that after the processing of a group, all of the entries in the next group have their dependencies taken care of. For the above supposed data I'd be looking for a grouping like (a, c) (b, d). Within each group it doesn't matter which order the nodes are processed (a before c or b before d, etc and vice versa) just so long as group 1 (a, c) completes before any of group 2 (b, d) are processed.

The only additional catch would be that each node should be in the earliest group possible. Consider the following:
a -> b -> c
d -> e -> f
x -> y

A grouping scheme of (a, d) (b, e, x) (c, f, y) would technically be correct because x is before y, but a better solution would be (a, d, x) (b, e, y) (c, f), because having x in group 2 implies that x depends on some node in group 1.

Any ideas on how to go about doing this?

EDIT: I think I managed to slap together some solution code. Thanks to all those who helped!

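#include <iostream>
#include <vector>
using namespace std;

// Topological sort
// Accepts: 2D adjacency matrix where graph[i][j] is [0 = no edge; non-0 = edge from i to j]
// Returns: 1D array where each index is that node's group_id
vector<int> top_sort(const vector< vector<int> >& graph)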
{
    int size = graph.size();
    vector<int> group_ids = vector<int>(size, 0);
    vector<int> node_queue;

    // Find the root nodes, add them to the queue.
    for (int i = 0; i < size; i++)
    {
        bool is_root = true;

        for (int j = 0; j < size; j++)
        {
            if (graph[j][i] != 0) { is_root = false; break; }
        }

        if (is_root) { node_queue.push_back(i); }
    }

    // Detect error case and handle if needed.
    if (node_queue.size() == 0)
    {
        cerr << "ERROR: No root nodes found in graph." << endl;
        return vector<int>(size, -1);
    }


    // Depth first search, updating each node with its new depth.
    while (node_queue.size() > 0)
    {
        int cur_node = node_queue.back();
        node_queue.pop_back();

        // For each node connected to the current node...
        for (int i = 0; i < size; i++)
        {
            if (graph[cur_node][i] == 0) { continue; }

            // See if dependent node needs to be updated with a later group_id
            if (group_ids[cur_node] + 1 > group_ids[i])
            {
                group_ids[i] = group_ids[cur_node] + 1;
                node_queue.push_back(i);
            }
        }
    }

    return group_ids;
}

It sounds like you just want a "greedy" grouping. Find all nodes that can be in the first group. Then find all nodes that can be in the second group, etc., until no nodes are left unassigned. — aschepler, Nov 01 '10 at 21:20
Thanks for posting your solution, but could you better describe the expected input format? Maybe with an example? — mpen, Sep 23 '16 at 22:01
@mpen - The solution I posted expects a 2D vector adjacency matrix. I'd include the original input format but this question is almost 6 years old and I've since misplaced the original code. — Mr. Llama, Sep 24 '16 at 23:35
@Mr.Llama Oh..I think I get it. That's different than the format I've seen for other toposort implementations, but that'll do. — mpen, Sep 25 '16 at 16:33
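For anyone else wondering about the input format: below is a minimal sketch of how an adjacency matrix of the kind described above could be built from a list of "x must come before y" pairs. The helper name `build_adjacency_matrix` and the numbering of nodes as 0..n-1 are assumptions made for illustration; they are not part of the original post.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): turns a list of
// (from, to) index pairs into a square adjacency matrix where
// graph[from][to] != 0 means "from must come before to".
std::vector<std::vector<int>> build_adjacency_matrix(
    int node_count, const std::vector<std::pair<int, int>>& edges)
{
    std::vector<std::vector<int>> graph(node_count,
                                        std::vector<int>(node_count, 0));
    for (const auto& edge : edges)
    {
        graph[edge.first][edge.second] = 1;
    }
    return graph;
}
```

With a, b, c, d numbered 0 to 3, the constraints a -> b and c -> d from the question would be `build_adjacency_matrix(4, {{0, 1}, {2, 3}})`.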

smartnut007 · Accepted Answer · 2017-02-03T23:05:46.373

11

Label all root nodes with level 0. Label each child with its parent's level + 1. If a node is being revisited, i.e. it already has a level assigned, check whether the previously assigned value is lower than the new one. If so, update it to the higher value and propagate the change to its descendants.

Now you have as many groups as there are unique level labels 0 ... K.
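A rough sketch of the labeling described above, using a breadth-first worklist. The function name `label_levels` and the adjacency-list input (where `children[i]` lists the nodes that depend on node `i`) are assumptions for illustration. It also assumes the graph is acyclic: a cycle reachable from a root would make the loop run forever.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical sketch: returns level[i] for every node i.
// children[i] holds the indices of the nodes that depend on node i.
// Assumes the graph has no cycles.
std::vector<int> label_levels(const std::vector<std::vector<int>>& children)
{
    const std::size_t n = children.size();

    // A root is a node that never appears as anyone's child.
    std::vector<bool> has_parent(n, false);
    for (const auto& kids : children)
        for (int child : kids) has_parent[child] = true;

    std::vector<int> level(n, 0);
    std::queue<int> pending;
    for (std::size_t i = 0; i < n; ++i)
        if (!has_parent[i]) pending.push(static_cast<int>(i));

    while (!pending.empty())
    {
        const int node = pending.front();
        pending.pop();
        for (int child : children[node])
        {
            // Only raise a level, never lower it; a raised node is
            // re-queued so the change reaches its descendants.
            if (level[node] + 1 > level[child])
            {
                level[child] = level[node] + 1;
                pending.push(child);
            }
        }
    }
    return level;
}
```

Nodes that share a level value form one group, so for the question's a -> b -> c, d -> e -> f, x -> y example the nodes a, d and x all end up at level 0.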

edited Feb 03 '17 at 23:05

answered Nov 01 '10 at 21:30

smartnut007


So kinda like a breadth first search from each root node where a child is only processed if its value is updated to a larger number? (On a side note, would it matter if I did a depth first search instead?) – Mr. Llama Nov 01 '10 at 21:37
Which nodes are roots? Does "propagate them to the children" imply multiple passes over the same area of the graph? – Andy Thomas Nov 01 '10 at 21:42
I think breadth first search would be more appropriate here, but if you find DFS easier to implement and you are dealing with a small graph (a few hundred or a few thousand nodes at most), then either approach could be fine. – smartnut007 Nov 01 '10 at 21:43
Andy: It's "propagate" in a conceptual sense. A good implementation (BFS vs DFS) would greatly minimize going over the same area multiple times. A root is a node that has no dependencies. – smartnut007 Nov 01 '10 at 21:46
Yeah, I think by checking if (parent_val + 1 > child_val) you could consider the node a candidate to be processed. If child_val is already greater, then there was some alternate way of getting there that had a longer dependency chain. – Mr. Llama Nov 01 '10 at 21:49
I hope that answers your question? – smartnut007 Nov 01 '10 at 21:52
What would the worst-case time be? Say, if a chain were repeatedly processed by a succession of parent chains of increasing length? – Andy Thomas Nov 01 '10 at 22:05
I think in that case the worst case scenario would be considered O(n^2) even though the actual number would be considerably less than that. Literally it would be something close to O(x * y) where `x` is the number of increasing length chains leading into the body of node size `y`. The values for `x` and `y` would be (n - #) where # is different for each, but when you multiply it out you keep the largest term which ends up being n^2. Never been a fan of Big-O notation because it can be a bit misleading. – Mr. Llama Nov 01 '10 at 22:44
I am not sure how you arrived at O(n^2). – smartnut007 Nov 01 '10 at 22:48
I believe you can have the same complexity as BFS. O(|E| + |V|). – smartnut007 Nov 01 '10 at 22:54
@smartnut007: correct. The absolute worst case would be if you have `N*(N-1)/2` edges, and each edge would cause the associated label to be updated by precisely one. – MSalters Nov 02 '10 at 11:48
Ah yes, |E| is bounded by n^2, but that just makes it look quadratic :-) – smartnut007 Nov 02 '10 at 11:59

Followup question: using this algorithm, what should be done to check for the existence of cycles in the graph (without significantly worsening the time or memory complexity)? – Keavon Mar 17 '19 at 13:38

score -3 · Answer 2 · answered Feb 24 '11 at 05:53

I recently implemented this algorithm. I started with the approach you have shown, but it didn't scale to graphs of 20+ million nodes. The solution I ended up with is based on the approach detailed here.

You can think of it as computing the height of each node, and then the result is a group of each node at a given height.

Consider the graph:

A -> X

B -> X

X -> Y

X -> Z

So the desired output is (A,B), (X), (Y, Z)

The basic approach is to find everything with nothing using it (A and B in this example). All of these are at height 0.

Now remove A and B from the graph and find anything that now has nothing using it (X in this example). So X is at height 1.

Remove X from the graph and find anything that now has nothing using it (Y and Z in this example). So Y and Z are at height 2.

You can make an optimization by realizing the fact that you don't need to store bidirectional edges for everything or actually remove anything from your graph, you only need to know the number of things pointing to a node and the nodes you know are at the next height.

So for this example at the start:

0 things use A
0 things use B
2 things use X (A and B)
1 thing uses Y and 1 thing uses Z (X in both cases)

When you visit a node, decrement the count of each node it points to; if a count reaches zero, you know that node is at the next height.
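Here is a minimal sketch of that counting approach, which is essentially Kahn's algorithm processed one height at a time. The function name `group_by_height` and the adjacency-list input (`points_to[i]` lists the nodes that node `i` points to) are assumptions for illustration, not the code I actually used.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: returns the groups in order of height.
// points_to[i] holds the indices of the nodes that node i points to.
std::vector<std::vector<int>> group_by_height(
    const std::vector<std::vector<int>>& points_to)
{
    const std::size_t n = points_to.size();

    // used_by[i] = number of edges pointing at node i.
    std::vector<int> used_by(n, 0);
    for (const auto& targets : points_to)
        for (int target : targets) ++used_by[target];

    // Height 0: everything with nothing using it.
    std::vector<int> current;
    for (std::size_t i = 0; i < n; ++i)
        if (used_by[i] == 0) current.push_back(static_cast<int>(i));

    std::vector<std::vector<int>> groups;
    while (!current.empty())
    {
        std::vector<int> next;
        for (int node : current)
            for (int target : points_to[node])
                // A count reaching zero puts the target at the next height.
                if (--used_by[target] == 0) next.push_back(target);
        groups.push_back(current);
        current = next;
    }
    return groups;
}
```

Each edge is looked at once when counting and once when decrementing, so the work is proportional to the number of nodes plus edges. As a side effect, nodes on a cycle never reach a count of zero, so if the groups contain fewer than n nodes in total, the graph has a cycle.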
