Graph Algorithm / Disjoint Set

Question

I am trying to solve this problem, but I am unable to make my solution fast enough.

In short: we have a directed graph, and we want to find the node (chosen from a given set of candidate nodes) from which we can visit the most nodes. A straightforward implementation is to run a DFS/BFS from every node and count how many nodes it visits. But that is too slow, as there are over 5000 nodes in the graph. Running 5000 BFS/DFS traversals takes a very long time.
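
For reference, the straightforward approach described above can be sketched as follows. This is only a minimal baseline, not a fix: the names `count_reachable` and `best_start_naive` and the small example graph are made up for illustration, and it assumes that the start node counts as visited, which the problem statement may define differently.

```python
from collections import deque

def count_reachable(adj, start):
    """Count the nodes reachable from start (start included) with one BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

def best_start_naive(adj, candidates):
    """One BFS per candidate, so O(k * (n + m)) for k candidates."""
    return max(candidates, key=lambda s: count_reachable(adj, s))

# Hypothetical graph: 0 -> 1 -> 2, 0 -> 3, 4 -> 3
adj = [[1, 3], [2], [], [], [3]]
best = best_start_naive(adj, [1, 4, 0])
```

Every traversal starts from scratch here, so work done for one candidate is never reused for another, which is where the running time goes.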

On the other hand, I also get the feeling that this problem may have something to do with the Disjoint Set data structure. But I am unable to formulate it that way, because my disjoint set implementation cannot express some of the rules mentioned in the problem.

Can someone give a hint as to how to approach this problem?

The problem you describe here is a lot more general than the problem you linked to. — Niklas B., Mar 17 '14 at 17:58
@Niklas - Sorry I didn't get it ! The explanation that I gave is just what I came up with to solve it but it's giving TLE straightaway. Do you think Amit's description is correct? — Varun Sharma, Mar 17 '14 at 18:14
@VVV: amit's answer assumes a general graph, while the graph in the problem is a DAG. Also, amit's algorithm has a straightforward *O(k(n + m))* implementation which you said was too slow. The problem doesn't say how large *k* can get, which is pretty bad. — Niklas B., Mar 17 '14 at 18:27
Thanks Niklas, can you elaborate on Amit's step 3? I have written my understanding below the answer. Also, I did manage to implement the Dominating Set algorithm for the social advertising problem, even though it took me 11 attempts and 5 hours :-). — Varun Sharma, Mar 17 '14 at 18:32
@VVV: Good job. I have only one idea with persistent binary search trees and union-by-rank, but I'm not sure if that is the simplest solution — Niklas B., Mar 17 '14 at 18:37
The idea is described [here](http://codeforces.ru/blog/entry/10696?locale=en). The discussion is about trees, but kingofnumbers' argument still holds. You just need persistency, which complicates matters — Niklas B., Mar 17 '14 at 18:40
@VVV: Can you point me to somewhere where I can submit a solution for this problem? — Niklas B., Mar 17 '14 at 19:17
Yes wait a min - I will give you the link for social advertising and Influence (both):-) — Varun Sharma, Mar 17 '14 at 19:18
I don't care about social advertising, but this one looks tricky, I wonder how often it got solved during the contest — Niklas B., Mar 17 '14 at 19:18
https://icpcarchive.ecs.baylor.edu/index.php?option=com_onlinejudge&Itemid=8&category=623&page=show_problem&problem=4461 (there is a submit button at the top right but you need to register) — Varun Sharma, Mar 17 '14 at 19:20
https://icpcarchive.ecs.baylor.edu/index.php?option=com_onlinejudge&Itemid=8&category=613&page=show_problem&problem=4443 (Influence) — Varun Sharma, Mar 17 '14 at 19:20
Around 15 out of 80 teams solved this problem during the contest. Here is the scoreboard - http://acm.ro/results.htm and here is the list of problems - http://acm.ro/problems.htm — Varun Sharma, Mar 17 '14 at 19:23
@VVV Thanks pal, seems like it was not too hard after all, I will try some of the naive approaches to check this out. Maybe the test data is not really good — Niklas B., Mar 17 '14 at 19:25
@niklas - man you are genius ! After giving you the links I walked to my office (20 mins walking) and I see you got AC in 0.116 (rank 2). Crazy ! So you used SCC/Top Sort approach? — Varun Sharma, Mar 17 '14 at 19:58
Thanks for giving the idea of c++ bitset ! It helped me in social advertising ! Before I always used to make graphs using vector > etc. — Varun Sharma, Mar 17 '14 at 19:59
I also use an array of `vector` for adjacency lists. But to merge the reachability sets, a bitset is useful — Niklas B., Mar 17 '14 at 19:59
@Niklas - Sorry DFS from every node? but that's too slow ? or we do DFS from every node till every node is visited and also we keep the last node removed from the stack somewhere from each DFS ! Later on - we run DFS from there ? Is it something like that? — Varun Sharma, Mar 17 '14 at 20:00
I just use the DFS to get an implicit topological sort. I want to compute for every node *x* its set of reachable nodes. This set is the union of the reachable sets of its "children" nodes, so those need to be solved first. If you want I can paste the code somewhere, it's really simple — Niklas B., Mar 17 '14 at 20:02
Oh OKay - I think I get what you say ! Thanks buddy I will try this problem after work tonight. Lets see how it goes. Thanks for your time. Normally people just give the idea. On the other hand you got AC as well during the discussion of the problem. Unbelievable :-) — Varun Sharma, Mar 17 '14 at 20:04
Ya you can paste the code because I need to learn how to implement it. Thanks. — Varun Sharma, Mar 17 '14 at 20:04
Most people on Stack Overflow are not really interested/active in competitive. I am, so it's good practice for me to code stuff anyways. Here's the code: http://pastie.org/8939175 hope you can learn something from it, and please don't adapt my horrific contest code style for anything but contest programming ;) — Niklas B., Mar 17 '14 at 20:06
I am also learning it now (finished university 4 years ago). I believe the fundamental algorithms will always remain the same. Tarjan's algorithm was invented in 1972 and even after 42 years people are using it. I doubt if any framework/JavaScript libraries used today will still be used after 40 years. That's why I am investing a lot of time in it now even though my work involves web development (.net, MVC etc.) — Varun Sharma, Mar 17 '14 at 20:09
Thanks buddy ! You gave so much of your time to this whole Stack Overflow question. Most of the time people are busy thinking about whether it's a valid question / or why I am using words like hi, thank you / or whether they should downvote it etc. — Varun Sharma, Mar 17 '14 at 20:12
Don't get me wrong. It's not a good question, so don't feel encouraged to not try better the next time. — Niklas B., Mar 17 '14 at 20:13
Ya at work - we have our unusual design guidelines etc. so need to use that - descriptive variable names, methods for everything, interfaces in separate file etc. etc. — Varun Sharma, Mar 17 '14 at 20:13
Oh okay - sure I will try to get it done though. And then I am also solving graph questions from UVA online judge and this website. It is a very long journey :-) — Varun Sharma, Mar 17 '14 at 20:16

amit · Accepted Answer · 2014-03-17T18:46:59.780

3

1. Find the Strongly Connected Components (SCCs) using Tarjan's algorithm (O(V+E)), and create the SCC graph.
2. Topologically sort the resulting SCC graph (it is a DAG).
3. From last to first, find the number of nodes reachable from each component.
4. Choose a node which is in an SCC that can reach the maximal number of nodes.
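
The steps above can be sketched in Python roughly as follows. This is a hedged illustration, not the answerer's code: the function names and the example graph are hypothetical, Tarjan's algorithm is written recursively for brevity (a very deep graph may need an iterative version), and step 3 is done with explicit sets stored as integer bitmasks. It relies on the fact that Tarjan's algorithm completes SCCs in reverse topological order, so no separate sort is needed.

```python
import sys

def tarjan_scc(adj):
    """Return (comp, count), where comp[v] is the SCC id of node v.
    Ids are assigned in reverse topological order of the SCC graph,
    so every edge between two SCCs goes from a higher id to a lower id."""
    n = len(adj)
    sys.setrecursionlimit(max(sys.getrecursionlimit(), 2 * n + 100))
    index = [-1] * n
    low = [0] * n
    comp = [-1] * n
    stack = []
    counter = 0
    count = 0

    def visit(u):
        nonlocal counter, count
        index[u] = low[u] = counter
        counter += 1
        stack.append(u)
        for v in adj[u]:
            if index[v] == -1:
                visit(v)
                low[u] = min(low[u], low[v])
            elif comp[v] == -1:          # v is still on the stack
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:           # u is the root of an SCC
            while True:
                w = stack.pop()
                comp[w] = count
                if w == u:
                    break
            count += 1

    for s in range(n):
        if index[s] == -1:
            visit(s)
    return comp, count

def reach_counts(adj):
    """For every node, the number of nodes reachable from it (itself included)."""
    comp, count = tarjan_scc(adj)
    reach = [0] * count                  # reach[c]: bitmask of nodes reachable from SCC c
    succ = [set() for _ in range(count)]
    for u, outs in enumerate(adj):
        reach[comp[u]] |= 1 << u         # an SCC reaches all of its own nodes
        for v in outs:
            if comp[v] != comp[u]:
                succ[comp[u]].add(comp[v])
    for c in range(count):               # sinks first: successors have smaller ids
        for d in succ[c]:
            reach[c] |= reach[d]
    return [bin(reach[comp[u]]).count("1") for u in range(len(adj))]

# Hypothetical graph: cycle 0 -> 1 -> 2 -> 0, then 2 -> 3 -> 4 and 5 -> 3
adj = [[1], [2], [0, 3], [4], [], [3]]
counts = reach_counts(adj)               # [5, 5, 5, 2, 1, 3]
best = max([3, 5, 1], key=lambda v: counts[v])   # step 4 for candidates 3, 5, 1
```

With bitmasks each union costs time proportional to n divided by the machine word size, so the whole of step 3 is roughly O(n * m / w) for m edges in the SCC graph, which is usually fast enough for a few thousand nodes.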

Step 3 - elaboration:

(For clarification reasons I will denote a vertex in the original graph as 'node', and a vertex in the SCC graph as 'vertex').

In step 3 you want to find the number of nodes that are reachable from each vertex of your SCC graph. This can be done by explicitly finding this set, or by finding only the number of nodes:

Explicitly finding the set of nodes reachable from each vertex:
This is pretty much straightforward. Each vertex has an associated set of nodes, and you find the set for a vertex by taking the union of its own nodes with the sets of all vertices that the edges leaving the current vertex in the SCC graph lead to.

Using inclusion/exclusion to find the number of nodes reachable:
Inclusion/exclusion is a technique for counting the size of a union of sets when the sets may share elements. For example, if you have 2 sets, the size of their union is |A ∪ B| = |A| + |B| - |A ∩ B|.
For 3 sets A, B, C: |A ∪ B ∪ C| = |A| + |B| + |C| - |A ∩ B| - |A ∩ C| - |B ∩ C| + |A ∩ B ∩ C|
(and so on)
Using inclusion/exclusion, the sets are those of the previously processed vertices, and the intersections come from 2 different vertices that later lead to the same vertex.
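
To see why plain counts cannot simply be added, here is a tiny worked example on a hypothetical diamond-shaped SCC graph. The vertex names and variables are made up for illustration.

```python
# Hypothetical SCC graph (a diamond): a -> b, a -> c, b -> d, c -> d.
# Each vertex carries the set of nodes it reaches, itself included.
reach_d = {"d"}
reach_b = {"b"} | reach_d
reach_c = {"c"} | reach_d

# Explicit sets: the union removes the shared node d automatically.
reach_a = {"a"} | reach_b | reach_c

# Counts only: adding the children's counts counts d twice ...
naive = 1 + len(reach_b) + len(reach_c)
# ... so inclusion/exclusion subtracts the size of the intersection.
corrected = 1 + len(reach_b) + len(reach_c) - len(reach_b & reach_c)
```

Here `reach_a` has 4 elements, `naive` is 5 and `corrected` is 4. Note that this sketch takes the correction term from the explicit sets themselves; obtaining it from counts alone is the difficult part.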

edited Mar 17 '14 at 18:46

answered Mar 17 '14 at 10:45

amit


Here's another idea: instead of sorting and calculating the number of reachable vertices in reverse topological order, you could just keep track of the number of vertices in each SCC, then for each DAG in the DAG forest of SCCs sum up all those values and choose a vertex from the sink of the DAG with the highest number of vertices in it. Not asymptotically faster, but maybe a bit harder to get wrong. – G. Bach Mar 17 '14 at 14:47
@amit The graph is already a DAG. But isn't step 3 the hard part? You don't mention how that is actually done although it is the problem core – Niklas B. Mar 17 '14 at 17:53
@NiklasB. Inclusion/exclusion on #nodes or uniting sets, basically. This answer is meant to give guidelines to the OP and not solve the entire question, at least until more effort is shown, and then I will elaborate on a specific part. – amit Mar 17 '14 at 18:00
Thanks Amit - I know step 1 because I have implemented Kosaraju's algorithm before. And I know how to implement step 2 as well (DFS and keep adding list of vertices to a list as they are done (removed from stack or after the recursive DFS for loop)). Sorry but I didn't get step 3. Say I have 5 SCC at step 1, so now do I need to find how many nodes can I visit in the graph from 5th SCC, then 4th SCC? But what's the purpose of second step then? Can you please elaborate on it? Also I didn't get as to what you meant by inclusion/exclusion on number of nodes? – Varun Sharma Mar 17 '14 at 18:30
@amit: Sorry even after thinking about some more time, I can't see how inclusion/exclusion would help. If you have outdegree > log (n+m), this will take more time than a single DFS, or did I miss something? Uniting sets via merge-by-rank is probably the way to go, but it seems like we need persistent search trees for that – Niklas B. Mar 17 '14 at 18:31
@NiklasB. Inclusion/exclusion will be `O(m^2)`, and explicit sets will be O(n*m) to my understanding - per vertex, where `n` is the number of nodes in the original graph and `m` is the number of vertices in the SCC graph. The inclusion/exclusion needs to go over at most all links in the SCC graph from the current vertex to the end, which is O(m^2) in the worst case, and for each such link the computation done is O(1). However, note that this graph is likely to be much smaller than the original, and the worst case is unlikely to occur on average. (continued in next comment) – amit Mar 17 '14 at 18:56
Let's assume that a node that has k nodes after it has on average k/2 edges (can be modified for any other k). Assuming a uniform distribution, the mean distance from this node will also be k/2. It follows that each of these has k/4 edges. Continuing this way, you get k/2+k/4+...+1 < k, so the number of vertices you need to go over is O(m) according to this analysis, which in turn leads to O(m*n) total computation time. Much better than the O(n^3) provided by BFS from each node. – amit Mar 17 '14 at 18:57
@NiklasB. Note that in here `m<=n` - I did not use the usual notation of `m=|E|`. – amit Mar 17 '14 at 18:58
In the problem the graph is a DAG, so *m = n* with your notation. I have to admit I don't see how you are going to achieve O(n^2), since the only method that I know to apply inclusion/exclusion here is exponential to the outdegree of the node we are currently looking at (I would have to look at every subset of adjacent neighbors). Also the worst case *will* most definitely occur somewhere in the test cases – Niklas B. Mar 17 '14 at 19:00
Thanks Niklas and Amit ! I think I am getting what you are trying to say. Will try implementing it today and see how it goes. Will let you guys know if I get AC. Thanks a lot for your time. Actually I am learning graph algorithms, but for me right now the tricky part is reading a problem and realizing that it's SCC/Top Sort/Network flow etc., even though I know how to implement them individually/directly. However, if a problem involves just BFS, DFS or shortest path, I can identify the underlying problem easily. MST also gives me a hard time sometimes, even though I can implement Kruskal easily. – Varun Sharma Mar 17 '14 at 19:15
@NiklasB. You are half right. From examining it again, I do believe inclusion/exclusion will be less efficient than the set approach, but it won't be exponential in the outdegree. Instead you will have to count the number of edges in each level of the inclusion/exclusion that lead to each vertex, and calculate `sum(#edgesLeadingTo(v) * nodes[v])`. It will add a factor of `*m` to my understanding, and will indeed be less efficient than the sets approach, I believe. Regarding n=m, the answer is meant to answer the posted question, not the linked one... – amit Mar 17 '14 at 19:19
@amit I realized that. Okay then, there's not a lot more to be said about this – Niklas B. Mar 17 '14 at 19:20
Hi Amit, I got AC using Niklas approach (he himself got AC as well - Rank 2). Basically start DFS from the set of nodes (X) but the key idea he taught me was - "I want to compute for every node x its set of reachable nodes. This set is the union of the reachable sets of its "children" nodes, so those need to be solved first." He did that using BITSET. https://icpcarchive.ecs.baylor.edu/index.php?option=com_onlinejudge&Itemid=8&page=problem_stats&problemid=4443&category=613 – Varun Sharma Mar 18 '14 at 21:12
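
The approach quoted in the last comment can be sketched like this. It is a hedged reconstruction from the comments only, not Niklas B.'s actual code (which was in C++ with `bitset`): the function names and the example DAG are hypothetical, the graph is assumed to be a DAG as stated for the linked problem, and Python integers stand in for bitsets.

```python
import sys

def reachable_sets(adj, starts):
    """reach[x] is a bitmask of the nodes reachable from x (x included),
    filled in for every node visited from the given start nodes.
    Assumes adj describes a DAG; with cycles this recursion would not terminate."""
    n = len(adj)
    sys.setrecursionlimit(max(sys.getrecursionlimit(), n + 100))
    reach = [None] * n

    def dfs(x):
        if reach[x] is None:             # each node is solved only once
            mask = 1 << x
            for y in adj[x]:
                mask |= dfs(y)           # children are solved first
            reach[x] = mask
        return reach[x]

    for s in starts:
        dfs(s)
    return reach

def best_candidate(adj, candidates):
    """Candidate whose reachable set has the most bits set."""
    reach = reachable_sets(adj, candidates)
    return max(candidates, key=lambda x: bin(reach[x]).count("1"))

# Hypothetical DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 4 -> 3
adj = [[1, 2], [3], [3], [], [3]]
reach = reachable_sets(adj, [0, 4])
best = best_candidate(adj, [4, 1, 0])
```

Because each node's set is memoized, the DFS finishes a node only after all of its children, which is the implicit topological order mentioned in the comments.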
