How to find the min set of vertices in multiple potentially independent UNdirected Graphs that captures the largest total cost

Question

I have a large set of vertices/nodes that represent a set of graphs. Note that there could be many independent graphs within this complete set. The goal is to find the min number of vertices across all these graphs that correspond to the largest total sum of weights across all the edges captured by those selected vertices. I have the adjacency matrix in pandas and I am using networkx.

Below is a sample dataframe with the three columns where Number_Of_Trips is the weight. I could provide a weight of node = 10*trips in order to merge the two metrics together. I.e. maximizing # of Trips - 10*NumberOfNodes

    Number_Of_Trips dropoff_gh7 pickup_gh7
0   304 9tbqhsx 9tbqj4g
1   271 9tbqj4f 9tbqhsx
2   263 9tbqt4s 9tbqhsx
3   258 9tbqdye 9tbqdsr
4   256 9tbqhgh 9tbqjfv
5   236 9tbqhsw 9tbqj4g
6   233 9tbqt4g 9tbqv03
7   229 9tbqhsx 9tbqj4c
8   218 9tbqy3f 9tbqt4s
9   213 9tbq5v4 9tbqh41
10  210 9tbqhgh 9tbqhsw
11  192 9tbqhgh 9tbqje4
12  186 9tbqy3f 9tbqt4g
13  184 9tbqhgh 9tbqj4z
14  183 9tbqe3d 9tbqe9e
15  170 9tbq3xn 9tbq39w
16  167 9tbq5bw 9tbqht6
17  163 9tbqhsx 9tbqh0x
18  162 9tbqdk1 9tbq7p2
19  160 9tbqsch 9tbqt4s

x = nx.from_pandas_dataframe(df,"dropoff_gh7","pickup_gh7","Number_Of_Trips")
graphs = list(nx.connected_component_subgraphs(x))

You cite conflicting evaluation criteria; we need you to define the problem. We also need you to describe your research, and how the available algorithms don't fulfill your need. I would think that Dijkstra's algorithm could be readily adapted to your cause, provided your cost function is nicely behaved. — Prune, Oct 18 '17 at 19:03
Please include a sample data set and what you have tried so far, then we might be able to help. Read the help about how to create a [a minimal, complete, and verifiable example](https://stackoverflow.com/help/mcve). — joelostblom, Oct 18 '17 at 21:41
@JoelOstblom - i added clarification. ALso, to answer Prune's question, I cant seem to find any algorithm that does this inherently since its a subgraph that has the maximum total weights but least number of nodes. Djikstra takes in a specific source and destination and spanning tree's enforce a path that is through all the nodes. THese dont meet my need. Hence my question. — SriK, Oct 19 '17 at 07:36
@Prune please note above about the clarification to your issues. — SriK, Oct 19 '17 at 07:36
You have two objective functions, this isn't doable. What should be "Value" of a subgraph with 5 nodes and total 60 cost? is it better than one with 6 nodes and 70 cost? — Shihab Shahriar Khan, Oct 19 '17 at 15:09
I suggested a merge of the two objectives in the question - I can define the combination by maximizing # of Trips - 10*NumberOfNodes — SriK, Oct 19 '17 at 15:46
@SriK: Finally! Putting that into an equation is what we needed to complete the problem description. Now, simply apply Dijkstra's algorithm, using longest path instead of shortest as the criterion. — Prune, Oct 19 '17 at 16:30
@Prune thanks! But wait, what if I want the subgraph within the graph that has the maximum of the objective function? The path and subgraph might be quite different right ? Doesnt djikstra take a source and destination as well? I dont have that — SriK, Oct 19 '17 at 17:58
Yes, you can specify source and destination. If you want to simplify, make dummy source and destination nodes, with epsilon moves to link to the true ones. You given graph has 12 ultimate source nodes and 10 destinations. Six pairs of nodes are trivial (2 nodes and one edge), and any given source => destination pair has only one path. In short, this is not a difficult graph. — Prune, Oct 19 '17 at 18:27
@Prune the data i gave is obviously a trivial example and not the full dataset. Hence the question. I need an approach that somewhat scales. Also, there is a significant difference between a path and a subgraph. What you give me back is a path. What i need is a subgraph. Thanks for your input of course. — SriK, Oct 19 '17 at 20:54
Ah. I will back up and try again ... so the simple, greedy algorithm is to pick a starting point and perform a closure operation through all nodes with at least 10 trips. Afterward, we can look at joining any graphs where the "under 10" connection will provide a net profit. — Prune, Oct 19 '17 at 21:00
@Prune Do you have a link or example of this algorithm ? I am not familiar with it — SriK, Oct 20 '17 at 20:20
No, I don't have an example. It's simple closure (with a minimal definition of "connected"), with extension checking afterward. I've always coded this stuff myself. — Prune, Oct 20 '17 at 20:47

score 1 · Answer 1 · answered Oct 20 '17 at 21:13

Here's an outline of the logic.

Create a cluster structure. A cluster has member nodes, an internal value (total internal trips), and edges to other clusters.

Start with each node in an individual cluster. Put all of these clusters into a "not done" list. You're now going to iterate through that list, merging clusters where you find advantage in doing so. Pick the first cluster in the list.

Iterate: For every edge of that cluster, check the net value of merging the cluster at the other end of that edge: internal trips + edge trips - 10*cluster population (quantity of nodes).

Merge: Concatenate the member-node lists of the two clusters. Add their internal values and the value of the edge between them. Adjust for the node population (if you're not already doing that accounting elsewhere). Merge the lists of edges to other clusters. Remove the merged cluster from the "not done" list.

Continue with this "Kleene Closure" process until you have no more nodes to add profitably. Move this resulting cluster to the "done" list. Pick the next node in the "not done" list and repeat the iterate & merge loop until the "done" list is empty.

Now, move the entire "done" list back to the "not done" list and repeat the process until you complete a pass with no further merges.

Is that detailed enough for you to code the process?

Im certain your approach may work though I was not able to convert it to code as I didnt follow all of it. Nevertheless, I posted a solution that is not optimal but gives pretty decent results that was based on clique analysis. Thank you for your input though. Appreciate the guidance! If you have any code of your idea, would be great to see it too. — SriK, Oct 31 '17 at 17:31

score 0 · Accepted Answer · answered Oct 31 '17 at 17:28

Note that one caveat to the question is that you could have multiple independent subgraphs within the graph that could be the solution. The key intuition to this solution is that the most likely candidates for subgraphs are the vertices that share a lot of edges with each other. It turns out that this is precisely what is evaluated when looking at Cliques in a graph. Hence, this solution simply extracts all the cliques and then rank orders them by the total num of weights represented by the vertices in the clique - the number of vertices * cost of a vertice. This can be quickly prototyped using NetworkX.

G = nx.from_pandas_dataframe(df, "dropoff_gh7", "pickup_gh7", ['num_of_trips'])
# Find all the cliques in the graph (not only maximal but all sub cliques as well. Note that clique finding is NP complete so this may take a long time if your graph is > 100k of edges or more. For <100k edges, this took within 5 mins on a 16GB macbook pro 3GHz machine.
cliques = nx.find_cliques(G)
clique_trips = [np.array([c,G.subgraph(c).size(weight="num_of_trips")]) for c in cliques]
df_cliques = pd.DataFrame(clique_trips,columns=["vertices","num_of_trips"])
df_cliques["num_vertices"] = df_cliques.apply(lambda x:len(x[0]), axis=1)
df_cliques["weighted_trips"] = df_cliques.apply(lambda row: 
    row["num_of_trips"] - row["num_vertices"]*COST_PER_NODE, axis=1)
df_cliques = df_cliques.sort_values("weighted_trips")[::-1]
df_cliques.head()
# The top N cliques can then be aggregated into a set to identify the precise vertices that are most valuable.

How to find the min set of vertices in multiple potentially independent UNdirected Graphs that captures the largest total cost

2 Answers2