edit While I think my original code (now at the bottom) was good, I think you can do better using node_connected_component(G,u)
which returns the set of nodes in the same component as u
.
Let me explain the code below. First, I'm going to step through the interesting genes. With the first one, I look for the component of G
in which it sits, and then find all of the other interesting genes that are in the same component. Then when I look at each subsequent interesting gene, I make sure I haven't already encountered it in the same component as another. If it's a new component, I find all of the other interesting genes in the same component. At the end, I find the component that had the most interesting genes.
component_count = {}
seen = {}
for source in interesting_genes:
if source not in seen:
reachable_nodes = nx.node_connected_component(G, source)
reachable_nodes_of_interest = [target for target in interesting_genes if target in reachable_nodes]
for node in reachable_nodes_of_interest:
seen[node] = True
component_count = len(reachable_nodes_of_interest)
source = max(component_count, key=component_count.get) #finds the node with the largest component_count
Gprime = G.subgraph(nx.node_connected_component(G, source))
Take a look at the documentation for single_source_shortest_path_length
.
seen = {source:False for source in interesting_genes} #if we've found the component of a node, no need to recalculate it.
component_count = {} #will count how many other interesting genes there are in a component
for source in interesting_genes:
if not seen[source]:
reachable_nodes_dict = nx.single_source_shortest_path_length(G, node)
reachable_nodes_of_interest = [target for target in interesting_genes if target in reachable_nodes_dict]
for target in reachable_nodes_of_interest:
seen[target] = True
component_count[source] = len(reachable_nodes_of_interest)
source = max(component_count, key=component_count.get) #finds the node with the largest component_count
Gprime = G.subgraph(nx.node_connected_component(G, source))