I have a graph containing many genes and their interactions. I want to find the subgraph that contains the maximum number of genes from a specific set (say A, B, C, D, E).

I tried a BFS algorithm and also connected components, but I do not know how to find the subgraph containing my genes of interest.

import networkx as nx

def bfs(G, gene, n):
    """
    Using breadth-first search, return the BFS tree of depth
    at most n rooted at the source gene.
    """
    S = nx.bfs_tree(G, source=gene, depth_limit=n)
    return S

Given a graph G(V, E) with vertex set V and edge set E, I want to find a subgraph G'(v, e), where v is a subset of V, such that G' contains the maximum number of my nodes of interest.

yathrakaaran
  • Based on the rest of your question, I assume that the final subgraph `G'` needs to be connected? – Joel Aug 02 '19 at 01:42
  • @Joel, yes you are right. – yathrakaaran Aug 02 '19 at 02:18
  • Please note, I've updated my answer with a better version (and corrected an error at the very end where it found a set of nodes rather than the subgraph consisting of that set of nodes). – Joel Aug 02 '19 at 04:51
  • As an additional comment: I'm not sure that this is necessarily the best approach to study these. Gene interaction networks are often noisy and are likely to be missing links. So once you find the component given your data, it may well be that a link we learn about next year changes your answer. You may want to think about measures that are less sensitive to edge/node addition/deletion. – Joel Aug 03 '19 at 11:54
  • Thank you, Joel, for your answer and comments. I was working with it and realized that what you said is absolutely true. I am very new to graphs, and the solution you gave generates a huge graph with about 2000 nodes and 200,000 connections. I need to somehow reduce it and visualize it to make better sense of it. – yathrakaaran Aug 03 '19 at 14:47
  • What I have is a directed graph, so I was converting it to an undirected graph to use the node_connected_component function. Is there another way to select subgraphs containing the interesting nodes (genes) from a directed graph? Or else, are there any ways to reduce the graph size? – yathrakaaran Aug 03 '19 at 14:53

1 Answer

Edit: While I think my original code (now at the bottom) was good, I think you can do better using node_connected_component(G, u), which returns the set of nodes in the same component as u.

Let me explain the code below. First, I'm going to step through the interesting genes. With the first one, I look for the component of G in which it sits, and then find all of the other interesting genes that are in the same component. Then when I look at each subsequent interesting gene, I make sure I haven't already encountered it in the same component as another. If it's a new component, I find all of the other interesting genes in the same component. At the end, I find the component that had the most interesting genes.

component_count = {}  # maps an interesting gene to the number of interesting genes in its component
seen = set()  # interesting genes whose component has already been counted
for source in interesting_genes:
    if source not in seen:
        reachable_nodes = nx.node_connected_component(G, source)
        reachable_nodes_of_interest = [target for target in interesting_genes if target in reachable_nodes]
        seen.update(reachable_nodes_of_interest)
        component_count[source] = len(reachable_nodes_of_interest)

source = max(component_count, key=component_count.get) #finds the node with the largest component_count


Gprime = G.subgraph(nx.node_connected_component(G, source))
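The steps described above can also be tried end to end on a toy graph. This is only a minimal sketch: the function name `largest_interest_component` and the example edges are hypothetical, and it loops over `nx.connected_components` rather than over the genes, which should give the same component on an undirected graph.

```python
import networkx as nx

def largest_interest_component(G, interesting_genes):
    """Return (subgraph, count) for the component holding the most genes of interest.

    Hypothetical helper for illustration; assumes G is undirected.
    """
    interesting = set(interesting_genes) & set(G.nodes)  # ignore genes absent from G
    best_nodes, best_count = set(), 0
    for component in nx.connected_components(G):  # each component is a set of nodes
        count = len(interesting & component)
        if count > best_count:
            best_nodes, best_count = component, count
    return G.subgraph(best_nodes).copy(), best_count

# Toy data (made up for this example): two components
G = nx.Graph()
G.add_edges_from([("A", "x"), ("x", "B"), ("B", "C")])  # contains A, B, C
G.add_edges_from([("D", "y"), ("y", "E")])              # contains D, E

Gprime, hits = largest_interest_component(G, ["A", "B", "C", "D", "E"])
print(sorted(Gprime.nodes), hits)
```

Note that the result is the whole component, so on a densely connected network it can still be very large.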

Original code: take a look at the documentation for single_source_shortest_path_length. It returns a dict keyed by every node reachable from the source.

seen = {source:False for source in interesting_genes}  #if we've found the component of a node, no need to recalculate it.
component_count = {}  #will count how many other interesting genes there are in a component
for source in interesting_genes:
    if not seen[source]:
        reachable_nodes_dict = nx.single_source_shortest_path_length(G, source)
        reachable_nodes_of_interest = [target for target in interesting_genes if target in reachable_nodes_dict]
        for target in reachable_nodes_of_interest:
            seen[target] = True
        component_count[source] = len(reachable_nodes_of_interest)
source = max(component_count, key=component_count.get) #finds the node with the largest component_count


Gprime = G.subgraph(nx.node_connected_component(G, source))
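Regarding the directed graph raised in the comments: `node_connected_component` is not implemented for directed graphs, but one possible option is `nx.weakly_connected_components`, which ignores edge direction when grouping nodes while leaving the graph itself directed. The sketch below assumes weak connectivity is an acceptable notion of "connected" for your data and that the graph is non-empty; the function name and the toy edges are hypothetical.

```python
import networkx as nx

def largest_weak_component_with_genes(D, interesting_genes):
    """Return the directed subgraph of the weakly connected component
    containing the most genes of interest (hypothetical helper)."""
    interesting = set(interesting_genes)
    # Pick the component (a set of nodes) that overlaps most with the genes of interest
    best = max(nx.weakly_connected_components(D),
               key=lambda comp: len(interesting & comp))
    return D.subgraph(best).copy()  # edge directions are preserved

# Toy directed data (made up for this example)
D = nx.DiGraph()
D.add_edges_from([("A", "x"), ("B", "x"), ("x", "C")])  # contains A, B, C
D.add_edges_from([("D", "E")])                          # contains D, E

Dprime = largest_weak_component_with_genes(D, ["A", "B", "C", "D", "E"])
print(sorted(Dprime.nodes), Dprime.is_directed())
```

This avoids the explicit conversion to an undirected graph, but it does not by itself shrink a large component.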
Joel