1

TLDR: How do you use node_match attributes to get NetworkX to recognise C+ and C atoms as different?

Here is an example of a pair of molecules I have calculated GED for. enter image description here

I got a value of 0 for the GED using the following code:

import networkx as nx

def get_graph(mol): 
  atoms = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
  am = Chem.GetAdjacencyMatrix(mol,useBO=True)
  for i,atom in enumerate(atoms):
    am[i,i] = atom
  G = nx.from_numpy_matrix(am)
  return G

G1 = get_graph(mol1)
G2 = get_graph(mol2)

GED= nx.graph_edit_distance(G1, G2, edge_match=lambda a,b: a['weight'] == b['weight'])

print(GED) 

So my understanding of the edge_match=lambda in this case is that it is being used to distinguish between single bonds and double bonds, is this correct? I believe this to be the case because when I run the code for propene and propane it gives a GED of 1, which to me would signify the change of the edge (double bond to single bond). However, I believe that the reason that this code gives a GED of 0 for these two molecules is because it is considering the C+ and C atoms to be the same? Therefore considering the two structures as identical. How would I encode for the graph structure to recognise the C+ and C as different? I have been reading the NetworkX documentation for atom_match attributes but I really don't understand how I can use this to do what I want to do. If this isn't the solution then would I have to encode the Hydrogen numbers somehow?

(Side note: When using the same code for the same structures but with B in place of C, it gives a GED of 2, which I believe is because the B is set as BH where C is just C+. Picture of molecules below) enter image description here

BanAckerman
  • 103
  • 1
  • 8

1 Answers1

1

The reason networkx considers both C+ and C atoms as the same is because you are feeding atomic numbers (which doesn’t change irrespective of the charge) to the adjacency matrix in this line:

am[i,i] = atom

There are two methods to make networkx differentiate between C+ and C.

Method 1: (Naive Method)

Adding Hydrogens to all the carbon atoms: This method takes way too long to output the GED (more than 40 mins). So I do not think this method is efficient.

Method 2: (Hacky Method)

In this method, we capture information regarding the formal charge on the atoms, and on id of the C+ atom we add the formal charge to the atomic number and feed the sum to the adjacency matrix, in essence making networkx differentiate between C and C+ atoms since they now have different "atomic numbers" in the adjacency matrix. This method outputs the expected GED of 2.0.

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG=True  
import networkx as nx

smiles_1 = 'CC(C)(C)[C+](C=C)(C=C)'
smiles_2 = 'CC(C)(C)C(C=C)(C=[C+])'

mol1 = Chem.MolFromSmiles(smiles_1)
mol2 = Chem.MolFromSmiles(smiles_2)

def get_graph(mol): 
  atomic_nums = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
  formal_charges = [atom.GetFormalCharge() for atom in mol.GetAtoms()]
  ad_matrix = Chem.GetAdjacencyMatrix(mol,useBO=True)
  for i,(a_num,f_c) in enumerate(zip(atomic_nums, formal_charges)):
    if f_c !=0:
      ad_matrix[i,i] = a_num + f_c
    else:
      ad_matrix[i,i] = a_num
  G = nx.from_numpy_array(ad_matrix)
  return G

G1 = get_graph(mol1)
G2 = get_graph(mol2)

GED= nx.graph_edit_distance(G1, G2, edge_match=lambda a,b: a['weight'] == b['weight'])

print((GED) )
# >> Outputs 2.0

Also note: from_numpy_matrix is deprecated in the latest version of networkx (version 3.0). So I have used from_numpy_array instead.

Vandan Revanur
  • 459
  • 6
  • 17