The two molecules have different substructures assigned to bit 1 (see the image attached below).

If I have a large number of molecules, how can I force the same substructure to map to the same bit in all of them? I want to use the fingerprints for machine learning, so I need every fingerprint vector to encode the same substructure information at the same position.

Example: bit info of two molecules

Zihao Wang

1 Answer

What you are seeing is a bit collision, caused by your fingerprint being only three bits long. Three bits cannot hold all of the substructures generated by the Morgan algorithm, so the same bit ends up assigned to several different substructures. These molecules are quite small, so increasing the length of the bit vector to 12 resolves the case you show, since the fingerprint then has enough capacity. In normal use you would expect fingerprints to have >= 1024 bits, which balances vector size against enough capacity to keep bit collisions to a minimum.
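
To make the collision concrete, here is a minimal sketch in plain Python. It assumes the fingerprint is folded by taking each hashed substructure identifier modulo the vector length, which is a simplification of what a real implementation does. The identifiers for the two molecules are made up for illustration, and `fold` and `conflicting_bits` are hypothetical helper names, not library functions.

```python
from collections import defaultdict


def fold(identifiers, n_bits):
    """Map each substructure identifier to a bit position (identifier mod n_bits)."""
    bits = defaultdict(set)
    for ident in identifiers:
        bits[ident % n_bits].add(ident)
    return dict(bits)


def conflicting_bits(molecules, n_bits):
    """Return the bits that more than one distinct substructure maps to."""
    merged = defaultdict(set)
    for identifiers in molecules:
        for bit, ids in fold(identifiers, n_bits).items():
            merged[bit] |= ids
    return {bit: sorted(ids) for bit, ids in merged.items() if len(ids) > 1}


# Hypothetical substructure identifiers for two small molecules.
mol_a = [3, 4, 8]
mol_b = [3, 7, 8]

print(conflicting_bits([mol_a, mol_b], 3))   # {1: [4, 7]}
print(conflicting_bits([mol_a, mol_b], 12))  # {}
```

With three bits, substructures 4 and 7 both land on bit 1, so that bit means different things in the two molecules. With 12 bits each substructure gets its own position. For a large dataset the same check can be run over all molecules to see whether the chosen length leaves any shared bits.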

Oliver Scott