I am not a bioinformatician and my question may sound basic.

I am having trouble with RDKit: some of my antimicrobial peptide sequences contain an X, and RDKit does not seem to be able to process them. For example, with seq = ['HFXGTLVNLAKKIL', 'HFLGXLVNLAKKIL', 'HFLGTLVNXAKKIL', 'fPVXLfPXXL', 'SRWPSPGRPRPFPGRPKPIFRPRPXNXYAPPXPXDRW', ...], Chem.MolFromSequence(seq[i]) returns None for each of these sequences.

My question is: how do I deal with this kind of sequence?

S.EB

1 Answer

Let me explain why the output is None.

As you can see in this list of abbreviations for peptide sequences, the letter "X" stands for "unknown": the actual amino acid at that position could not be determined. RDKit therefore cannot create a Mol object from your data, because parts of the sequence are unknown.

RETURNS:

a Mol object, None on failure.

Source of quote above

RDKit's handling of this case is reasonable, so the real question, which you have to answer yourself, is: "How do I deal with unknown amino acids?" You need to preprocess those sequences, for example by replacing the "X" with something else or by removing the affected sequences from your dataframe entirely. Which option is appropriate depends on your use case.
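As a rough illustration of the two preprocessing options, here is a minimal sketch. The helper names `split_by_unknown` and `replace_unknown` are hypothetical, and the choice of glycine ("G") as a stand-in residue is an arbitrary assumption rather than a recommendation, since any substitution changes the chemistry of the peptide. The last sequence in the list is a made-up example without X, added only for contrast. The only RDKit call used is `Chem.MolFromSequence`, and it is skipped if RDKit is not installed.

```python
UNKNOWN = "X"  # one-letter code for an unknown amino acid


def split_by_unknown(seqs):
    """Separate sequences without X from those that contain X (hypothetical helper)."""
    known = [s for s in seqs if UNKNOWN not in s]
    unknown = [s for s in seqs if UNKNOWN in s]
    return known, unknown


def replace_unknown(seq, substitute="G"):
    """Replace every X with a stand-in residue (glycine is an arbitrary assumption)."""
    return seq.replace(UNKNOWN, substitute)


seqs = [
    "HFXGTLVNLAKKIL",
    "HFLGXLVNLAKKIL",
    "HFLGTLVNXAKKIL",
    "HFLGTLVNLAKKIL",  # made-up sequence without X, for contrast
]

# Option 1: drop every sequence that contains an unknown residue.
known, unknown = split_by_unknown(seqs)

# Option 2: keep them, but substitute the unknown residue.
substituted = [replace_unknown(s) for s in unknown]

try:
    from rdkit import Chem
except ImportError:  # RDKit is optional for this sketch
    Chem = None

if Chem is not None:
    for s in known + substituted:
        mol = Chem.MolFromSequence(s)
        # Report whether RDKit could build a molecule from the sequence.
        print(s, "parsed" if mol is not None else "failed")
```

Dropping the sequences is the safer choice when the identity of the residue matters for your downstream descriptors. Substituting keeps more data but introduces a residue that may not be there in reality.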

Tarquinius