
I wrote a script to retrieve and process information from the Protein Data Bank. I import the MMCIF2Dict class from Bio.PDB.MMCIF2Dict, which parses CIF data into a dictionary. It works well for almost all the structures in my list but, for reasons I don't understand, it fails for some. For example, with the PDB ID 4asd, it returns the key instead of the value and the value instead of the key, as if the parser had swapped keys and values.

The only workaround I have found is to check whether the expected key exists in the dictionary generated by MMCIF2Dict. If it does not, I have to search for it among all the values of the dictionary.

import urllib.request
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

set the list of PDB IDs; here is an example with 4asd

pdb_list = ['4asd']
pdb = pdb_list[0]  # ID interpolated into the URL below

retrieve the data

cif_webpage = urllib.request.urlopen(f'https://files.rcsb.org/header/{pdb}.cif').read().decode('utf-8').split('\n')

create the dictionary

dico = MMCIF2Dict(cif_webpage)

What I expect:

dico['_entity_src_gen.pdbx_gene_src_scientific_name'] == 'HOMO SAPIENS'

What I have:

KeyError: '_entity_src_gen.pdbx_gene_src_scientific_name'

The expected key is not a key at all: it shows up as the value, and the expected value is now the key (I hope I haven't lost you):

dico['HOMO SAPIENS'] == '_entity_src_gen.pdbx_gene_src_scientific_name'

Thanks in advance for your help!

Fan
  • Can you provide a working accession for testing? Also, can you track "_audit_conform.dict_version" values for working and non-working files? – Ghoti Aug 28 '19 at 16:08
  • If you set pdb = '4kik', it returns the correct values for '_entity_src_gen.pdbx_gene_src_scientific_name', which for this example are ['Homo sapiens', 'Homo sapiens']. With pdb = '4kik', dico["_audit_conform.dict_version"] returns '5.279'. With the structure that doesn't work correctly, it returns '5.308'. – Fan Sep 02 '19 at 12:37

1 Answer

What you pass to MMCIF2Dict is a list of strings, while according to the docs the argument should be:

file - name of the PDB file OR an open filehandle

I downloaded header/4asd.cif and verified that it works fine if the argument is a file handle.
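
If you would rather not save the file to disk first, one option is to wrap the downloaded text in an in-memory handle. The following is only a minimal sketch: the helper name `parse_cif_text` and the tiny inline sample are made up for illustration, and it assumes that MMCIF2Dict accepts any file-like object (such as `io.StringIO`) wherever it accepts an open file handle.

```python
import io

from Bio.PDB.MMCIF2Dict import MMCIF2Dict


def parse_cif_text(cif_text):
    """Parse mmCIF text by giving MMCIF2Dict a file-like handle.

    Hypothetical helper: the text is wrapped in io.StringIO instead of
    being split into a list of lines.
    """
    handle = io.StringIO(cif_text)
    return MMCIF2Dict(handle)


# Tiny made-up mmCIF snippet, used here instead of a real download.
sample = (
    "data_demo\n"
    "_entity_src_gen.pdbx_gene_src_scientific_name 'HOMO SAPIENS'\n"
)

key = '_entity_src_gen.pdbx_gene_src_scientific_name'
dico = parse_cif_text(sample)
print(dico[key])
```

For a real entry, the decoded response from `urllib.request.urlopen(...)` can be passed as `cif_text` directly, without the `.split('\n')` step. Depending on the Biopython version, the value may come back as a string or as a list of strings.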


Alternatively, you could use gemmi to parse a CIF file (disclaimer: I'm working on this project):

from gemmi import cif
import urllib.request

pdb = '4asd'
with urllib.request.urlopen(f'https://files.rcsb.org/header/{pdb}.cif') as c:
    doc = cif.read_string(c.read())
category_dict = doc[0].get_mmcif_category('_entity_src_gen')
assert category_dict['pdbx_gene_src_scientific_name'] == ['HOMO SAPIENS']
marcin
  • http://mmcif.wwpdb.org/docs/sw-examples/python/html/index.html is another available parser, though it does not have much documentation – pippo1980 Nov 09 '20 at 12:45