0

I am trying to merge two text-format (PDB) files. One (bigger one) contains full set of data describing the protein, second one contains very small set of data changing just small part (set of coordinates).

Example:

Basic file (part):

ATOM    605  CD2 LEU A  92      11.727  14.051  55.011  1.00 75.51      4pxz C  
ATOM    606  N   ARG A  93      10.555  10.636  58.260  1.00 62.79      4pxz N  
ATOM    607  CA  ARG A  93      11.357   9.429  58.493  1.00 59.89      4pxz C  
ATOM    608  C   ARG A  93      10.429   8.207  58.562  1.00 62.83      4pxz C  
ATOM    609  O   ARG A  93      10.760   7.168  57.994  1.00 61.39      4pxz O  
ATOM    610  CB  ARG A  93      12.236   9.564  59.757  1.00 58.23      4pxz C  
ATOM    611  CG  ARG A  93      13.088   8.333  60.120  1.00 60.51      4pxz C  
ATOM    612  CD  ARG A  93      13.985   7.822  58.995  1.00 61.21      4pxz C  
ATOM    613  NE  ARG A  93      14.503   6.485  59.295  1.00 60.36      4pxz N  
ATOM    614  CZ  ARG A  93      15.012   5.642  58.400  1.00 66.21      4pxz C  
ATOM    615  NH1 ARG A  93      15.074   5.979  57.116  1.00 52.54      4pxz N  
ATOM    616  NH2 ARG A  93      15.455   4.453  58.780  1.00 48.93      4pxz N  
ATOM    617  N   THR A  94       9.247   8.357  59.192  1.00 60.68      4pxz N  
ATOM    618  CA  THR A  94       8.227   7.305  59.271  1.00 59.92      4pxz C

Auxillary file (with set of coordinates to be substituted):

ATOM     39  CA  ARG A  93      11.357   9.429  58.493  1.00 59.89      hatp C  
ATOM     40  CB  ARG A  93      12.236   9.564  59.757  1.00 58.23      hatp C  
ATOM     41  CG  ARG A  93      11.569   9.166  61.087  1.00 60.51      hatp C  
ATOM     42  CD  ARG A  93      12.319   8.102  61.886  1.00 61.21      hatp C  
ATOM     43  NE  ARG A  93      11.978   6.754  61.425  1.00 60.36      hatp N  
ATOM     44  CZ  ARG A  93      11.731   5.714  62.217  1.00 66.21      hatp C  
ATOM     45  NH2 ARG A  93      11.430   4.535  61.694  1.00 48.93      hatp N  
ATOM     46  NH1 ARG A  93      11.793   5.843  63.538  1.00 52.54      hatp N  

Expected outcome: -> Changed coordinates <-

ATOM    604  CD1 LEU A  92       9.685  13.033  54.000  1.00 73.10      4pxz C
ATOM    605  CD2 LEU A  92      11.727  14.051  55.011  1.00 75.51      4pxz C
ATOM    606  N   ARG A  93      10.555  10.636  58.260  1.00 62.79      4pxz N
ATOM    607  CA  ARG A  93   -> 11.357   9.429  58.493<- 1.00 59.89      4pxz C
ATOM    608  C   ARG A  93      10.429   8.207  58.562  1.00 62.83      4pxz C
ATOM    609  O   ARG A  93      10.760   7.168  57.994  1.00 61.39      4pxz O
ATOM    610  CB  ARG A  93   -> 12.236   9.564  59.757<- 1.00 58.23      4pxz C
ATOM    611  CG  ARG A  93   -> 11.569   9.166  61.087<- 1.00 60.51      4pxz C
ATOM    612  CD  ARG A  93   -> 12.319   8.102  61.886<- 1.00 61.21      4pxz C
ATOM    613  NE  ARG A  93   -> 11.978   6.754  61.425<- 1.00 60.36      4pxz N
ATOM    614  CZ  ARG A  93   -> 11.731   5.714  62.217<- 1.00 66.21      4pxz C
ATOM    615  NH1 ARG A  93   -> 11.793   5.843  63.538<- 1.00 52.54      4pxz N
ATOM    616  NH2 ARG A  93   -> 11.430   4.535  61.694<- 1.00 48.93      4pxz N
ATOM    617  N   THR A  94       9.247   8.357  59.192  1.00 60.68      4pxz N
ATOM    618  CA  THR A  94       8.227   7.305  59.271  1.00 59.92      4pxz C

I tried doing so by means of:

  • Building a lists and appending every line as single entry for both files

  • Extracting atom type, name of residue, chain and residue number from both files (eg. CD1 LEU A 92, respecitvely) and appending to another lists

  • Comparing extract lists

  • Writing a file containing mixed lists from point 1, accordingly to point 3.

Code:

import re

aminoacid_pattern = re.compile(r"\w.{2,3}.\b(\w[A-Z]\w*)\b\s.\s\d+")
coords_pattern = re.compile(r"\w.{2,3}.\b(\w[A-Z]\w*)\b\s.\s\d+")

class fileSaver:
    protein = "4pxzclean.pdb"
    flexres = "ARGA93.pdb.tmp"
    def __init__(self):
        pass

    def aminoacid_to_substitute(self, flexres, data = []):

        with open(flexres, 'r') as flex:
                for line in flex:
                    if aminoacid_pattern != None:
                        data.append(line)
        return data

    def parse_rigid(self, rigidprot, test = []):

        with open(rigidprot, 'r') as rigid:
            for line in rigid:
                if aminoacid_pattern != None:
                    test.append(line)
        return test

class fileComparer:
    def __init__(self):
        pass

    def compare_data(self, data_flex, data_rigid, cleanflex = [], cleanrigid = []):

        for el in data_flex:
            if aminoacid_pattern != None:
                cleanflex.append(re.findall(r".\w\s+\w.{2,3}\s\w\s*\d{2,3}",str(el)))


        for el in data_rigid:
            if aminoacid_pattern != None:
                cleanrigid.append(re.findall(r".\w\s+\w.{2,3}\s\w\s*\d{2,3}",str(el)))

        with open("test.txt", 'a+') as test:
            for rig_el in data_rigid:
                for flex_el in data_flex:
                    for rg_el in cleanrigid:
                        if rg_el not in cleanflex:
                            test.write(rig_el)
                        if rg_el in cleanflex:
                            test.write(flex_el)



if __name__ == '__main__':
    initialize = fileSaver()
    flex = initialize.aminoacid_to_substitute("ARGA93.pdb.tmp")
    rigid = initialize.parse_rigid("4pxzclean.pdb")
    comparer = fileComparer()
    comparer.compare_data(flex,rigid)

Unfortunately, it gives infinitely long file, without any changed line. Could you tell me where is a mistake?

smci
  • 32,567
  • 20
  • 113
  • 146
MadEye
  • 19
  • 3
  • general note: your input data seems to be some sort of csv, wouldn't it be easier to just use a csv parser that comes with python (/pandas) for file i/o? e.g. [csv package](https://docs.python.org/3/library/csv.html) or [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) – FObersteiner Oct 15 '19 at 09:07
  • 2
    @MrFuppes it's [PDB format](https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)). But I agree that it'd be easier to use a dedicated parser: from [BioPython](https://biopython.org/DIST/docs/api/Bio.PDB-module.html), [BioPandas](https://rasbt.github.io/biopandas/), [cctbx](https://cci.lbl.gov/cctbx_docs/iotbx/iotbx.pdb.html) or [gemmi](https://gemmi.readthedocs.io/en/latest/mol.html) (disclaimer: the last one is written by me). – marcin Oct 15 '19 at 09:20
  • @marcin: nice, I'm not too much into biochemistry, so please apologize my unfamiliarity with the format ;-) on topic: maybe another general thing regarding comparison performance: `regex` tends to be quite expensive. I'd suggest to e.g. have a look at [frozensets](https://docs.python.org/3/library/stdtypes.html#frozenset). These are pretty efficient for comparisons like `if ... in ...`. Or maybe one of the suggested lilbraries has something at hand? – FObersteiner Oct 15 '19 at 09:30
  • We can't reproduce without your data, also the question needs more focus. Or maybe it's obsoleted if you use the PDB readers @marcin suggested. If this is still an issue, please make your question reproducible (MCVE), also be more specific about what's not working. – smci May 31 '20 at 02:54

0 Answers0