I have data in the following format:

ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
             .....100+ lines

I want to filter this data based on col[5] and col[2] (counting columns from 0). For each value of col[5], if only one of OE1 or OE2 appears in col[2], every row with that col[5] value should be discarded. If both OE1 and OE2 are present for that col[5] value, all of its rows should be kept.
The desired data after filtering:

ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14

I have tried building search strings like this:

for item in stored_list:
    search_str_a = 'OE1'+item[3]+item[4]+item[5]
    search_str_b = 'OE2'+item[3]+item[4]+item[5]
    target_str = item[2]+item[3]+item[4]+item[5]

This helps keep the rest of the columns identical while searching for OE1 or OE2, but it does not help me filter out the rows when one of them (or both of them) is missing.

Any ideas would be appreciated.

MattDMo
diffracteD
  • So did you want to keep `col[5]=14`? – Mazdak Aug 09 '15 at 05:35
  • for a single value of `col[5]` we need to search `col[3]` and after making decisions regarding keeping data or discarding(based on the presence of both `OE1` and `OE2`), we iterate the value of `col[5]` and keep searching same way @Avinash Raj This is my so far idea. – diffracteD Aug 09 '15 at 05:40
  • You should mention that you're parsing PDB files, as there *are* bioinformaticians out there. In this case, check out Biopython's [`Bio.PDB`](http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ) module. I don't have the slightest clue what information you're trying to retrieve here, as your question is rather unclear, but reading through the linked FAQ will help you greatly. – MattDMo Aug 09 '15 at 05:40
  • @Kasramvd yes, as it has both `OE1` and `OE2`, as per desired output. – diffracteD Aug 09 '15 at 05:42
  • What I got from your question is that you want to keep lines with a special `col[5]` and unique `col[3]` right? – Mazdak Aug 09 '15 at 05:44
  • @Kasramvd Precisely. If both `OE1` and `OE2` belong to a single `value` of `col[5]`, then the lines are kept. If either one of them (`OE1` or `OE2`) is missing for that value, then we discard the lines. – diffracteD Aug 09 '15 at 05:49

2 Answers

The code below needs pandas, which you can install by following http://pandas.pydata.org/pandas-docs/stable/install.html

import pandas as pd

file_read_path = "give here source file path"
col_names = ["col0", "col1", "col2", "col3", "col4", "col5"]
df = pd.read_csv(file_read_path, sep=" ", names=col_names)

# For each col5 value, join all of its col2 entries into one string
group_series = df.groupby("col5")["col2"].apply(lambda x: ', '.join(x))

# Collect the col5 values whose group contains both OE1 and OE2
filtered_list = []
for index in group_series.index:
    str_col2_group = group_series[index]
    if "OE1" in str_col2_group and "OE2" in str_col2_group:
        filtered_list.append(index)

# Keep only the rows that belong to those col5 values
df = df[df.col5.isin(filtered_list)]
output_file_path = "give here output file path"
df.to_csv(output_file_path, sep=" ", index=False, header=False)

The pandas tutorials may also be helpful: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

Output:

ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
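
The same filter can probably be written more compactly with `groupby(...).filter(...)`. This is only a sketch: the helper name `keep_complete_groups` and the inline `sample` string are made up for illustration, and it assumes every line has exactly six fields separated by single spaces, as in the question.

```python
import io

import pandas as pd

# Sample rows copied from the question (assumed: six single-space-separated fields)
sample = """\
ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
"""

COL_NAMES = ["col0", "col1", "col2", "col3", "col4", "col5"]


def keep_complete_groups(df, required=("OE1", "OE2")):
    """Hypothetical helper: keep the col5 groups whose col2 holds every required name."""
    required = set(required)
    # filter() drops whole groups for which the predicate returns False
    return df.groupby("col5").filter(lambda g: required.issubset(g["col2"]))


df = pd.read_csv(io.StringIO(sample), sep=" ", names=COL_NAMES)
filtered = keep_complete_groups(df)
print(filtered.to_csv(sep=" ", index=False, header=False))
```

Comparing against a set of exact names, rather than searching a joined string, avoids accidental substring matches, and the surviving rows keep their original order.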
Romil Shah

Using the csv module, which ships with Python:

import csv
import operator

fieldnames = ["col0", "col1", "col2", "col3", "col4", "col5"]

file_read_path = "give here source file path"
with open(file_read_path) as f_pdb:
    rdr = csv.DictReader(f_pdb, delimiter=' ', fieldnames=fieldnames)
    # Sort so that rows sharing a col5 value end up next to each other
    sorted_bio = sorted(rdr, key=operator.itemgetter('col5'))
    col5_tmp = None
    tmp_list = []    # rows of the col5 group currently being read
    perm_list = []   # rows of every group that has both OE1 and OE2
    tmp_str = ""     # comma-joined col2 values of the current group
    for row in sorted_bio:
        col5_v = row["col5"]
        if col5_v != col5_tmp:
            # A new group starts: keep the previous one if it had both atoms
            if "OE1" in tmp_str and "OE2" in tmp_str:
                perm_list.extend(tmp_list)
            tmp_list = []
            tmp_str = ""
            col5_tmp = col5_v
        tmp_list.append(row)
        tmp_str = tmp_str + "," + row["col2"]

    # Check the last group too; the loop only checks a group when the next one begins
    if "OE1" in tmp_str and "OE2" in tmp_str:
        perm_list.extend(tmp_list)

# newline="" is what the csv docs recommend for output files on Python 3
with open("give here output file path", "w", newline="") as csv_file:
    dict_writer = csv.DictWriter(csv_file, delimiter=' ', fieldnames=fieldnames)
    dict_writer.writerows(perm_list)
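
If sorting is not wanted, a two-pass version with a dictionary of sets may be simpler. This is a rough sketch, not a tested PDB parser: `filter_by_group` and the `sample` string are hypothetical names, and it assumes each non-empty line splits on whitespace into at least six fields, with the atom name in field 2 and the group number in field 5 as described in the question.

```python
from collections import defaultdict

# Sample rows copied from the question
sample = """\
ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
"""


def filter_by_group(lines, required=("OE1", "OE2")):
    """Hypothetical helper: keep rows whose col[5] group contains every required col[2] name."""
    rows = [line.split() for line in lines if line.strip()]

    # First pass: record which col[2] names occur for each col[5] value
    names_by_group = defaultdict(set)
    for row in rows:
        names_by_group[row[5]].add(row[2])

    # Second pass: keep rows of complete groups, in their original order
    required = set(required)
    return [row for row in rows if required <= names_by_group[row[5]]]


kept = filter_by_group(sample.splitlines())
for row in kept:
    print(" ".join(row))
```

Grouping on `row[5]` alone follows the question as written. If the same number can occur under different values of col[3] or col[4], the key could be widened to a tuple such as `(row[3], row[4], row[5])`.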
Romil Shah