Thought I'd just update the start of this question for people who come across it in future. Regex was not the optimal solution for my particular problem, and trying to regex complicated, separated patterns (my logic from the start) in one go wasn't ideal.
The answer to the question as stated would be to try separate regexes I think, and 'filter out' the stuff needed.
My file could be worked on with the pandas.read_fwf() solution for optimal results, so I chose that as the full answer.
I'm sure this has been asked somewhere before but I can't find a question that is exactly trying to do what I want - so my apologies in advance.
TL;DR: How would you regex for several different patterns in a line that are not located next to each other, or properly delimited? Am I wrong to be trying to do this in one move?
I have some strings in a pretty verbose file (see end of post) that I want to pull out. I want multiple bits of information from different columns within a line (though they are not properly delimited).
I know I can get this into match.group(), which will be perfect (because I intend to use each element I pull out later in isolation), except I can't figure out how to match several substrings that are physically separated from one another in the string (unless trying to do this in one go is just wrong?).
I can extract the table part that I want with some simple regex no problem:
#!/usr/bin/python
import re
import sys

hhresult_file = sys.argv[1]  # The results file shown at the end of the post
# Will match the whole table line (my first shot at the problem)
regex = re.compile(r'\s*\d{1,2}\s\w{4}_\w\s.*')

def main():
    with open(hhresult_file, 'r') as result_fasta:
        for line in result_fasta:
            match = regex.search(line)
            if match:
                print(match.group())

if __name__ == '__main__':
    main()
But I'm also trying to pull out the columns which read "Hit", "Prob", "E-value" and "P-value".
I think I can synthesise the required regex for each individual field (there are some nuances, like the switch between exponent notation and plain floats, for example).
What I don't know how to do is 'disregard' regions of the string? Specifically, I can't get the 'Hit' (= 3izo_F) and then the 'Prob' field because of the hit description in the intervening space.
I was trying to go about it with grouped regexes, but without being physically adjacent it doesn't work (something like these, though there may be errors in them):
regex = re.compile(r'''
(\w{4}_\w) # Match the hit
(\d{1,3}\.\d) # Match the probability score
(\d\.?\d?|\d\.?\d?E-\d\d|\d\.\d*) # E value as float/E-
(\d\.?\d?|\d\.?\d?E-\d\d|\d\.\d*) # Match SI or float P value
(\d+\.\d+) # Match the score
''',re.VERBOSE)
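One way to 'disregard' a region is to consume it with a lazy `.*?` that sits between the capture groups but is not itself captured. The sketch below is only an illustration against the table rows as pasted in this post: `NUM`, `ROW_RE` and `parse_row` are hypothetical names, and it assumes every row has the four metrics back to back, separated by whitespace.

```python
import re

# A single pattern for plain floats (0.00011), integers with an exponent (3E-09)
# and floats with an exponent (2.7E-09).
NUM = r'\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?'

# Quantifier braces are doubled because this is an f-string.
ROW_RE = re.compile(rf'''
    ^\s*\d+\s+             # row number (the No column)
    (\w{{4}}_\w)\s+        # Hit ID, e.g. 3izo_F
    .*?\s                  # hit description: consumed lazily, never captured
    (\d{{1,3}}\.\d)\s+     # Prob
    ({NUM})\s+             # E-value
    ({NUM})\s+             # P-value
    ({NUM})\s              # Score
''', re.VERBOSE)

def parse_row(line):
    """Return (hit, prob, e_value, p_value, score) as strings, or None."""
    m = ROW_RE.match(line)
    return m.groups() if m else None

sample = ('1 3izo_F Fiber; pentameric pento 98.1 2.7E-09 7.3E-14 '
          '107.6 0.0 65 93-160 104-168 (581)')
print(parse_row(sample))
```

Because the lazy `.*?` only stops where all four numeric groups match in sequence, a stray number inside the description should not be mistaken for the Prob column, although that has only been checked against the sample rows shown here.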
The file in question:
Query PAU_03380 PAU_03380 hypothetical protein 3919442:3920968 reverse MW:51681
Match_columns 508
No_of_seqs 1 out of 1
Neff 1.0
Searched_HMMs 37488
Date Mon May 23 20:23:54 2016
Command hhsearch -cpu 10 -i /home/wms_joe/PVCs/PVC_operons/prot_all/PAU_03380.faa -d /home/wms_joe/Applications/HHSuite/databases/pdb70/pdb70_hhm.ffdata -B 5 -Z 5 -E 1E-03 -nocons -nopred -nodssp
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 3izo_F Fiber; pentameric pento 98.1 2.7E-09 7.3E-14 107.6 0.0 65 93-160 104-168 (581)
2 3izo_F Fiber; pentameric pento 97.6 1.3E-07 3.4E-12 95.6 0.0 156 156-317 210-388 (581)
3 1ocy_A Bacteriophage T4 short 97.6 1.8E-07 4.7E-12 80.4 0.0 85 323-418 10-122 (198)
4 1v1h_A Fibritin, fiber protein 96.1 0.00011 3E-09 60.4 0.0 30 167-198 2-31 (103)
5 1v1h_A Fibritin, fiber protein 95.9 0.00019 5.1E-09 59.1 0.0 10 168-177 41-50 (103)
6 1pdi_A Short tail fiber protei 95.6 0.00041 1.1E-08 63.3 0.0 26 323-348 90-116 (278)
7 2xgf_A Long tail fiber protein 94.1 0.005 1.3E-07 55.1 0.0 31 318-348 22-52 (242)
8 1h6w_A Bacteriophage T4 short 84.7 0.25 6.7E-06 47.1 0.0 27 323-349 255-282 (312)
9 1qiu_A Adenovirus fibre; fibre 79.9 0.54 1.4E-05 44.4 0.0 24 92-115 7-30 (264)
10 3s6x_A Outer capsid protein si 72.0 1.3 3.4E-05 43.6 0.0 69 106-191 44-112 (325)
No 1
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=98.13 E-value=2.7e-09 Score=107.58 Aligned_cols=65 Identities=22% Similarity=0.362 Sum_probs=42.7
Q PAU_03380 93 PLILKDDVLSVDLGSGLTNETNGICVGQGDGITVNTSNVAVKQGNGISVTSSGGVAVKVSANKGLSVD 160 (508)
||-+.++-|.++....|+...+++.+--+++++|+.....++....++++ .+++++++. .||.++
T 3izo_F 104 PLTVTSEALTVAAAAPLMVAGNTLTMQSQAPLTVHDSKLSIATQGPLTVS-EGKLALQTS--GPLTTT 168 (581)
Confidence 55555556666666667777777777777777777776777777777764 566666554 355554
No 2
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=97.60 E-value=1.3e-07 Score=95.57 Aligned_cols=156 Identities=19% Similarity=0.323 Sum_probs=85.6
Q PAU_03380 156 GLSVDSSGVAVKVNTDKGISVDGNGVAVKVNTSKGISVDNTGVAVIANASKGISVDGSGV--------------AVIANT 221 (508)
.|.+..++-.+.+++..|+.|.++.+.+|+ ..++.+++.|- +-.+...|+.++...- .+..+.
T 3izo_F 210 PLHVTDDLNTLTVATGPGVTINNTSLQTKV--TGALGFDSQGN-MQLNVAGGLRIDSQNRRLILDVSYPFDAQNQLNLRL 286 (581)
Confidence 344544434556666667777666655443 23333333221 1111222333332211 234445
It goes on a bit but is just more of the above 2 alignments.
UPDATE 1
Just to provide an example of what I'd ideally like at the end:
Given the line in the 'short table':
1 3izo_F Fiber; pentameric pento 98.1 2.7E-09 7.3E-14 107.6 0.0 65 93-160 104-168 (581)
I'd like to get either a delimited string, or a separate match.group() for:
The PDB Hit ID == 3izo_F
Each of the first 4 metrics (as separate groups ideally, but I could deal with that after the fact) = 98.1, 2.7E-09, 7.3E-14, 107.6
Such a shame this program doesn't just provide a proper tabular output :(
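For the desired output above, a regex-free alternative is to 'filter out' the fields by position: the description has a variable number of words, but the columns after it do not, so they can be counted from the right-hand end of the line. This is a rough sketch, not a tested parser for the real file. `parse_row_by_split` is a hypothetical name, and it assumes the nine trailing columns are always separated by whitespace, as in the rows pasted in this post (if the program ever fuses the last two columns, the indices would be off by one).

```python
def parse_row_by_split(line):
    """Return (hit, prob, e_value, p_value, score), or None for non-table lines."""
    tokens = line.split()
    # A table row is: No, Hit, zero or more description words, then 9 columns
    # (Prob, E-value, P-value, Score, SS, Cols, Query HMM, Template HMM, length).
    if len(tokens) < 11 or not tokens[0].isdigit():
        return None
    hit = tokens[1]
    # Count from the right so the variable-length description is skipped.
    prob, e_value, p_value, score = (float(t) for t in tokens[-9:-5])
    return hit, prob, e_value, p_value, score

sample = ('1 3izo_F Fiber; pentameric pento 98.1 2.7E-09 7.3E-14 '
          '107.6 0.0 65 93-160 104-168 (581)')
print(parse_row_by_split(sample))
```

Converting with `float()` handles both the plain and the exponent notation, so no special-casing of the two number formats is needed.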