python regex separated elements of a string

Question

Thought I'd just update the start of this question for people who come across it in future. Regex was not the optimal solution for my particular problem; trying to match complicated, separated patterns in one go (my logic from the start) wasn't ideal. The answer to the question as stated would, I think, be to try separate regexes and 'filter out' the parts needed. My file could be handled with pandas.read_fwf() for the best results, so I chose that as the accepted answer.

I'm sure this has been asked somewhere before but I can't find a question that is exactly trying to do what I want - so my apologies in advance.

TLDR How would you regex for several different patterns in a line that are not located next to each other, or properly delimited? Am I wrong to be trying to do this in one move?

I have some strings in a pretty verbose file (see end of post) that I want to pull out. I want multiple bits of information from different columns within a line (though they are not properly delimited).

I know I can get this into match.group(), which would be perfect (because I intend to use each element I pull out later in isolation), except I can't figure out how to match several substrings that are physically separated from one another in the string (unless trying to do this in one go is just wrong?).

I can extract the table part that I want with some simple regex no problem:

#!/usr/bin/python

import re
import sys

hhresult_file = sys.argv[1]  # The hhsearch result file (shown at the end of this post)

regex = re.compile(r'\s*\d{1,2}\s\w{4}_\w\s.*') # Will match the whole line (my first shot at the problem)

def main():
    with open(hhresult_file, 'r') as result_fasta:
        lines = result_fasta.readlines()
        for line in lines:
            match = re.search(regex,line)
            if match:
                print(match.group())

if __name__ == '__main__' :
    main()

But I'm also trying to pull out the columns which read "Hit" "Prob" "E-Value" "P-Value".

I think I can synthesise the required regex for each individual field (there are some nuances, like the switch between exponent notation and plain floats, for example).

What I don't know how to do is 'disregard' regions of the string? Specifically, I can't get the 'Hit' (= 3izo_F) and then the 'Prob' field because of the hit description in the intervening space.

I was trying to go about it with grouped regexes, but without being physically adjacent it doesn't work (something like these, though there may be errors in them):

regex = re.compile(r'''
                  (\w{4}_\w) # Match the hit
                  (\d{1,3}\.\d) # Match the probability score
                  (\d\.?\d?|\d\.?\d?E-\d\d|\d\.\d*) # E value as float/E-
                  (\d\.?\d?|\d\.?\d?E-\d\d|\d\.\d*) # Match SI or float P value
                  (\d+\.\d+) # Match the score
                 ''',re.VERBOSE)

The file in question:

Query         PAU_03380 PAU_03380 hypothetical protein 3919442:3920968 reverse MW:51681
Match_columns 508
No_of_seqs    1 out of 1
Neff          1.0
Searched_HMMs 37488
Date          Mon May 23 20:23:54 2016
Command       hhsearch -cpu 10 -i /home/wms_joe/PVCs/PVC_operons/prot_all/PAU_03380.faa -d /home/wms_joe/Applications/HHSuite/databases/pdb70/pdb70_hhm.ffdata -B 5 -Z 5 -E 1E-03 -nocons -nopred -nodssp

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
  3 1ocy_A Bacteriophage T4 short   97.6 1.8E-07 4.7E-12   80.4   0.0   85  323-418    10-122 (198)
  4 1v1h_A Fibritin, fiber protein  96.1 0.00011   3E-09   60.4   0.0   30  167-198     2-31  (103)
  5 1v1h_A Fibritin, fiber protein  95.9 0.00019 5.1E-09   59.1   0.0   10  168-177    41-50  (103)
  6 1pdi_A Short tail fiber protei  95.6 0.00041 1.1E-08   63.3   0.0   26  323-348    90-116 (278)
  7 2xgf_A Long tail fiber protein  94.1   0.005 1.3E-07   55.1   0.0   31  318-348    22-52  (242)
  8 1h6w_A Bacteriophage T4 short   84.7    0.25 6.7E-06   47.1   0.0   27  323-349   255-282 (312)
  9 1qiu_A Adenovirus fibre; fibre  79.9    0.54 1.4E-05   44.4   0.0   24   92-115     7-30  (264)
 10 3s6x_A Outer capsid protein si  72.0     1.3 3.4E-05   43.6   0.0   69  106-191    44-112 (325)

No 1
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=98.13  E-value=2.7e-09  Score=107.58  Aligned_cols=65  Identities=22%  Similarity=0.362  Sum_probs=42.7

 Q PAU_03380        93  PLILKDDVLSVDLGSGLTNETNGICVGQGDGITVNTSNVAVKQGNGISVTSSGGVAVKVSANKGLSVD  160 (508)
                  ||-+.++-|.++....|+...+++.+--+++++|+.....++....++++ .+++++++.  .||.++
T 3izo_F          104 PLTVTSEALTVAAAAPLMVAGNTLTMQSQAPLTVHDSKLSIATQGPLTVS-EGKLALQTS--GPLTTT  168 (581)
Confidence            55555556666666667777777777777777777776777777777764 566666554  355554


No 2
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=97.60  E-value=1.3e-07  Score=95.57  Aligned_cols=156  Identities=19%  Similarity=0.323  Sum_probs=85.6

Q PAU_03380       156 GLSVDSSGVAVKVNTDKGISVDGNGVAVKVNTSKGISVDNTGVAVIANASKGISVDGSGV--------------AVIANT  221 (508)
                  .|.+..++-.+.+++..|+.|.++.+.+|+  ..++.+++.|- +-.+...|+.++...-              .+..+.
T 3izo_F          210 PLHVTDDLNTLTVATGPGVTINNTSLQTKV--TGALGFDSQGN-MQLNVAGGLRIDSQNRRLILDVSYPFDAQNQLNLRL  286 (581)
Confidence            344544434556666667777666655443  23333333221 1111222333332211              234445

It goes on a bit but is just more of the above 2 alignments.

UPDATE 1

Just to provide an example of what I'd ideally like at the end:

Given the line in the 'short table':

 1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)

I'd like to get either a delimited string, or separate match.group for:

The PDB Hit ID == 3izo_F

Each of the first 4 metrics (as separate groups ideally, but I could deal with that after the fact) = 98.1 2.7E-09 7.3E-14 107.6

Such a shame this program doesn't just provide a proper tabular output :(

With something this complex, you are probably better off using multiple passes. You may want to look into pandas as a more general solution. The dataframes type in pandas is very useful when working with data tables like this. http://pandas.pydata.org/ — pizoelectric, Oct 11 '16 at 17:35
Which part of the file are you trying to parse? Short table where hit can be truncated or the expanded part? — Alexey Guseynov, Oct 11 '16 at 17:35
I don't think you'd call this 'typical' tabular data (since it's not reliably delimited) - can pandas handle that too? @C8H10N4O2 - I don't think I have any choice but to treat it via regex because it's not standard tabular format as I say - and as you can see, the file has content before and after in different formats that I'm disregarding. — Joe Healey, Oct 12 '16 at 09:11
@JoeHealey elsewhere you say you are OK with disregarding the non-tabular data. So I suggest that you see `pandas.read_fwf`, particularly the `skiprows=` and `nrows=` parameters, to read only the tabular portion. — C8H10N4O2, Oct 12 '16 at 14:11
Ah I see - you mean to use it in conjunction, once I've stripped out the required lines of the file. On a brief look through, it looks like it may well do what I'm trying to do - I shall investigate further! — Joe Healey, Oct 12 '16 at 14:35

score 1 · Answer 1 · answered Oct 11 '16 at 18:16

You have two parts in your data file. One is a compact table:

  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
  3 1ocy_A Bacteriophage T4 short   97.6 1.8E-07 4.7E-12   80.4   0.0   85

Its fields have fixed positions, so instead of regular expressions you can use simple slices:

line[4:34]  for hit
line[36:40] for prob
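As a rough sketch of the slicing idea, the function below pulls the five fields asked for in the question out of one table row. The function name `parse_table_row` is hypothetical, and the offsets for E-value, P-value and Score are assumptions read off the sample rows in the question, not something stated in this answer.

```python
def parse_table_row(line):
    """Slice one compact-table row by fixed character positions."""
    hit = line[4:34].strip()
    return {
        "pdb_id": hit.split()[0],        # first token of the Hit field, e.g. 3izo_F
        # 35:40 is one character wider than 36:40, in case Prob is ever
        # printed as 100.0 (an assumption; float() ignores the padding).
        "prob": float(line[35:40]),
        "e_value": float(line[41:48]),   # assumed offsets, from the sample rows
        "p_value": float(line[49:56]),
        "score": float(line[57:63]),
    }


row = "  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)"
fields = parse_table_row(row)
print(fields)
```

float() accepts both the plain and the E-notation forms, so the switch between 0.00011 and 3E-09 needs no special handling here.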

But that table has a trimmed hit field. If you want its full content, you have to parse the second part of the file, and multiline regular expressions are a good choice for that. This one finds the hit, probability and E-value; feel free to expand it.

re.compile(r"No \d*\n>([^\n]*)\nProbab=([\d\.e\-]*).*E-value=([\d\.e\-]*).*", re.MULTILINE)
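A short usage sketch for that pattern, run against the first two alignment headers from the question. The variable names (`block_re`, `sample`, `hits`) are just illustrative; the pattern itself is the one above, unchanged.

```python
import re

# The pattern from the answer, only wrapped across lines for readability.
block_re = re.compile(
    r"No \d*\n>([^\n]*)\nProbab=([\d\.e\-]*).*E-value=([\d\.e\-]*).*",
    re.MULTILINE,
)

sample = (
    "No 1\n"
    ">3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}\n"
    "Probab=98.13  E-value=2.7e-09  Score=107.58  Aligned_cols=65  Identities=22%  Similarity=0.362  Sum_probs=42.7\n"
    "\n"
    "No 2\n"
    ">3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}\n"
    "Probab=97.60  E-value=1.3e-07  Score=95.57  Aligned_cols=156  Identities=19%  Similarity=0.323  Sum_probs=85.6\n"
)

hits = []
for description, prob, e_value in block_re.findall(sample):
    # The first token of the untruncated description line is the PDB ID.
    hits.append((description.split()[0], float(prob), float(e_value)))

print(hits)
```

Note that the character class has no `+` in it, so if the tool ever prints an exponent such as `e+01`, the captured text would be cut short and float() would fail; widening the class would be the fix in that case.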

But that part of the file does not contain the P-value, so it seems that you will have to combine these methods.
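If the truncated description in the compact table is acceptable, the question as literally asked (several separated groups in one regex) can also be handled by putting a lazy, uncaptured wildcard between the groups. This is only a sketch: `row_re` is a made-up name, and the pattern assumes the description never contains a standalone number shaped like a Prob value followed by three more numeric-looking columns.

```python
import re

row_re = re.compile(r"""
    ^\s*\d+\s+           # row number
    (\w{4}_\w)           # PDB hit ID, e.g. 3izo_F
    \s.*?\s              # description: skipped lazily, not captured
    (\d{1,3}\.\d)\s+     # Prob
    (\S+)\s+             # E-value (plain float or E-notation)
    (\S+)\s+             # P-value
    (\d+\.\d+)\s         # Score
""", re.VERBOSE)

rows = [
    " No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM",
    "  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)",
    " 10 3s6x_A Outer capsid protein si  72.0     1.3 3.4E-05   43.6   0.0   69  106-191    44-112 (325)",
]

# Lines that are not table rows (such as the header) simply fail to match.
matches = [m.groups() for m in map(row_re.match, rows) if m]
for groups in matches:
    print(groups)
```

Each element of `groups` is still a string; pass the numeric ones through float() when they are needed as numbers.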

I'm fine to disregard the 'trimmed hit description' part of the compact table. Also fine to disregard everything outside of that table. I've updated the question to include an example of the output I'm trying to achieve. — Joe Healey, Oct 12 '16 at 09:14

score 1 · Accepted Answer · answered Oct 12 '16 at 14:58

It is possible to use pandas.read_fwf to read the tabular portion, but because your table headers are malformed (i.e. sometimes a space is part of a variable name, as in Query HMM, and sometimes it separates variable names, as in SS and Cols) you are going to have to specify the column widths.

I like to use a template row to do this.

from io import StringIO

yourTemplate= \
"""
---|-------------------------------|----|-------|-------|------|-----|----|---------|--------------|
 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
"""
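# Line 0 is the empty line after the opening quotes; line 1 is the ---|--- ruler.
yourPattern = StringIO(yourTemplate).readlines()[1]

# Positions of the '|' markers, i.e. where each column ends.
colBreaks = [i for i, ch in enumerate(yourPattern) if ch == '|']

# Column widths: distance from the previous break (or from 0) to each break.
yourWidths = [j-i for i, j in zip( ([0]+colBreaks)[:-1], colBreaks ) ]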
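As a quick sanity check (a sketch, not part of the original recipe), the widths derived from the ruler can be applied to a row by hand with plain slicing, which shows which text each column will receive before pandas is involved. The names `ruler`, `breaks`, `widths` and `fields` are illustrative only.

```python
ruler = "---|-------------------------------|----|-------|-------|------|-----|----|---------|--------------|"
row = "  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)"

# Same computation as in the answer: '|' positions, then gaps between them.
breaks = [i for i, ch in enumerate(ruler) if ch == "|"]
widths = [j - i for i, j in zip([0] + breaks[:-1], breaks)]

# Walk the row, taking one width-sized slice per column.
fields = []
start = 0
for width in widths:
    fields.append(row[start:start + width].strip())
    start += width

print(widths)
print(fields)
```

If a field comes out clipped or merged with its neighbour, the fix is to move the corresponding `|` in the ruler.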

Then we can go back to your file.

yourText= \
"""Neff          1.0
Searched_HMMs 37488
Date          Mon May 23 20:23:54 2016
Command       hhsearch -cpu 10 -i /home/wms_joe/PVCs/PVC_operons/prot_all/PAU_03380.faa -d /home/wms_joe/Applications/HHSuite/databases/pdb70/pdb70_hhm.ffdata -B 5 -Z 5 -E 1E-03 -nocons -nopred -nodssp

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)
  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)
  3 1ocy_A Bacteriophage T4 short   97.6 1.8E-07 4.7E-12   80.4   0.0   85  323-418    10-122 (198)
  4 1v1h_A Fibritin, fiber protein  96.1 0.00011   3E-09   60.4   0.0   30  167-198     2-31  (103)
  5 1v1h_A Fibritin, fiber protein  95.9 0.00019 5.1E-09   59.1   0.0   10  168-177    41-50  (103)
  6 1pdi_A Short tail fiber protei  95.6 0.00041 1.1E-08   63.3   0.0   26  323-348    90-116 (278)
  7 2xgf_A Long tail fiber protein  94.1   0.005 1.3E-07   55.1   0.0   31  318-348    22-52  (242)
  8 1h6w_A Bacteriophage T4 short   84.7    0.25 6.7E-06   47.1   0.0   27  323-349   255-282 (312)
  9 1qiu_A Adenovirus fibre; fibre  79.9    0.54 1.4E-05   44.4   0.0   24   92-115     7-30  (264)
 10 3s6x_A Outer capsid protein si  72.0     1.3 3.4E-05   43.6   0.0   69  106-191    44-112 (325)

No 1
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=98.13  E-value=2.7e-09  Score=107.58  Aligned_cols=65  Identities=22%  Similarity=0.362  Sum_probs=42.7

 Q PAU_03380        93  PLILKDDVLSVDLGSGLTNETNGICVGQGDGITVNTSNVAVKQGNGISVTSSGGVAVKVSANKGLSVD  160 (508)
                  ||-+.++-|.++....|+...+++.+--+++++|+.....++....++++ .+++++++.  .||.++
T 3izo_F          104 PLTVTSEALTVAAAAPLMVAGNTLTMQSQAPLTVHDSKLSIATQGPLTVS-EGKLALQTS--GPLTTT  168 (581)
Confidence            55555556666666667777777777777777777776777777777764 566666554  355554


No 2
>3izo_F Fiber; pentameric penton base, trimeri viral protein; 3.60A {Human adenovirus 5}
Probab=97.60  E-value=1.3e-07  Score=95.57  Aligned_cols=156  Identities=19%  Similarity=0.323  Sum_probs=85.6

Q PAU_03380       156 GLSVDSSGVAVKVNTDKGISVDGNGVAVKVNTSKGISVDNTGVAVIANASKGISVDGSGV--------------AVIANT  221 (508)
                  .|.+..++-.+.+++..|+.|.++.+.+|+  ..++.+++.|- +-.+...|+.++...-              .+..+.
T 3izo_F          210 PLHVTDDLNTLTVATGPGVTINNTSLQTKV--TGALGFDSQGN-MQLNVAGGLRIDSQNRRLILDVSYPFDAQNQLNLRL  286 (581)
Confidence            344544434556666667777666655443  23333333221 1111222333332211              234445
"""

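Note that in the excerpt above, which starts at the Neff line, reaching the tabular portion (starting with its header) means skipping 5 rows and then keeping 10. For the complete file in the question, which has three more lines at the top, that would be skiprows=8.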
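Counting rows by hand is fragile if the preamble or the number of hits changes between files. One possible alternative, sketched below, is to locate the table first. The helper name `find_table` is hypothetical, and it assumes the header line starts with "No Hit" once leading spaces are stripped and that the table ends at the first blank line, as in the sample file.

```python
def find_table(lines):
    """Return (skiprows, nrows) for the compact hit table in a list of lines."""
    # Index of the header line; raises StopIteration if there is none.
    header = next(
        i for i, line in enumerate(lines) if line.lstrip().startswith("No Hit")
    )
    # Count data rows until the first blank line after the header.
    nrows = 0
    for line in lines[header + 1:]:
        if not line.strip():
            break
        nrows += 1
    return header, nrows


sample_lines = [
    "Neff          1.0",
    "Searched_HMMs 37488",
    "",
    " No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM",
    "  1 3izo_F Fiber; pentameric pento  98.1 2.7E-09 7.3E-14  107.6   0.0   65   93-160   104-168 (581)",
    "  2 3izo_F Fiber; pentameric pento  97.6 1.3E-07 3.4E-12   95.6   0.0  156  156-317   210-388 (581)",
    "",
    "No 1",
]
skiprows, nrows = find_table(sample_lines)
print(skiprows, nrows)
```

The two returned numbers can then be used in place of the hard-coded values for `skiprows=` and `nrows=`.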

import pandas as pd
yourData = pd.read_fwf(StringIO(yourText), skiprows=5, nrows=10, header=0, widths = yourWidths)
print(yourData.dtypes)
print(yourData)

This should give you what you want, in tabular form:

No                int64
Hit              object
Prob            float64
E-value         float64
P-value         float64
Score           float64
SS              float64
Cols              int64
Query HMM        object
Template HMM     object
dtype: object
   No                             Hit  Prob       E-value       P-value  \
0   1  3izo_F Fiber; pentameric pento  98.1  2.700000e-09  7.300000e-14   
1   2  3izo_F Fiber; pentameric pento  97.6  1.300000e-07  3.400000e-12   
2   3   1ocy_A Bacteriophage T4 short  97.6  1.800000e-07  4.700000e-12   
3   4  1v1h_A Fibritin, fiber protein  96.1  1.100000e-04  3.000000e-09   
4   5  1v1h_A Fibritin, fiber protein  95.9  1.900000e-04  5.100000e-09   
5   6  1pdi_A Short tail fiber protei  95.6  4.100000e-04  1.100000e-08   
6   7  2xgf_A Long tail fiber protein  94.1  5.000000e-03  1.300000e-07   
7   8   1h6w_A Bacteriophage T4 short  84.7  2.500000e-01  6.700000e-06   
8   9  1qiu_A Adenovirus fibre; fibre  79.9  5.400000e-01  1.400000e-05   
9  10  3s6x_A Outer capsid protein si  72.0  1.300000e+00  3.400000e-05   

   Score   SS  Cols Query HMM   Template HMM  
0  107.6  0.0    65    93-160  104-168 (581)  
1   95.6  0.0   156   156-317  210-388 (581)  
2   80.4  0.0    85   323-418   10-122 (198)  
3   60.4  0.0    30   167-198    2-31  (103)  
4   59.1  0.0    10   168-177   41-50  (103)  
5   63.3  0.0    26   323-348   90-116 (278)  
6   55.1  0.0    31   318-348   22-52  (242)  
7   47.1  0.0    27   323-349  255-282 (312)  
8   44.4  0.0    24    92-115    7-30  (264)  
9   43.6  0.0    69   106-191   44-112 (325)

The pandas syntax to access these values is quite straightforward: for example, yourData.loc[3,'Prob'] returns the Prob value of the row with index 3 (96.1 in the output above).
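To get from the table to the output requested in the question (the PDB ID plus the first four metrics), something along these lines should work. It is a sketch: the small DataFrame is a stand-in built from two rows of the printed table, and the `PDB` column name is made up for illustration.

```python
import pandas as pd

# Stand-in for yourData: two rows copied from the printed table.
yourData = pd.DataFrame({
    "Hit": ["3izo_F Fiber; pentameric pento", "1v1h_A Fibritin, fiber protein"],
    "Prob": [98.1, 96.1],
    "E-value": [2.7e-09, 1.1e-04],
    "P-value": [7.3e-14, 3.0e-09],
    "Score": [107.6, 60.4],
})

# The PDB ID is the first whitespace-separated token of the Hit column.
yourData["PDB"] = yourData["Hit"].str.split().str[0]

# A single cell, addressed by row label and column name.
first_prob = yourData.loc[0, "Prob"]

# Only the fields asked for in the question, as one dict per hit.
wanted = yourData[["PDB", "Prob", "E-value", "P-value", "Score"]]
records = wanted.to_dict("records")
print(records)
```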

I may be trying to do this while still not fully awake, and this is probably really obvious, but StringIO is complaining at me about the template not being unicode. What do I have to do to the template string to make it shut up :P ? I know a `u` can be given to `StringIO(u'somestring')` as in this link: http://stackoverflow.com/questions/22316333/how-can-i-resolve-typeerror-with-stringio-in-python-2-7 ....But I'm passing in a variable so I can't enclose it in quotes — Joe Healey, Oct 14 '16 at 08:55
Ah, never mind... I woke up a little more and realised it just needed the `u` adding before the opening triple quote of the template string — Joe Healey, Oct 14 '16 at 09:04
