Convert a hmmer --tblout output to a pandas dataframe

Question

Is there a way to convert HMMER output to a pandas DataFrame?
I am also unsure how to load a HMMER --tblout table into Python via the Bio module.

I believe you can read a HMMER format with SeqIO.parse or SeqIO.search. The table looks tab-separated, but the columns are actually padded with a variable number of spaces. So even if I strip the header and the # comment lines and keep only the table rows, there is no easy way to split them on a tab separator.

A small example of a hmmer --tblout file is below:

#                                                                                       --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name                                   accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------                         -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
3300000568@Draft_10015026@Draft_1001502652 -          Bacteria_NODE_1_length_628658_cov_8.291329_24 -            7.1e-07   29.3   0.0   1.9e-05   24.6   0.0   2.0   1   1   1   2   2   2   2 -
7000000546@SRS019910_WUGC_scaffold_3948@SRS019910_WUGC_scaffold_3948_gene_2890 -          Bacteria_NODE_1_length_628658_cov_8.291329_53 -            1.6e-07   31.7   0.0   0.00051   20.3   0.0   2.2   2   0   0   2   2   2   2 -
#
# Program:         hmmscan
# Version:         3.1b2 (February 2015)
# Pipeline mode:   SCAN
# Query file:      ../Exponential_High_Complexity_Simulation.faa
# Target file:     final_list.hmm
# Option settings: hmmscan --tblout Exponential_Earth.txt -E 1e-5 --cpu 8 final_list.hmm ../Exponential_High_Complexity_Simulation.faa 
# Current dir:     /Strong/home/glickmanc/Programs/EarthVirome
# Date:            Mon Feb 24 10:47:51 2020
# [ok]

BioGeek · Accepted Answer · 2020-05-27T14:44:49.130

I would build a dictionary from the attributes you are interested in and make a DataFrame from that dictionary. Say you are interested in the attributes of the hits:

from collections import defaultdict
import pandas as pd
from Bio import SearchIO

filename = 'test.hmmer'

# Hit attributes to collect, one DataFrame column per attribute
attribs = ['accession', 'bias', 'bitscore', 'description', 'cluster_num',
           'domain_exp_num', 'domain_included_num', 'domain_obs_num',
           'domain_reported_num', 'env_num', 'evalue', 'id', 'overlap_num',
           'region_num']

hits = defaultdict(list)

with open(filename) as handle:
    for queryresult in SearchIO.parse(handle, 'hmmer3-tab'):
        #print(queryresult.id)
        #print(queryresult.accession)
        #print(queryresult.description)
        for hit in queryresult.hits:
            for attrib in attribs:
                hits[attrib].append(getattr(hit, attrib))

df = pd.DataFrame.from_dict(hits)
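
If Biopython is not an option, the table can probably also be read with plain whitespace splitting, since every column except the trailing description is free of spaces. The following is a minimal sketch under that assumption. The function name `parse_tblout` and the snake_case column names are made up for illustration and are not part of HMMER or Biopython.

```python
# Column names chosen for this sketch, following the order of the
# "# target name ... description of target" header line.
TBLOUT_COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "dom_evalue", "dom_score", "dom_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
    "description",
]
FLOAT_COLUMNS = TBLOUT_COLUMNS[4:11]   # E-values, scores, biases, exp
INT_COLUMNS = TBLOUT_COLUMNS[11:18]    # reg ... inc


def parse_tblout(lines):
    """Return one dict per data row of a --tblout file."""
    rows = []
    n_splits = len(TBLOUT_COLUMNS) - 1  # 18 splits give 19 fields
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' header/footer lines
        if not line or line.startswith("#"):
            continue
        # Split on runs of whitespace; the remainder of the line
        # (the description, which may contain spaces) stays in one field
        fields = line.split(maxsplit=n_splits)
        if len(fields) != len(TBLOUT_COLUMNS):
            raise ValueError(f"unexpected number of columns: {line!r}")
        row = dict(zip(TBLOUT_COLUMNS, fields))
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows
```

The returned list of dicts can be passed to `pd.DataFrame(rows)`, for example after `rows = parse_tblout(open(filename))`. Capping the number of splits at 18 is what keeps a multi-word description in a single column.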

In my case, this works great, but fails to properly populate the query accession (helpfully supplied but commented out at `queryresult.accession`). I instead get `-` rather than the actual accession. My solution was to simply save this at the outer loop and then pass it into the hits dict for each hit of the query rather than include it in `attribs`. — Maximilian Press, Jan 04 '21 at 20:58
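
The workaround described in this comment could look roughly like the sketch below: the query's id and accession are read in the outer loop and appended once per hit. The helper name `collect_hits` and the column names `query_id` and `query_accession` are hypothetical. The function only assumes that each query result exposes `id`, `accession` and `hits`, as the objects in the answer's loop do.

```python
from collections import defaultdict


def collect_hits(queryresults, attribs):
    """Build a dict of column lists, one entry per hit."""
    hits = defaultdict(list)
    for queryresult in queryresults:
        for hit in queryresult.hits:
            # Query-level values are taken from the outer loop so that
            # every hit row records which query it belongs to
            hits["query_id"].append(queryresult.id)
            hits["query_accession"].append(queryresult.accession)
            # Hit-level values are read from the hit itself
            for attrib in attribs:
                hits[attrib].append(getattr(hit, attrib))
    return dict(hits)
```

With the answer's setup, this would be called as `collect_hits(SearchIO.parse(handle, 'hmmer3-tab'), attribs)` inside the `with` block, and the result passed to `pd.DataFrame.from_dict`.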
