
I have downloaded the Pfam database, but to proceed with my work I need to split it into individual files, one per HMM. I first tried the `hmmfetch` command:

Usage: hmmfetch [options] -f <hmmfile> <keyfile>  (retrieves all HMMs in <keyfile>)

With this procedure I can retrieve some HMMs, but I have to list their names in the keyfile. That is not convenient, because I need to retrieve every HMM present in the original file.

The next thing I tried was splitting the original file into individual files with the following command:

csplit --digits=2  --quiet --prefix=hmm Pfam-A.hmm "////+1" "{*}"

This splits the file into individual files without problems. The only thing I could not figure out is how to give each file the name of its HMM. Each HMM file looks like this:

HMMER3/f [3.1b2 | February 2015]
NAME  120_Rick_ant
ACC   PF12574.11
DESC  120 KDa Rickettsia surface antigen
LENG  238
ALPH  amino
RF    no
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Tue Oct 12 02:07:11 2021
NSEQ  2
EFFN  0.449219
CKSUM 3984216663
GA    25 25;
TC    39.8 39.6;
NC    23.6 21.2;
BM    hmmbuild HMM.ann SEED.ann
SM    hmmsearch -Z 61295632 -E 1000 --cpu 4 HMM pfamseq
STATS LOCAL MSV      -10.8956  0.70336
STATS LOCAL VITERBI  -11.6161  0.70336
STATS LOCAL FORWARD   -5.3029  0.70336
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.48852  4.43316  2.82069  2.56851  3.39369  2.73712  3.79297  2.89060  2.54228  2.53662  3.76796  3.01951  3.39446  3.08353  3.05948  2.67787  2.83658  2.66102  4.89473  3.44979
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.03268  3.83303  4.55537  0.61958  0.77255  0.00000        *
      1   3.11165  4.58599  4.12585  3.76620  3.12182  3.93147  4.43434  2.32453  3.53431  0.92536  3.15834  4.04543  4.37407  3.91210  3.71656  3.49871  3.40796  2.35149  4.98612  3.70011      1 l - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.03268  3.83303  4.55537  0.61958  0.77255  0.48576  0.95510
      2   1.07216  4.17353  3.42348  3.21371  4.01396  2.99897  4.24029  3.13365  3.22896  3.01700  4.05375  3.37300  3.73453  3.57391  3.48180  2.52446  2.79912  2.79493  5.44509  4.24110      2 a - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.03268  3.83303  4.55537  0.61958  0.77255  0.48576  0.95510
      3   2.91965  5.02079  2.47306  1.08285  4.36227  3.24954  3.83381  3.80837  2.70946  3.43216  4.40865  2.91254  3.85246  3.05076  3.11366  2.90651  3.22382  3.49656  5.54134  4.26436      3 e - - -
...
//

With my approach this file is called "hmm01", but I would like it to be named "120_Rick_ant.hmm". Does anyone know a way to do this? Thanks in advance!

  • So you want to rename the files based on the `NAME` key? What if two files have the same `NAME`, how do you fuse them? – Fravadona Jan 20 '22 at 16:19
  • Exactly. The files should never have the same `NAME` key. I am looking for a way to rename the files, as you said, or alternatively a way to split the original file that names the output files directly from the `NAME` key. – Alex galvez morante Jan 20 '22 at 16:23
  • Does this answer your question? [csplit prefix as file context](https://stackoverflow.com/questions/38635647/csplit-prefix-as-file-context) – Jacques Gaudin Jan 20 '22 at 16:36
  • You've tagged 3 languages but only seem to be using 1 of them – camille Jan 20 '22 at 16:37

1 Answer

A basic solution using GNU/BSD awk:

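#!/bin/bash

# awk prints "<NAME value> <file name>" for the first NAME line of each file,
# then nextfile (GNU/BSD awk) skips the rest of that file.
while read -r id filename
do
    # Dry run: echo only prints each mv command instead of running it.
    echo mv "$filename" "$id".hmm
done < <(awk '$1 == "NAME" {print $2,FILENAME; nextfile}' hmm*)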
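
The comments raise the case of two files sharing the same `NAME`. A possible variant of the loop above is sketched here, under the assumption that a duplicate should simply be left untouched rather than merged. The function name `rename_by_name` is hypothetical. It runs one `awk` per file and uses `exit` instead of `nextfile`, which trades speed for not depending on a particular awk implementation.

```shell
#!/bin/bash

# rename_by_name FILE...
# Hypothetical helper: renames each file to <NAME>.hmm in the same directory,
# skipping files without a NAME line and never overwriting an existing target.
rename_by_name() {
    local file id dir target
    for file in "$@"; do
        # Take the second field of the first NAME line, then stop reading.
        id=$(awk '$1 == "NAME" { print $2; exit }' "$file")
        if [ -z "$id" ]; then
            echo "skip: no NAME in $file" >&2
            continue
        fi
        dir=$(dirname -- "$file")
        target="$dir/$id.hmm"
        if [ -e "$target" ]; then
            # Assumption: duplicates are reported and left as they are.
            echo "skip: $target already exists" >&2
            continue
        fi
        mv -- "$file" "$target"
    done
}

# Usage (after the csplit step): rename_by_name hmm*
```

Any file that is skipped keeps its `hmmNN` name, so leftovers are easy to spot afterwards.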
Fravadona
  • Thanks, this is super useful! The only problem is that it prints the commands in my Linux terminal instead of executing them. How can I solve that? – Alex galvez morante Jan 20 '22 at 16:53
  • That's expected. Once you have checked that it does what you expect, remove the `echo` before the `mv` command – Fravadona Jan 20 '22 at 16:56
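
The other option mentioned in the question's comments, splitting the original file so that each output is named after its `NAME` key directly, could be sketched roughly as follows. This is only an illustration, not a tested Pfam workflow: the function name `split_hmm_by_name` is hypothetical, and it assumes every record ends with a line containing only `//`, that each record has a `NAME` line, and that names contain no `/` characters.

```shell
#!/bin/bash

# split_hmm_by_name FILE [OUTDIR]
# Hypothetical helper: writes each record of a concatenated HMM file to
# OUTDIR/<NAME>.hmm, where a record ends at a line containing only "//".
split_hmm_by_name() {
    local input=$1 outdir=${2:-.}
    mkdir -p "$outdir" || return 1
    awk -v outdir="$outdir" '
        { buf = buf $0 "\n" }                        # accumulate the current record
        $1 == "NAME" && name == "" { name = $2 }     # remember the first NAME value
        $0 == "//" {                                 # end of record: write it out
            if (name != "") {
                out = outdir "/" name ".hmm"
                printf "%s", buf > out
                close(out)                           # keep the number of open files low
            }
            buf = ""; name = ""
        }
    ' "$input"
}

# Usage: split_hmm_by_name Pfam-A.hmm split_hmms
```

Because each output file is closed as soon as it is written, a later record with the same `NAME` would overwrite the earlier one. Records without a `NAME` line are dropped.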