i have a protein sequence file i want to count trimers in it using sed or grep

Question

I have a protein sequence file in the following format

uniprotID\space\sequence

sequence is a string of any length but with only 20 allowed letters i.e.

ARNDCQEGHILKMFPSTWYV

Example of 1 record

Q5768D AKCCACAKCCAC

I want to create a csv file in the following format

Q5768D     

12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2

This is what I'm currently trying:

#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g'  > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
  for (( X=0; X <= "$SQL"; X++ ))
      do
         Y=$(expr $X + 1)
         Z=$(expr $X + 2)
         echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
     done  | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1

in this code i am storing individual record in a different file which is not good. Also the program is very slow in 12 hours only 12000 records are accessed out of .5 million records.

How many lines in the input file? Are you looking for that many columns in the output file? And how do you determine a trimer (sorry, I know nothing about proteins).? Would any 3 in the example count? I.e., the first seen is ARC, and the second is RCQ, and the third is CQT? — donjuedo, Jul 19 '15 at 13:36
@donjuedo just think of them as strings to begin with. i used sort command to sort the trimers so that the trimers are alphabetical in order. RCQ will be in lower end of the file. the script is functional but its very slow. i need to increase its speed so that at least more than 0.1 million lines of the record file can be processed in an hour. currently as u can see for each line a seperate file is being created. which is time consuming. i desperately need toincrease its speed — samar, Jul 19 '15 at 13:53
Must it be done in shell script? Not only could you be stretching bash beyond its limits, but it may be simpler and a whole lot faster to both write and execute in a proper application language. It does matter much which language you chose. If you have experience in one already, go with that. If not, python could be a good place to start. — Bohemian, Jul 19 '15 at 16:26
and pls explain that "csv" - csv is not for storing key-value pairs (sequence-count in your case) in rows. sounds more like JSON or other structured text format. — rogerovo, Jul 19 '15 at 19:36
i figured it out there was tab instead of space in the lines. i have corrected it. thank you for your answer and effort — samar, Jul 20 '15 at 13:39

score 2 · Answer 1 · edited May 23 '17 at 11:58

If this is what you want:

$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D  OTHER
12      12
AKC 2   FOO 2
KCC 2   OOB 2
CCA 2   OBA 2
CAC 2   BAR 2
ACA 1   ARF 1
CAK 1   RFO 1

This will do it:

$ cat tst.awk
BEGIN { OFS="\t" }
{
    colNr = NR
    rowNr = 0
    name[colNr] = $1
    lgth[colNr] = length($2)
    delete name2nr
    for (i=1;i<=(length($2)-2);i++) {
        trimer = substr($2,i,3)
        if ( !(trimer in name2nr) ) {
            name2nr[trimer] = ++rowNr
            nr2name[colNr,rowNr] = trimer
        }
        cnt[colNr,name2nr[trimer]]++
    }
    numCols = colNr
    numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
    for (colNr=1;colNr<=numCols;colNr++) {
        printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
    }
    for (colNr=1;colNr<=numCols;colNr++) {
        printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
    }
    for (rowNr=1;rowNr<=numRows;rowNr++) {
        for (colNr=1;colNr<=numCols;colNr++) {
            printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
        }
    }
}

If instead you want output like in @rogerovo's perl answer that'd be much simpler than the above and more efficient and use far less memory:

$ cat tst2.awk
{
    delete cnt
    for (i=1;i<=(length($2)-2);i++) {
        cnt[substr($2,i,3)]++
    }
    printf "%s;%s", $1, length($2)
    for (trimer in cnt) {
        printf ";%s=%s", trimer, cnt[trimer]
    }
    print ""
}

$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2

thank you @ed morton. I tried your solution also. but rogerovos solution is faster. thank you for solutions and suggestions i will keep in mind these things. — samar, Jul 21 '15 at 05:14

rogerovo · Accepted Answer · 2015-07-20T14:09:14.603

This perl script processes cca 550'000 "trimmers"/sec. (random valid test sequences 0-8000 chars long, 100k records (~400MB) produce an 2GB output csv)

output:

Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...

code:

#!/usr/bin/perl
use strict;
$|=1;
my $c;

# process each line on input
while (readline STDIN) {
  $c++; chomp;
  # is it a valid line? has the format and a sequence to process
  if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
    print join ";",($1,length($2));
    my %trimdb;
    my $seq=$2;
    #split the sequence into chars
    my @a=split //,$seq;
    my @trimmer;

    # while there are unprocessed chars in the sequence...
    while (scalar @a) {

      # fill up the buffer with a char from the top of the sequence
      push @trimmer, shift @a;

      # if the buffer is full (has 3 chars), increase the trimer frequency
      if (scalar @trimmer == 3 ) {
        $trimdb{(join "",@trimmer)}++;
        # drop the first letter from buffer, for next loop
        shift @trimmer;
      }
    }

    # we're done with the sequence - print the sorted list of trimers
    foreach (sort keys %trimdb) {

      #print in a csv (;) line
      print ";$_=$trimdb{$_}";
    }
    print"\n";
  }
  else {
    #the input line was not valid.
    print STDERR "input error: $_\n";
  }
  # just a progress counter
  printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";

if you have perl installed (most linuxes do, check the path /usr/bin/perl or replace with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

thank you all for the answers. @rogerovo can you comment in lines so that i can understand the answer. i know nothing about perl. — samar, Jul 20 '15 at 09:18
may be the line endings... is it from Mac or Win ? try this version, or we will shift to gist.github.com then :) — rogerovo, Jul 20 '15 at 13:34
clearly, you need someone to analyze and fix that process. the time will be cheaper than yours doing all that manual stuff in near future ;) — rogerovo, Jul 20 '15 at 19:45

i have a protein sequence file i want to count trimers in it using sed or grep

2 Answers2