0

I'm looking to generate a folder with pdb file of every peptide of 7 (lentgh) specific amino acids. I was thinking to firstly making a simple linux script to generate a file with all 7 letter combination like this :

AAAAAAA
AAAAAAB
AAAAABA
AAAABAA
AAABAAA
AABAAAA
ABAAAAA
BAAAAAA
AAAAABB
AAAABAB
...

I think this script can work but I'm not sure :

for c1 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
do
    for c2 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    do
        for c3 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
        do
            for c4 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
            do
                for c5 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
                do
                    printf "%s\n" "$c1$c2$c3$c4$c5"
                done
            done
        done
    done
done

And then using and other simple script which every row of the last file generate a peptide with pymol with this command :

for aa in "row1": cmd._alt(string.lower(aa))
save row1.pdb, all

I'm new in scripting to linux. Is anyone can help me please? Thanks

Grego
  • 59
  • 2
  • 9
  • Your script only generates 5-letter sequences, not 7-letter sequences, but otherwise would work. Whether it is efficient is a separate discussion. It generates 1M rows as written; the 7-letter version would generate 256M records, requiring 2 GiB disk space to hold it all. It's your judgement call whether that is 2 GiB well used. – Jonathan Leffler Oct 16 '15 at 21:47
  • Oops; I can't count — I miscounted 16 instead of 19. The size grows; 16^5 is 1M; 19^5 is around 2.5M. 16^7 is 256M; 19^7 is just under 900M. Scale accordingly. The bash script is slower than using `sed` iteratively (I was able to generate 5 levels with `sed` in a bit under 3 seconds; the shell script above took about 34 seconds, or 10 times as long. Test environment: 13" MacBook Pro, 2.7 GHz Intel Core i5, 8 GiB RAM, SSD storage. It took about 50 seconds to generate length 6 peptides with `sed`, roughly 19 times as long because there were 19 times as many lines to process.) – Jonathan Leffler Oct 16 '15 at 22:08

4 Answers4

3

I took a look at the idea of (ab?)using brace expansion:

p='{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}'
eval echo $p$p$p$p$p$p$p

Using such direct approach with all in one simple step of 7 $p's is just too much for bash. For no apparent reason, it eats all the memory (measurements with time show no other memory value increase so quickly).
The command is quite quick and amazingly simple for up to about 4 $p's, just two lines:

p='{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}'
eval echo $p$p$p$p

however, memory usage grows quite quickly. At a depth of 6 $p repetitions the process eats more than 7.80 Gigs of memory. The eval part also helps to increases execution time and memory usage.

An alternative approach was needed. So, I tried to make each step of expansion on its own, taking advantage of the concept that Jonathan Leffler used. For each line in the input, write 19 lines, each with an additional letter to the output. I found out that any eval is an important memory drain (not shown here).

Bash

A simpler bash filter is:

bashfilter(){
    while read -r line; do
        printf '%s\n' ${line}{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    done </dev/stdin
}

Which could be used for several levels of processing as:

echo | bashfilter | bashfilter | bashfilter

It just needs to repeat as many filter steps as letters per line are needed.

With this simpler approach: Memory was no longer an issue. Speed, however, got worse.

Leffler SED

Just to compare, use it as a measuring stick, I implemented Leffler idea:

# Building Leffler solution:
    leftext="$(<<<"${list}" sed -e 's/,/\n/g')"                 # list into a column.
    leftext="$(<<<"${leftext}" sed -e 's%.%s/$/&/p;s/&$//%')"   # each line ==> s/$/?/p;s/?$//
    # echo -e "This is the leffilter \n$leftext"
leffilter(){ sed -ne "$leftext"; }    # Define a function for easy use.

And is the leffilter which could be used recursively to get as many letters per line as needed:

echo | leffilter | leffilter | leffilter

Leffler solution did a letter insert and a letter erase.


SED

The amount of work could be cut down by not needing to erase one letter. We could store the original pattern space in the "hold space".

Then, just copy the first line to the hold space (h), and keep restoring it (g) and inserting just one letter.

# Building a sed solution:
    sedtext="$(<<<"${list}" sed -e 's/,/\n/g')"    # list into a column.
    sedtext="$(<<<"${sedtext}" sed -e 's%[A-Z]%g;s/$/&/p;%g')"  # s/$/?/p
    sedtext="$(<<<"${sedtext}" sed -e '1 s/g/h/' )"             # 1st is h
sedfilter(){ sed -ne "$sedtext"; }    # Define a function for easy use.  

Doing this makes speed better, about 1/3 (33%) lower. Or 1.47 times faster.


AWK

Finally, I present an AWK solution. I wrote it earlier, but is the fastest. And so I present it as the last option. The best till someone presents a better one :-)

# An AWK based solution:
awkfilter(){ awk 'BEGIN { split( "'"$list"'",l,",");}
                        { for (i in l) print $0 l[i] }'
}

Yep, just two lines. It is half the time or twice as fast as Leffler solution.

The full test script used follows. It re-calls itself to enable the use of the external time. Make sure it is an executable file with bash.

#!/bin/bash
TIMEFORMAT='%3lR %3lU %3lS'
list="A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y"

# A pure bash based solution:
bashfilter(){
    while read -r line; do
        printf '%s\n' ${line}{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    done </dev/stdin
}

# Building Leffler solution:
    leftext="$(<<<"${list}" sed -e 's/,/\n/g')"                 # list into a column.
    leftext="$(<<<"${leftext}" sed -e 's%.%s/$/&/p;s/&$//%')"   # each line ==> s/$/?/p;s/?$//
    # echo -e "This is the lef filter \n$leftext"
leffilter(){ sed -ne "$leftext"; }    # Define a function for easy use.

# Building a sed solution:
    sedtext="$(<<<"${list}" sed -e 's/,/\n/g')"                 # list into a column.
    sedtext="$(<<<"${sedtext}" sed -e 's%[A-Z]%g;s/$/&/p;%g')"  # each letter ==> s/$/?/p
    sedtext="$(<<<"${sedtext}" sed -e '1 s/g/h/' )"             # First command is 'h'.
    # echo -e "This is the sed filter \n$sedtext"
sedfilter(){ sed -ne "$sedtext"; }    # Define a function for easy use.

# An AWK based solution:
awkfilter(){ awk 'BEGIN { split( "'"$list"'",l,",");}
                        { for (i in l) print $0 l[i] }'
}

# Execute command filter
docommand(){
    local a count="$1" filter="$2" peptfile="$3"
    for (( i=0; i<count; i++ )); do
        case $filter in
            firsttry) a+=("{$list}"); ;;
            *)        a+=("| $filter"); ;;
        esac
    done
    [[ $filter == firsttry ]] && a+=('| sed '"'"'s/ /\n/'"'" )
    [[ -n $peptfile ]] && peptfile="$peptfile.$count"

    eval 'echo '"$(printf '%s' "${a[@]}")" > "${peptfile:-/dev/null}";
}

callcmd(){
    tf='wall:%e s:%S u:%U (%Xtext+%Ddata %F %p %t %Kmem %Mmax)'
    printf '%-12.12s' "$1" >&2
    /usr/bin/time -f "$tf" "$0" "$repeats" "$1" "$2"
}

nofile=1
if (( $#>=2 )); then
    docommand "$1" "$2" "$3"; exit 0
else
    for (( i=1; i<=6; i++)); do
        repeats=$i; echo "repeats done = $repeats"
        if ((nofile)); then
            callcmd firsttry
            callcmd bashfilter
            callcmd leffilter
            callcmd sedfilter
            callcmd awkfilter
        else
            callcmd firsttry   peptidesF
            callcmd bashfilter peptidesB
            callcmd leffilter  peptidesL
            callcmd sedfilter  peptidesS
            callcmd awkfilter  peptidesA
        fi
    done
fi

RESULTS

The external program /usr/bin/time was used (instead of the bash builtin time) to be able to measure the memory used. It was important in this problem.

With: tf='wall:%e s:%S u:%U (%Xtext+%Ddata %F %p %t %Kmem %Mmax)'

The results for 7 loops and true file output are easy to find with the above script, but I feel that filling about 21 gigs of disk space was just too much.

The results up to 6 loops were:

   repeats done = 1
firsttry    wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
bashfilter  wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1552max)
leffilter   wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
sedfilter   wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
awkfilter   wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)

:

   repeats done = 2
firsttry    wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
bashfilter  wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1552max)
leffilter   wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)
sedfilter   wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
awkfilter   wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)

:

   repeats done = 3
firsttry    wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1796max)
bashfilter  wall:0.07 s:0.00 u:0.05 (0text+0data 0 0 0 0mem 1552max)
leffilter   wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
sedfilter   wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)
awkfilter   wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)

:

   repeats done = 4
firsttry    wall:0.28 s:0.01 u:0.26 (0text+0data 0 0 0 0mem 25268max)
bashfilter  wall:0.96 s:0.03 u:0.94 (0text+0data 0 0 0 0mem 1552max)
leffilter   wall:0.13 s:0.00 u:0.12 (0text+0data 0 0 0 0mem 1560max)
sedfilter   wall:0.10 s:0.00 u:0.08 (0text+0data 0 0 0 0mem 1560max)
awkfilter   wall:0.09 s:0.00 u:0.07 (0text+0data 0 0 0 0mem 1560max)

:

   repeats done = 5
firsttry    wall:4.98 s:0.36 u:4.76 (0text+0data 0 0 0 0mem 465100max)
bashfilter  wall:20.19 s:0.81 u:20.18 (0text+0data 0 0 0 0mem 1552max)
leffilter   wall:2.43 s:0.00 u:2.50 (0text+0data 0 0 0 0mem 1556max)
sedfilter   wall:1.83 s:0.01 u:1.87 (0text+0data 0 0 0 0mem 1556max)
awkfilter   wall:1.49 s:0.00 u:1.54 (0text+0data 0 0 0 0mem 1560max)

:

   repeats done = 6
firsttry    wall:893.06 s:30.04 u:105.22 (0text+0data 402288 0 0 0mem 7802372m)
bashfilter  wall:365.13 s:14.95 u:368.09 (0text+0data 0 0 0 0mem 1548max)
leffilter   wall:51.90 s:0.09 u:53.91 (0text+0data 6 0 0 0mem 1560max)
sedfilter   wall:35.17 s:0.08 u:36.67 (0text+0data 0 0 0 0mem 1556max)
awkfilter   wall:25.60 s:0.06 u:26.77 (0text+0data 1 0 0 0mem 1556max)
1

Disclaimer: while I'm very happy to have figured out this algorithm based on base-19 numbers, it's unbearably slow (8 seconds for 3-letter strings, 160 seconds for the 4-letter ones, both with 19 amino acids, ran on a 2.2 GHz core i7 without actually saving the output) compared to other solutions as Jonathan Leffler hinted at. I'll leave it here anyway, in case someone else finds it as fun as I did.

Here's a possible alternative, with up to 19 amino acids (the ones you quote in your code):

aminoarr=("A" "D" "E" "F" "G" "H" "I" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W" "Y")

peplength=7
aminonum=19

N=0
while [ $N -le $(( ${aminonum}**${peplength} - 1 )) ]; do
  remain=$N
  #printf "%d " $N
  for k in $(seq $(( ${peplength}-1 )) -1 0 ) ; do
    digit=$(( ${remain} / (${aminonum}**${k}) ))
    printf "%s" ${aminoarr[$digit]}
    let remain=$(( ${remain} - ${digit}*(${aminonum}**${k}) ))
  done
  echo
  let N=${N}+1
done

Initially we define the array of amino acids (aminoarr), the length of the peptids we cant to generate (peplength), and the number of aminoacids from the list we want to choose (aminonum, should not be greater than 19).

We then loop from N to aminonum^peplength -1, essentially generating all possible numbers of base 19 with up to 7 digits (if we stick to the parameters in your question). Then we decompose each number in base 19 and choose the according aminoacids from the array aminoarr. Note that in base 19 every digit will fall between 0 and 18, so they are perfect for indexing the 19-element aminoarr.

If you uncomment the printf line it will give you the number of the given sequence, but this will make your file even larger (as @Jonathan Leffler has very rightly commented on the output size).

Anyway, here's a sample output for the first 20 lines:

AAAAAAA
AAAAAAD
AAAAAAE
AAAAAAF
AAAAAAG
AAAAAAH
AAAAAAI
AAAAAAK
AAAAAAL
AAAAAAM
AAAAAAN
AAAAAAP
AAAAAAQ
AAAAAAR
AAAAAAS
AAAAAAT
AAAAAAV
AAAAAAW
AAAAAAY
AAAAADA
Community
  • 1
  • 1
  • Did you time it on a full run, even if the output was to /dev/null instead of precious disk space? – Jonathan Leffler Oct 16 '15 at 22:10
  • @JonathanLeffler no, but I ran a `|tail -20` to check the end. Suffice to say, it's been running for 10 minutes:) I forgot to `time` it. Do you think it's unnecessarily slow due to all the arithmetic? Probably a 7-loop overhead would still be less. On the other hand, this is more flexible. – Andras Deak -- Слава Україні Oct 16 '15 at 22:13
  • @JonathanLeffler your possible suspicions are correct: I ran a 5-letter check to see how it compares to your benchmarks. It's been running again for like 10 minutes with no end yet. (And bad news: 2.2 GHz Core i7 with output to /dev/null.) – Andras Deak -- Слава Україні Oct 16 '15 at 22:36
  • @JonathanLeffler extrapolation gives 50 minutes for 5-letter strings and 13 days for 7-letter ones. I might not want to wait that much to get the output;) That recursive `sed` solution of yours sounds so much better. – Andras Deak -- Слава Україні Oct 16 '15 at 22:49
  • Interesting data (extrapolation). I'm not sure if `printf` is a shell built-in or not; if not, that's being executed quite a bit. And `seq` gets a fair workout too. And days vs minutes suggests it is not the fastest solution, for all it works. – Jonathan Leffler Oct 16 '15 at 23:01
  • @JonathanLeffler `printf` is a built-in, but you're right, I forgot that in each of the 900M inevitable iterations of the `while` there's a loop over digits, each using a couple of arithmetic operations and a call for `seq`. It's just not efficient enough. – Andras Deak -- Слава Україні Oct 16 '15 at 23:09
  • Thanks for this nice script. Now my problem is I dont know how to use each row with pymol to generate a new pdb file like : «for aa in sequence.txt"AAAAAAA": cmd._alt(string.lower(aa)) save AAAAAAA.pdb, all» – Grego Oct 19 '15 at 20:01
  • @Grego, firstly: thanks, but you should be looking at the answer by Jonathan Leffler. Secondly: I don't know the first thing about `pymol`, that's an all new question now, isn't it? – Andras Deak -- Слава Україні Oct 19 '15 at 20:04
  • Yes the script from Jonathan Leffler is nice too. You are right, I was asking 2 questions at once. I will make an other post. Thank you very much to both of you! – Grego Oct 19 '15 at 20:59
1

Here's a technique which produces the answer 'fairly fast'. Basically, it starts with a file containing a single newline, and the list of amino acid letters. It generates a sed script (using sed, of course) that successively adds an amino acid letter to the end of a line, prints it, removes it, and moves on to the next letter.

peptides-A.sh

printf "%s\n" A D E F G H I K L M N P Q R S T V W Y |
sed 's%.%s/$/&/p;s/&$//%' > peptides.sed
echo > peptides.0A      # Bootstrap the process
        sed -n -f peptides.sed peptides.0A > peptides.1A
        sed -n -f peptides.sed peptides.1A > peptides.2A
        sed -n -f peptides.sed peptides.2A > peptides.3A
timecmd sed -n -f peptides.sed peptides.3A > peptides.4A
timecmd sed -n -f peptides.sed peptides.4A > peptides.5A
timecmd sed -n -f peptides.sed peptides.5A > peptides.6A
timecmd sed -n -f peptides.sed peptides.6A > peptides.7A

You can think of 'timecmd' as a variant of time. It prints the start time, the command, then runs it, and then prints the end time and the elapsed time (wall-clock time only).

Sample output:

$ bash peptides-A.sh
2015-10-16 15:25:24
+ exec sed -n -f peptides.sed peptides.3A
2015-10-16 15:25:24 - elapsed: 00 00 00
2015-10-16 15:25:24
+ exec sed -n -f peptides.sed peptides.4A
2015-10-16 15:25:27 - elapsed: 00 00 03
2015-10-16 15:25:27
+ exec sed -n -f peptides.sed peptides.5A
2015-10-16 15:26:16 - elapsed: 00 00 49
2015-10-16 15:26:16
+ exec sed -n -f peptides.sed peptides.6A
2015-10-16 15:42:47 - elapsed: 00 16 31
$ ls -l peptides.?A; rm -f peptides-?A
-rw-r--r--  1 jleffler  staff           1 Oct 16 15:25 peptides.0A
-rw-r--r--  1 jleffler  staff          38 Oct 16 15:25 peptides.1A
-rw-r--r--  1 jleffler  staff        1083 Oct 16 15:25 peptides.2A
-rw-r--r--  1 jleffler  staff       27436 Oct 16 15:25 peptides.3A
-rw-r--r--  1 jleffler  staff      651605 Oct 16 15:25 peptides.4A
-rw-r--r--  1 jleffler  staff    14856594 Oct 16 15:25 peptides.5A
-rw-r--r--  1 jleffler  staff   329321167 Oct 16 15:26 peptides.6A
-rw-r--r--  1 jleffler  staff  7150973912 Oct 16 15:42 peptides.7A
$

I used the script from the question to create peptides.5B (the script was called peptides-B.sh on my disk), and checked that peptides.5A and peptides.5B were identical.

Test environment: 13" MacBook Pro, 2.7 GHz Intel Core i5, 8 GiB RAM, SSD storage.


Editing the start of the line instead of the end of the line yields approximately a 20% performance improvement.

Code:

printf "%s\n" A D E F G H I K L M N P Q R S T V W Y |
sed 's%.%s/^/&/p;s/^&//%' > peptides.sed
echo > peptides.0A      # Bootstrap the process
        sed -n -f peptides.sed peptides.0A > peptides.1A
        sed -n -f peptides.sed peptides.1A > peptides.2A
        sed -n -f peptides.sed peptides.2A > peptides.3A
timecmd sed -n -f peptides.sed peptides.3A > peptides.4A
timecmd sed -n -f peptides.sed peptides.4A > peptides.5A
timecmd sed -n -f peptides.sed peptides.5A > peptides.6A
timecmd sed -n -f peptides.sed peptides.6A > peptides.7A

Timing:

$ bash peptides-A.sh; ls -l peptides.?A; wc peptides.?A; rm -f peptides.?A
2015-10-16 16:05:48
+ exec sed -n -f peptides.sed peptides.3A
2015-10-16 16:05:48 - elapsed: 00 00 00
2015-10-16 16:05:48
+ exec sed -n -f peptides.sed peptides.4A
2015-10-16 16:05:50 - elapsed: 00 00 02
2015-10-16 16:05:50
+ exec sed -n -f peptides.sed peptides.5A
2015-10-16 16:06:28 - elapsed: 00 00 38
2015-10-16 16:06:28
+ exec sed -n -f peptides.sed peptides.6A
2015-10-16 16:18:51 - elapsed: 00 12 23
-rw-r--r--  1 jleffler  staff           1 Oct 16 16:05 peptides.0A
-rw-r--r--  1 jleffler  staff          38 Oct 16 16:05 peptides.1A
-rw-r--r--  1 jleffler  staff        1083 Oct 16 16:05 peptides.2A
-rw-r--r--  1 jleffler  staff       27436 Oct 16 16:05 peptides.3A
-rw-r--r--  1 jleffler  staff      651605 Oct 16 16:05 peptides.4A
-rw-r--r--  1 jleffler  staff    14856594 Oct 16 16:05 peptides.5A
-rw-r--r--  1 jleffler  staff   329321167 Oct 16 16:06 peptides.6A
-rw-r--r--  1 jleffler  staff  7150973912 Oct 16 16:18 peptides.7A
        1         0          1 peptides.0A
       19        19         38 peptides.1A
      361       361       1083 peptides.2A
     6859      6859      27436 peptides.3A
   130321    130321     651605 peptides.4A
  2476099   2476099   14856594 peptides.5A
 47045881  47045881  329321167 peptides.6A
893871739 893871739 7150973912 peptides.7A
943531280 943531279 7495831836 total
$

I tarted up the output from wc so it was 'properly columnar' (adding spaces, in other words). The original started going wonky when the numbers contained 8 digits.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • I believe "fairly fast" is a bit of an understatement for producing 900M lines in a quarter of an hour:) – Andras Deak -- Слава Україні Oct 16 '15 at 22:54
  • I suspect a lot of that is to do with using SSD; good old spinning magnetic platters would probably be quite a lot slower. And drat, I forgot to do `wc` on the files before I deleted them. Oh well... – Jonathan Leffler Oct 16 '15 at 22:55
  • But that doesn't apply if you send it to `/dev/null`, does it? Which is what I exclusively did. – Andras Deak -- Слава Україні Oct 16 '15 at 23:01
  • Oh — yes, that's true. I guess it is down to the use of `printf`, or just more complex arithmetic. I just realized I could simplify my `sed` code by adding the letter at the start rather than the end of the line; it would have less work to do. I'm not sure how measurable that would be (definitely a second-order effect, I think). _[…time passes…]_ Hmm; not negligible; it reduced from 49 seconds to 38 seconds, for better than 25% improvement (considerably more than I expected). The script generating command is now `sed 's%.%s/^/&/p;s/^&//%' > peptides.sed`. – Jonathan Leffler Oct 16 '15 at 23:03
  • Misclaiming again — it's a better than 20% improvement, or just under 25% improvement. (Taking 11 seconds off 49; 12 seconds would be a 25% improvement). – Jonathan Leffler Oct 16 '15 at 23:11
  • Really cool, I wish I could upvote again. It's even cooler now that I finally figured out what it does:) Thanks! – Andras Deak -- Слава Україні Oct 16 '15 at 23:30
1

crunch is available on Kali distributions

crunch 7 7 ADEFGHIKLMNPQRSTVWY
user123456
  • 364
  • 1
  • 3
  • 16