I took a look at the idea of (ab?)using brace expansion:
p='{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}'
eval echo $p$p$p$p$p$p$p
Using such direct approach with all in one simple step of 7 $p
's is just too much for bash.
For no apparent reason, it eats all the memory (measurements with time show no other memory value increase so quickly).
The command is quite quick and amazingly simple for up to about 4 $p
's, just two lines:
p='{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}'
eval echo $p$p$p$p
however, memory usage grows quite quickly. At a depth of 6 $p
repetitions the process eats more than 7.80 Gigs of memory.
The eval part also helps to increases execution time and memory usage.
An alternative approach was needed. So, I tried to make each step of expansion on its own, taking advantage of the concept that Jonathan Leffler used. For each line in the input, write 19 lines, each with an additional letter to the output. I found out that any eval is an important memory drain (not shown here).
Bash
A simpler bash filter is:
bashfilter(){
while read -r line; do
printf '%s\n' ${line}{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
done </dev/stdin
}
Which could be used for several levels of processing as:
echo | bashfilter | bashfilter | bashfilter
It just needs to repeat as many filter steps as letters per line are needed.
With this simpler approach: Memory was no longer an issue.
Speed, however, got worse.
Leffler SED
Just to compare, use it as a measuring stick, I implemented Leffler idea:
# Building Leffler solution:
leftext="$(<<<"${list}" sed -e 's/,/\n/g')" # list into a column.
leftext="$(<<<"${leftext}" sed -e 's%.%s/$/&/p;s/&$//%')" # each line ==> s/$/?/p;s/?$//
# echo -e "This is the leffilter \n$leftext"
leffilter(){ sed -ne "$leftext"; } # Define a function for easy use.
And is the leffilter which could be used recursively to get as many letters per line as needed:
echo | leffilter | leffilter | leffilter
Leffler solution did a letter insert and a letter erase.
SED
The amount of work could be cut down by not needing to erase one letter. We could store the original pattern space in the "hold space".
Then, just copy the first line to the hold space (h), and keep restoring it (g) and inserting just one letter.
# Building a sed solution:
sedtext="$(<<<"${list}" sed -e 's/,/\n/g')" # list into a column.
sedtext="$(<<<"${sedtext}" sed -e 's%[A-Z]%g;s/$/&/p;%g')" # s/$/?/p
sedtext="$(<<<"${sedtext}" sed -e '1 s/g/h/' )" # 1st is h
sedfilter(){ sed -ne "$sedtext"; } # Define a function for easy use.
Doing this makes speed better, about 1/3 (33%) lower. Or 1.47 times faster.
AWK
Finally, I present an AWK solution. I wrote it earlier, but is the fastest.
And so I present it as the last option. The best till someone presents a better one :-)
# An AWK based solution:
awkfilter(){ awk 'BEGIN { split( "'"$list"'",l,",");}
{ for (i in l) print $0 l[i] }'
}
Yep, just two lines. It is half the time or twice as fast as Leffler solution.
The full test script used follows. It re-calls itself to enable the use of the external time. Make sure it is an executable file with bash.
#!/bin/bash
TIMEFORMAT='%3lR %3lU %3lS'
list="A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y"
# A pure bash based solution:
bashfilter(){
while read -r line; do
printf '%s\n' ${line}{A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
done </dev/stdin
}
# Building Leffler solution:
leftext="$(<<<"${list}" sed -e 's/,/\n/g')" # list into a column.
leftext="$(<<<"${leftext}" sed -e 's%.%s/$/&/p;s/&$//%')" # each line ==> s/$/?/p;s/?$//
# echo -e "This is the lef filter \n$leftext"
leffilter(){ sed -ne "$leftext"; } # Define a function for easy use.
# Building a sed solution:
sedtext="$(<<<"${list}" sed -e 's/,/\n/g')" # list into a column.
sedtext="$(<<<"${sedtext}" sed -e 's%[A-Z]%g;s/$/&/p;%g')" # each letter ==> s/$/?/p
sedtext="$(<<<"${sedtext}" sed -e '1 s/g/h/' )" # First command is 'h'.
# echo -e "This is the sed filter \n$sedtext"
sedfilter(){ sed -ne "$sedtext"; } # Define a function for easy use.
# An AWK based solution:
awkfilter(){ awk 'BEGIN { split( "'"$list"'",l,",");}
{ for (i in l) print $0 l[i] }'
}
# Execute command filter
docommand(){
local a count="$1" filter="$2" peptfile="$3"
for (( i=0; i<count; i++ )); do
case $filter in
firsttry) a+=("{$list}"); ;;
*) a+=("| $filter"); ;;
esac
done
[[ $filter == firsttry ]] && a+=('| sed '"'"'s/ /\n/'"'" )
[[ -n $peptfile ]] && peptfile="$peptfile.$count"
eval 'echo '"$(printf '%s' "${a[@]}")" > "${peptfile:-/dev/null}";
}
callcmd(){
tf='wall:%e s:%S u:%U (%Xtext+%Ddata %F %p %t %Kmem %Mmax)'
printf '%-12.12s' "$1" >&2
/usr/bin/time -f "$tf" "$0" "$repeats" "$1" "$2"
}
nofile=1
if (( $#>=2 )); then
docommand "$1" "$2" "$3"; exit 0
else
for (( i=1; i<=6; i++)); do
repeats=$i; echo "repeats done = $repeats"
if ((nofile)); then
callcmd firsttry
callcmd bashfilter
callcmd leffilter
callcmd sedfilter
callcmd awkfilter
else
callcmd firsttry peptidesF
callcmd bashfilter peptidesB
callcmd leffilter peptidesL
callcmd sedfilter peptidesS
callcmd awkfilter peptidesA
fi
done
fi
RESULTS
The external program /usr/bin/time was used (instead of the bash builtin time) to be able to measure the memory used. It was important in this problem.
With: tf='wall:%e s:%S u:%U (%Xtext+%Ddata %F %p %t %Kmem %Mmax)'
The results for 7 loops and true file output are easy to find with the above script, but I feel that filling about 21 gigs of disk space was just too much.
The results up to 6 loops were:
repeats done = 1
firsttry wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
bashfilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1552max)
leffilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
sedfilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
awkfilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)
:
repeats done = 2
firsttry wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
bashfilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1552max)
leffilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)
sedfilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
awkfilter wall:0.01 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)
:
repeats done = 3
firsttry wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1796max)
bashfilter wall:0.07 s:0.00 u:0.05 (0text+0data 0 0 0 0mem 1552max)
leffilter wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
sedfilter wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1560max)
awkfilter wall:0.02 s:0.00 u:0.00 (0text+0data 0 0 0 0mem 1556max)
:
repeats done = 4
firsttry wall:0.28 s:0.01 u:0.26 (0text+0data 0 0 0 0mem 25268max)
bashfilter wall:0.96 s:0.03 u:0.94 (0text+0data 0 0 0 0mem 1552max)
leffilter wall:0.13 s:0.00 u:0.12 (0text+0data 0 0 0 0mem 1560max)
sedfilter wall:0.10 s:0.00 u:0.08 (0text+0data 0 0 0 0mem 1560max)
awkfilter wall:0.09 s:0.00 u:0.07 (0text+0data 0 0 0 0mem 1560max)
:
repeats done = 5
firsttry wall:4.98 s:0.36 u:4.76 (0text+0data 0 0 0 0mem 465100max)
bashfilter wall:20.19 s:0.81 u:20.18 (0text+0data 0 0 0 0mem 1552max)
leffilter wall:2.43 s:0.00 u:2.50 (0text+0data 0 0 0 0mem 1556max)
sedfilter wall:1.83 s:0.01 u:1.87 (0text+0data 0 0 0 0mem 1556max)
awkfilter wall:1.49 s:0.00 u:1.54 (0text+0data 0 0 0 0mem 1560max)
:
repeats done = 6
firsttry wall:893.06 s:30.04 u:105.22 (0text+0data 402288 0 0 0mem 7802372m)
bashfilter wall:365.13 s:14.95 u:368.09 (0text+0data 0 0 0 0mem 1548max)
leffilter wall:51.90 s:0.09 u:53.91 (0text+0data 6 0 0 0mem 1560max)
sedfilter wall:35.17 s:0.08 u:36.67 (0text+0data 0 0 0 0mem 1556max)
awkfilter wall:25.60 s:0.06 u:26.77 (0text+0data 1 0 0 0mem 1556max)