I have a FASTA file (a plain-text file) containing sequences like:

file.fasta

>seq_1
AGCTAATACTTGTCCACGTTGTACTTCTTCACGAGAAACACCACGTAATAAAGCACCGAT
GTTATCTCCAGCTTCAGCGTAATCTAATAATTTACGGAACATTTCTACACCTGTAACTGT
AGTTTTAGCTGGCTCTTCAGTTAAACCGATGATTTCAACTTCTTCACCAACTTTAACTTG
TCCACGCTCAACACGTCCAGTTGCAACTGTACCACGACCAGTGATTGAGAATACGTCCTC
AACTGGCATCATGAATGGTTTGTCAGAATCACGTTCTGGAGTTGGGATGTACTCATCAAC
TGCGTTCATTAATTCCATGATTTTTTCTTCGTACTCTTCAACGCCTTCTAATGCTTTTAA
AGCAGATCCAGCGATTACAGGTACATCGTCACCAGGGAAGTCATATTCAGATAATAAGTC
ACGAACTTCC
>seq_2
AGCTAATACTTGTCCACGTTGTACTTCTTCACGAGAAACACCACGTAATAAAGCACCGAT
GTTATCTCCAGCTTCAGCGTAATCTAATAATTTACGGAACATTTCTACACCTGTAACTGT
AGTTTTAGATGGCTCTTCAGTTAAACCGATGATTTCAACTTCTTCACCAACTTTAACTTG
TCCACGCTCAACACGTCCAGTTGCAACTGTACCACGACCAGTGATTGAGAATACGTCCTC
AACTGGCATCATGAATGGTTTGTCAGAATCACGTTCTGGAGTTGGGATGTACTCATCAAC
TGCGTTCATTAATTCCATGATTTTATCTTCGTACTCTTCAACGCCTTCTAATGCTTTTAA
AGCAGATCCAGCGATTACAGGTACATCGTCACCAGGGAAGTCATATTCAGATAATAAGTC
ACGAACTTCC
>seq_3
AGCTAATACTTGTCCACGTTGTACTTCTTCACGAGAAACACCACGTAATAAAGCACCGAT
GTTATCTCCAGCTTCAGCGTAATCTAATAATTTACGGAACATTTCTACACCTGTAACTGT
AGTTTTAGATGGCTCTTCAGTTAAACCGATGATTTCAACTTCTTCACCAACTTTAACTTG
TCCACGCTCAACACGTCCAGTTGCAACTGTACCACGACCAGTGATTGAGAATACGTCCTC
AACTGGCATCATGAATGGTTTGTCAGAATCACGTTCTGGAGTTGGGATGTACTCATCAAC
TGCATTCATTAATTCCATGATTTTATCTTCGTACTCTTCAACGCCTTCTAATGCTTTTAA
AGCAGATCCAGCGATTACAGGTACATCGTCACCAGGGAAGTCATATTCAGATAATAAGTC
ACGAACTTCC

............
>seq_n
AGCAGATCCAGCGATTACAGGTACATCGTCACCAGGGAAGTCATATTCAGATAATAAGTC
..............

I want to calculate the average length of the sequences, skipping the header lines that start with >seq_. My code to obtain the length of each sequence is:

array_length=$(awk '/^>/ {print n $0; n="\n"}; !/^>/ {printf "%s", $0} END {print ""}' My_file.fasta | awk '!/^>/ {print length(), $0}' | sort -n| awk '{print $1}')

Up to here everything is OK. I get the first column, which corresponds to the length of each sequence:

echo "$array_length"

203
207
222
231
232
243
255
258
261
268
279
291
307
316

.....

161581
208146
242398
259601
288468
301866
427209
531340
557978
840257

The number of values in the array can vary; here I show only part of them.
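For reference, the same sorted list of lengths can probably be produced in a single awk pass. This is only a sketch under assumptions: the two-record input below is made up for the demonstration, and the variable name `lengths` is hypothetical.

```shell
# Hypothetical two-record FASTA input, used only for this demonstration
lengths=$(printf '%s\n' '>seq_1' 'AGCTAATACT' 'TGTCC' '>seq_2' 'AGCTAAT' |
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }  # header: emit previous record length
       { len += length($0) }                                   # sequence line: add its length
       END { if (seen) print len }' |                          # emit the last record
  sort -n)

echo "$lengths"
```

With a real file, the `printf` would be replaced by passing the file name to awk.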

My problem is that I want to calculate the average of $array_length (the sum of all the numbers divided by the number of elements).
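A minimal sketch of that calculation, assuming `$array_length` holds one number per line as in the output above. The four sample values and the variable name `average` are placeholders for illustration.

```shell
# Hypothetical sample values standing in for the real $array_length
array_length=$(printf '%s\n' 203 207 222 231)

# Add up every non-empty line, count them, and divide at the end;
# nothing is printed if there were no values at all
average=$(printf '%s\n' "$array_length" |
  awk 'NF { sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }')

echo "$average"
```

The division is done inside awk because shell arithmetic with `$(( ))` only handles integers.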

A second question is how to take the first and the last element of the array. To do that, I just add tail -1 or head -n 1 to the end of the pipeline:

awk '/^>/ {print n $0; n="\n"}; !/^>/ {printf "%s", $0} END {print ""}' My_file.fasta | awk '!/^>/ {print length(), $0}' | sort -n| awk '{print $1}' | tail -1
awk '/^>/ {print n $0; n="\n"}; !/^>/ {printf "%s", $0} END {print ""}' My_file.fasta | awk '!/^>/ {print length(), $0}' | sort -n| awk '{print $1}' | head -n 1

I know that with a file I can do it like this:

cat file.txt | tail -1
cat file.txt | head -n 1

But I don't want to run the same code twice to obtain $small_one (203) and $big_one (840257). I just want to take the first and last elements of the variable $array_length, like the one shown here. How can I do it?
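One possible approach, sketched here with made-up sample values, is to feed the already-computed variable to `head` and `tail` instead of re-running the whole pipeline. It assumes `$array_length` is sorted and holds one value per line.

```shell
# Hypothetical sorted values standing in for the real $array_length
array_length=$(printf '%s\n' 203 207 222 840257)

# Re-read the variable; the FASTA file is not parsed again
small_one=$(printf '%s\n' "$array_length" | head -n 1)
big_one=$(printf '%s\n' "$array_length" | tail -n 1)

echo "smallest: $small_one, largest: $big_one"
```

The quotes around `"$array_length"` matter: without them the shell would split the value on whitespace, and `printf` would only see separate words rather than the original lines.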

Cyrus
abraham
  • Note that for a large file, `cat file.txt | tail -n 1` is **vastly** slower than `tail -n 1 file.txt` or `tail -n 1 – Charles Duffy Feb 19 '21 at 00:13
  • Anyhow -- as I read it, this question isn't about calculating averages so much as it's about efficiently getting both the first and last line of a file while reading it only once -- correct? – Charles Duffy Feb 19 '21 at 00:16
  • (if those are two separate questions, they should be **asked** as two separate questions; putting two unrelated questions in one Stack Overflow question is against the rules, and means that one question can be closed as "too broad") – Charles Duffy Feb 19 '21 at 00:18
  • ...that said, as a quick answer for the second question: `{ IFS= read -r first; last=$(tail -n 1); } – Charles Duffy Feb 19 '21 at 00:24
  • @biqarboy, btw -- the edit you proposed is itself a good one, but the first edit to a closed question puts it into the review queue for reopening, so ideally the first edit to a closed question should be sufficient to bring it within site rules so it can actually be reopened; otherwise, that one "free" reopen-queue chance has potential to be wasted. – Charles Duffy Feb 19 '21 at 00:26
  • @Charles Duffy, Thank you! I got it now. I actually did not notice that this post was closed. It's really bad wasting `re-open queue slot`. Sorry for the inconvenience. – biqarboy Feb 19 '21 at 05:01
