Using awk to find lines that start with ">" and append the number of occurrences of the pattern

Question

I have been struggling with awk to find identical patterns and append a tag to each one showing which occurrence it is in the file. For example, if Spiroplasma_culicicola occurs 7 times, the first occurrence should become Spiroplasma_culicicola_1, the second Spiroplasma_culicicola_2, the third Spiroplasma_culicicola_3, and so on.

My input is a FASTA file that looks like this:

>Spiroplasma_taiwanense
GKGVKYKNEKIIRKEGKAAGKMTTDVIADMLTRIRNANQRFHKEVVIPGSKVKLEIANIL
KKEGFIEDFKVADDFKKDITISLKYRGKTRVIKGLKRISKPGLRVYSHATEIPQVLNGLG
IAIVSTSHGIMTDKEARQQNAGGEVLAFVW
>Spiroplasma_diminutum
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
KSKILRGDVVKVIAGSHKGKIGPVVKLSKDKKRVYVEGIVAIK-HAKPSQTDQEGGIREI
PAGVDISNVSLVDPKVKDSATRVGYKIADGKKVRIAKKSGSEVK-MIQNESRLKVADNSG
>Spiroplasma_diminutum
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
...

I would like to add the tag (the occurrence number) only to the header lines, so the file above should become:

>Spiroplasma_taiwanense_1
GKGVKYKNEKIIRKEGKAAGKMTTDVIADMLTRIRNANQRFHKEVVIPGSKVKLEIANIL
KKEGFIEDFKVADDFKKDITISLKYRGKTRVIKGLKRISKPGLRVYSHATEIPQVLNGLG
IAIVSTSHGIMTDKEARQQNAGGEVLAFVW
>Spiroplasma_diminutum_1
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
KSKILRGDVVKVIAGSHKGKIGPVVKLSKDKKRVYVEGIVAIK-HAKPSQTDQEGGIREI
PAGVDISNVSLVDPKVKDSATRVGYKIADGKKVRIAKKSGSEVK-MIQNESRLKVADNSG
>Spiroplasma_diminutum_2
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
...

Based on a previously answered question, I figured that I should use awk, with something like this: awk '$1 ~ /^>/ {gsub(" ", "", $0); a[$0]++; print $0"_"a[$0]}'

(code stolen from here: find the number of occurences and add it next to the pattern)

However, I can't find a way to save the changes in the file (as sed does with -i), and I can't redirect the output to a new file because it then contains only the headers.

Any ideas?

thanks P

so the issue is *However I cant find a way to save the changes in the file* ? and no issues with pattern matching? — RomanPerekhrest, Mar 21 '17 at 13:20
yes because the above awk command returns the following: Spiroplasma_taiwanense_1 Spiroplasma_diminutum_1 Spiroplasma_diminutum_2 — Panos, Mar 21 '17 at 13:21
to just write the output to a new file use output redirection like here http://stackoverflow.com/questions/14660079/how-to-save-the-output-of-this-awk-command-to-file — RomanPerekhrest, Mar 21 '17 at 13:30
GNU awk (since version 4.1.0) has the `-i inplace` option for in-place file editing. https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html — Jose Ricardo Bustos M., Mar 21 '17 at 13:37
@Panos wrt `However I cant find a way to save the changes in the file (for example like sed with -i) and I cant redirect it to a new file cause then it simply prints/saves the headers.` - it's not clear if your problem is a) you don't know how to save the output back to the original file or b) the output is only headers and not the whole text. Which of "a" or "b" is the problem you're asking for help with? — Ed Morton, Mar 21 '17 at 13:53
Thanks guys, it's what Tom wrote below: I wanted to change the headers but leave the rest of the text unchanged. Thanks to everyone, what would have been days or weeks of work will now take one day or less :) — Panos, Mar 21 '17 at 14:17

Tom Fenech · Accepted Answer · 2017-03-21T13:58:06.613

It seems the problem is that you don't understand the code you have found elsewhere:

awk '$1 ~ /^>/ {gsub(" ", "", $0); a[$0]++; print $0"_"a[$0]}'

By the looks of things, it performs the substitution that you want and prints the lines that start with >.

So the missing part is to print the rest of the lines without making any modification.

You could do it like this:

awk '$1 ~ /^>/ { gsub(" ", "", $0); a[$0]++; $0 = $0"_"a[$0] } { print }'

That is, change the print to an assignment in the first block and add an unconditional second block which always prints everything.
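For readers new to awk, the same two-block program can be laid out over several lines with comments. This is only an illustrative sketch: the `sample` and `result` variable names and the short sequences are made up for the example.

```shell
# Build a tiny sample FASTA file (made-up sequences, for illustration only).
sample=$(mktemp)
printf '%s\n' '>seq_A' 'ACGT' '>seq_B' 'GGCC' '>seq_A' 'TTAA' > "$sample"

result=$(awk '
    # Block 1: runs only for header lines, i.e. lines starting with ">".
    $1 ~ /^>/ {
        gsub(" ", "", $0)      # strip spaces from the header
        a[$0]++                # count how often this header has been seen
        $0 = $0 "_" a[$0]      # rewrite the line with the count appended
    }
    # Block 2: no condition, so it runs for every line and prints it.
    { print }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The first block only rewrites header lines. Because the second block has no condition, it prints every line, which is why the sequence lines now appear in the output as well.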

The code can be further simplified by combining the increment with the assignment and by changing { print } to the common shorthand: a bare 1, which is an always-true condition with the default action, print.

As mentioned in the comments, the call to gsub can be improved by passing a regex literal as the first argument, as opposed to a string which must be converted to a regex before use. It can also be shortened by removing the final argument $0 which is the default.

awk '$1 ~ /^>/ { gsub(/ /, ""); $0 = $0 "_" ++a[$0] } 1'
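As a rough illustration of what the gsub call contributes, the sketch below feeds the one-liner two headers that differ only by a trailing space. The sample data and the `result` variable are invented for this example. With the spaces stripped, both headers should be counted under the same name.

```shell
# Two headers that differ only by a trailing space (made-up sample data).
result=$(printf '%s\n' '>Spiroplasma_diminutum' 'NRLEKQ' '>Spiroplasma_diminutum ' 'GQKPLV' |
    awk '$1 ~ /^>/ { gsub(/ /, ""); $0 = $0 "_" ++a[$0] } 1')

printf '%s\n' "$result"
```

Note that gsub removes every space in a header line, so a header that carries a description after the name would be collapsed into a single word.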

To overwrite the original file, just redirect to a temporary file then overwrite the original:

awk '...' input > tmp && mv tmp input
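A slightly more defensive version of the same idea might use mktemp so the temporary name cannot clash with an existing file. This is a sketch under assumptions: `input` here is a throwaway file with made-up contents standing in for the real FASTA file, and `tmp` and `result` are names chosen for the example.

```shell
# Stand-in for the real FASTA file (made-up contents).
input=$(mktemp)
printf '%s\n' '>seq_A' 'ACGT' '>seq_A' 'TTAA' > "$input"

# Write to a temporary file first; replace the original only if awk succeeded.
tmp=$(mktemp)
awk '$1 ~ /^>/ { gsub(/ /, ""); $0 = $0 "_" ++a[$0] } 1' "$input" > "$tmp" &&
    mv "$tmp" "$input"

result=$(cat "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Redirecting straight back to the input (awk '...' input > input) would not work, because the shell truncates the file before awk reads it. One caveat of this sketch: mktemp creates the temporary file with restrictive permissions, so after mv the file may no longer have the permissions of the original.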

Or with GNU awk, as mentioned in the comments:

awk -i inplace '...' input

Yes Tom, I'm sorry, I don't really have any experience with awk, but thanks for the help! — Panos, Mar 21 '17 at 13:49
