I have been struggling with awk to figure out a way to find identical patterns and add a tag at the end of them showing how many times they are present in the file. For example, if Spiroplasma_culicicola occurs 7 times, then next to the first occurrence, it should write Spiroplasma_culicicola_1, next to the second occurrence Spiroplasma_culicicola_2 next to the third occurrence Spiroplasma_culicicola_3 etc etc
However I have a fasta file that looks like this:
>Spiroplasma_taiwanense
GKGVKYKNEKIIRKEGKAAGKMTTDVIADMLTRIRNANQRFHKEVVIPGSKVKLEIANIL
KKEGFIEDFKVADDFKKDITISLKYRGKTRVIKGLKRISKPGLRVYSHATEIPQVLNGLG
IAIVSTSHGIMTDKEARQQNAGGEVLAFVW
>Spiroplasma_diminutum
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
KSKILRGDVVKVIAGSHKGKIGPVVKLSKDKKRVYVEGIVAIK-HAKPSQTDQEGGIREI
PAGVDISNVSLVDPKVKDSATRVGYKIADGKKVRIAKKSGSEVK-MIQNESRLKVADNSG
>Spiroplasma_diminutum
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
...
so I would like to add the "tag", the number showing occurences only next to the headers! therefore the above file should look like:
>Spiroplasma_taiwanense_1
GKGVKYKNEKIIRKEGKAAGKMTTDVIADMLTRIRNANQRFHKEVVIPGSKVKLEIANIL
KKEGFIEDFKVADDFKKDITISLKYRGKTRVIKGLKRISKPGLRVYSHATEIPQVLNGLG
IAIVSTSHGIMTDKEARQQNAGGEVLAFVW
>Spiroplasma_diminutum_1
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
KSKILRGDVVKVIAGSHKGKIGPVVKLSKDKKRVYVEGIVAIK-HAKPSQTDQEGGIREI
PAGVDISNVSLVDPKVKDSATRVGYKIADGKKVRIAKKSGSEVK-MIQNESRLKVADNSG
>Spiroplasma_diminutum_2
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
...
Based on a previous answered question I figured that I should use awk, with sth like this: awk '$1 ~ /^>/ {gsub(" ", "", $0); a[$0]++; print $0"_"a[$0]}'
(code stolen from here:find the number of occurences and add it next to the pattern)
However I cant find a way to save the changes in the file (for example like sed with -i) and I cant redirect it to a new file cause then it simply prints/saves the headers.
Any ideas?
thanks P