I am trying to alter a field in the header lines of a DNA sequence file that is thousands of lines long. Specifically, I want to change the first field of each header (compX_seqY); header lines ALWAYS start with ">".
An example of just the first two sequences:
#cat example
>comp0_seq1 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
...
I would like to add an "A" to the end of the first column only, for ALL lines starting with ">", e.g.:
comp0_seq1A
The rest of the header line, and the sequence lines that follow it, should be printed unchanged until the next ">" line is reached (and repeat).
I want the output to look like this:
>comp0_seq1A 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2A 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
...
I tried this first:
awk '$1=$1"A"' example
>comp0_seq1A 444 [12:23]
AGAGGACACA
GATCCAACATAA
AGASCACA
>comp0_seq2A 333 [12:32:599:1]
GTCGATCA
CYAACYA
CCCCAA
A
A
It adds an A to the first field of every line, so not quite.
Then I tried this, using a regex to act only on lines starting with ">":
# awk '/^>/ {print $1=$1"A";getline;print $0}' example
>comp0_seq1A
AGAGGACAC
>comp0_seq2A
GTCGATC
But that prints only the first field of the header and only the first line AFTER the match. So, how do I print the whole header plus all lines AFTER the match/replacement, until the next ">"? I tried to use 'next', but I guess I don't understand how to use it in this context.
Any advice? I know I am close and am banging my head on my keyboard.
Thx, LP.
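
A minimal sketch of one way to get the desired output, not necessarily the only or best one. It assumes the header fields are separated by whitespace and that it is acceptable for runs of spaces or tabs in a header to collapse to single spaces, since assigning to $1 makes awk rebuild the line using the output field separator. The sample file below is just the two records from the question.

```shell
# Build a small sample file matching the example in the question.
cat > example <<'EOF'
>comp0_seq1 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
EOF

# On lines starting with ">", append "A" to the first field.
# The trailing pattern 1 is always true and has no action,
# so every line (modified or not) is printed by the default action.
out=$(awk '/^>/ { $1 = $1 "A" } 1' example)
printf '%s\n' "$out"
```

The first attempt changed every line because `$1=$1"A"` stood alone as a pattern with no `/^>/` guard, so the assignment ran on every record. The second attempt printed only the value of the assignment (the new first field) and then used `getline` to fetch exactly one following line; letting awk's normal record loop print the sequence lines avoids needing `getline` or `next` at all.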