
I am trying to alter a field in the 'header' lines of a file of DNA sequences that is thousands of lines long. Specifically, I want to change the first field of each header (compX_seqY), which ALWAYS starts with ">".

An example of just the first two sequences:

 #cat example

 >comp0_seq1 444 [12:23]
 AGAGGACAC
 GATCCAACATA
 AGASCAC
 >comp0_seq2 333 [12:32:599:1]
 GTCGATC
 CYAACY
 CCCCA
 ...

I would like to add an "A" to the end of the first column only, for ALL lines starting with ">", e.g.:

comp0_seq1A

Then print the rest of the line, followed by the next lines (the sequence), until the next ">" line is reached (and repeat).

I want the output to look like this :

>comp0_seq1A 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2A 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
...

I tried this first:

awk '$1=$1"A"' example

>comp0_seq1A 444 [12:23]
AGAGGACACA
GATCCAACATAA
AGASCACA
>comp0_seq2A 333 [12:32:599:1]
GTCGATCA
CYAACYA
CCCCAA
A
A

It adds an A to the first field of every line, so not quite.

Then I tried this, using a regex to change only the lines starting with ">":

# awk '/^>/ {print $1=$1"A";getline;print $0}' example
>comp0_seq1A
AGAGGACAC
>comp0_seq2A
GTCGATC

But that only prints the first line AFTER the match. So, how do I print all lines AFTER the match/replacement, up to the next ">"? I tried to use 'next', but I guess I don't understand how to use it in this context.

Any advice? I know I am close and am banging my head on my keyboard.

Thx, LP.

LP_640

1 Answer

You've almost got it. You're just overthinking things with your getline.

In awk, the following should work:

$ awk '/^>/ {$1=$1"A"} 1' file.txt

This works by running the command in curly braces on all lines that match the regular expression ^>. The 1 at the end is awk short-hand that says "print the current line".
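If the trailing 1 looks cryptic, the same logic can be spelled out with an explicit print. The following is a minimal sketch rather than part of the original one-liner; it feeds a shortened copy of the question's sample data through printf, and the variable names input and result are only illustrative.

```shell
# Shortened sample input from the question, held in a variable for the demo.
input=$(printf '%s\n' '>comp0_seq1 444 [12:23]' 'AGAGGACAC' '>comp0_seq2 333 [12:32:599:1]' 'GTCGATC')

# Same behaviour as: awk '/^>/ {$1=$1"A"} 1'
#   first rule:  on header lines only, append "A" to field 1
#   second rule: no pattern, so it runs on every line and prints it
result=$(printf '%s\n' "$input" | awk '/^>/ { $1 = $1 "A" } { print }')
printf '%s\n' "$result"
```

One thing to be aware of: assigning to $1 makes awk rebuild the line using the output field separator (a single space by default), so any tabs or runs of spaces between fields on a header line are collapsed. Sequence lines are not modified and pass through untouched.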

Another option for a substitution this simple would be to use sed:

$ sed '/^>/s/ /A /' file.txt

This works by searching for lines that match the same regex, then replacing the first space on each such line with the string "A " (an A followed by a space). sed will print each line by default, so no explicit print is required.
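Because this version replaces the first space, it depends on the header actually containing one. A small sketch of that edge case, assuming a hypothetical header (>comp0_seq3) that consists of the ID only; nothing like it appears in the question's data:

```shell
# A header with extra fields, as in the question.
with_fields=$(printf '%s\n' '>comp0_seq1 444 [12:23]' | sed '/^>/s/ /A /')

# A hypothetical header with no space at all: the s command finds
# nothing to replace, so the line is printed unchanged.
id_only=$(printf '%s\n' '>comp0_seq3' | sed '/^>/s/ /A /')

printf '%s\n%s\n' "$with_fields" "$id_only"
```

If every header in the file carries extra fields, as in the sample, this difference never shows up.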

Or if you prefer something that substitutes the first "field" rather than the first "field separator", this can work:

$ sed 's/^\(>[^ ]*\)/\1A/' file.txt

By default, sed regexes are "BRE" (basic regular expressions), so the grouping parentheses need to be escaped. The \1 is a reference to the first (in this case "only") parenthesized group in the search regex.
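If the escaped parentheses are hard to read, the same substitution can be written with extended regular expressions. This is a sketch that assumes your sed accepts the -E option (GNU and BSD sed both do); the variable names are only illustrative.

```shell
header='>comp0_seq1 444 [12:23]'

# BRE form from above: group delimiters are written \( and \).
bre=$(printf '%s\n' "$header" | sed 's/^\(>[^ ]*\)/\1A/')

# ERE form: plain ( and ) delimit the group; \1 still refers to it.
ere=$(printf '%s\n' "$header" | sed -E 's/^(>[^ ]*)/\1A/')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should print the same modified header.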

ghoti
  • Good answer. With sed, I'd write `sed '/^>[^[:blank:]]\+/s//&A/'` , using the "blank" character class in case there are tabs in that file. – glenn jackman Nov 07 '16 at 17:07
  • Thanks for all the options. So simple using sed/substitute to replace the space with the additional character. – LP_640 Nov 07 '16 at 18:18
  • @glennjackman - ah, great suggestion to use `&` as well. I'll leave my answer as-is, as it seems to work with the OP's data, but thank you for the comment; it'll undoubtedly help other folks who may have similar-but-not-identical problems. – ghoti Nov 07 '16 at 18:28
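To illustrate the point about tabs raised in the comments, here is a small sketch using a tab-separated header. It is a variation on the commented command, not a quote of it: it uses * instead of \+ because \+ is a GNU extension in basic regular expressions, and the tab-separated sample line is made up for the demo.

```shell
# Hypothetical tab-separated header; printf expands \t into real tabs.
tabbed=$(printf '>comp0_seq1\t444\t[12:23]\n' | sed 's/^>[^[:blank:]]*/&A/')

# & stands for the whole match (">" plus everything up to the first
# space or tab), so "A" lands right after the ID and the tabs survive.
printf '%s\n' "$tabbed"
```

Unlike the awk approach, this leaves the original separators in place. With * the pattern also matches a bare ">" line, which would become ">A".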