awk combine info from two files (fasta file header)

Question

I know there are many similar questions, and I had read through many of them. But I still can't make my code work. Could somebody point the problem out for me please? Thanks!

(base) $ head Sample.pep2
>M00000032072 gene=G00000025773 seq_id=ChrM type=cds
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSRFGIY*
>M00000032073 gene=G00000025774 seq_id=ChrM type=cds
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGFSCGTI
EGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFSITMFSYA
GIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRWAAGRLPRISKFGGPKAVLRAP

$ head -n 3 mRNA.function
M00000032074 locus=g17091;makerName=TCONS_00021197.p2
M00000032073 Dbxref=MobiDBLite:mobidb-lite;locus=g17092;makerName=TCONS_00021198.p3
M00000032072 Dbxref=MobiDBLite:mobidb-lite;locus=g17093;makerName=TCONS_00021199.p1

I would like the output

>M00000032072 gene=G00000025773 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17093;makerName=TCONS_00021199.p1
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSRFGIY*
>M00000032073 gene=G00000025774 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17092;makerName=TCONS_00021198.p3
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGFSCGTI
EGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFSITMFSYA
GIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRWAAGRLPRISKFGGPKAVLRAP

and my command is awk 'NR==FNR{id[$1]=$2; next} /^>/ {print $0=$0,id[$1]}' mRNA.function Sample.pep2. But it doesn't do the job... I don't know where it is wrong...

for the 2nd file `$1` includes a leading `>` which means you won't find any matches in `id[]`; you'll need to strip off the `>` to access a matching entry in `id[]`; alternatively, when processing the 1st file you could prepend a `>` to the index, eg, `id[">"$1]=$2` and then `id[$1]` should work for the 2nd file — markp-fuso, May 04 '22 at 20:14

score 2 · Answer 1 · answered May 04 '22 at 20:24

Here is a perl solution.

perl -lpe 'BEGIN { %id_to_function = map { /^(\S+)\s+(.*)/ } `cat mRNA.function`; } s{^>(\S+)(.*)}{>$1$2 $id_to_function{$1}};' sample.pep2

Prior to read the fasta file, the code executes the BEGIN { ... } block. There, the file with ids and functions is read into the hash %id_to_function.
In the main body of the code, the substitution operator s{...}{...} appends to the fasta header the function for the corresponding id using the hash lookup $id_to_function{$1}.
$1 and $2 are the first and second capture groups, correspondingly, that were captured with parentheses in the preceding regex: ^>(\S+)(.*).

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

Thanks!! I'm new to perl, but your clear explanation could be my start to perl! Thank you! — zzz, May 05 '22 at 10:24

score 1 · Accepted Answer · answered May 04 '22 at 20:04

1

Current code is close, just need to a) match the first field (Sample.pep2) with the corresponding array entry (if it exists) and b) make sure we print all input lines:

awk 'NR==FNR{id[$1]=$2; next} /^>/ {key=substr($1,2); if (key in id) $0=$0 OFS id[key]} 1' mRNA.function Sample.pep2

This generates:

>M00000032072 gene=G00000025773 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17093;makerName=TCONS_00021199.p1
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSRFGIY*
>M00000032073 gene=G00000025774 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17092;makerName=TCONS_00021198.p3
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGFSCGTI
EGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFSITMFSYA
GIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRWAAGRLPRISKFGGPKAVLRAP

answered May 04 '22 at 20:04

markp-fuso

28,790
4
16
36

Thanks!! I like your comment above by editing `id[$1]=$2` to `id[">"$1]=$2`, and it does the trick! In this answer, you use this part `if (key in id)`, and I understand what it means. Just a curious question: if we don't have this `if` part, will the awk command not trying to find the match? :) – zzz May 05 '22 at 10:36
Also, I notice that if it's `awk 'NR==FNR{id[">"$1]=$2; next} /^>/ {print $0=$0,id[$1]}' mRNA.function Sample.pep2`, I will only get the header but no sequence. Why does that happen? – zzz May 05 '22 at 11:10
1

yes, if you remove the `if (key in id)` and just do `$0=$0 FS id[key]` it will still do the lookup but will also print extra space on the end of the line in the case `id[key]` does not exist; your proposed code modification does not print the sequence because you're only printing lines that match `/^>/`, notice in the answer the standalone `1` - `/^>/ { if ...} 1'`, the standalone `1` is shorthand for `always true so print current line` – markp-fuso May 05 '22 at 12:41
1

@zzz one other item ... anytime you reference `id[key]`, if the array entry does not exist it will be created (albeit with no value assigned to the array entry); by prefacing the reference with `if (key in id)` we first test if the array entry exists and if it does not then we skip the reference to `id[key]` and so we don't create the (empty) array entry; by testing for the existence of the index *first* we can save on memory usage (not a big issue in this case) and insure follow-on code does not wrongly see the array entry ... – markp-fuso May 05 '22 at 13:37
1

for example, this code - `awk 'BEGIN {print id[1]; if (1 in id) printf "it exists with value of .%s.", id[1]}'` - never explicitly assigns a value to `id[1]` but the mere reference to it (`print id[1]`) actually creates an entry so the follow-on conditional (`if (1 in id)`) evaulates as true – markp-fuso May 05 '22 at 13:40
Thanks a lot! This clarification of difference in codes is really helpful! – zzz May 05 '22 at 19:25

score 1 · Answer 3 · answered May 05 '22 at 09:17

1

awk '
NR==FNR {a[">"$1]=$2}
NR!=FNR && a[$1] != "" {$0=$0" "a[$1]}
NR!=FNR' mRNA.function Sample.pep2

You were on the right track with the array. In file2 you need to check that the line starts with the matching name from file1, before appending its data. That's what a[$1] != "" does.

This example assumes the first file only has two fields (no spaces in the data). If there are spaces I can post an edit.

answered May 05 '22 at 09:17

dan

4,846
6
15

Thanks! May I ask why we use so many NR=FNR or NR!=FNR? I always have problem to understand the NR and FNR thing. Thanks!! – zzz May 05 '22 at 10:38
1

Sure. It's used to check if we are processing the first file or not. A "record" in awk is a line by default. `FNR` is the record number of the current file. `NR` is the record number over all files. On the first file, `NR` is equal(`==`) to `FNR`. When we get to the second file, `FNR` goes back to one, but `NR` keeps incrementing. If `NR` is not equal (`!=`) to `FNR` we know we have finished reading the first file. So in the answer, the first line operates on the first file, the last two lines operate on the second file. – dan May 05 '22 at 11:04
That's a great answer! super clear1 Thanks a lot!! – zzz May 05 '22 at 11:43

awk combine info from two files (fasta file header)

3 Answers3