using awk to extract a specific pattern

Question

Let me explain my problem.

I have a huge file in GFF format that looks like this:

scaffold_31 AUGUSTUS    CDS 18857   19210   0.63    +   0   transcript_id "g56.t1"; gene_id "g56";
scaffold_32 AUGUSTUS    CDS 8973    9290    0.82    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS    CDS 11374   11507   0.96    -   2   transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS    CDS 11586   11733   0.39    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_33 AUGUSTUS    CDS 5303    5323    0.83    -   0   transcript_id "g58.t1"; gene_id "g58";
scaffold_33 AUGUSTUS    CDS 5810    6034    0.97    -   0   transcript_id "g58.t1"; gene_id "g58";
scaffold_34 AUGUSTUS    CDS 1390    1805    0.87    +   1   transcript_id "g59.t1"; gene_id "g59";
scaffold_37 AUGUSTUS    CDS 15299   15390   0.91    -   2   transcript_id "g60.t1"; gene_id "g60";
scaffold_37 AUGUSTUS    CDS 15622   15826   0.88    -   0   transcript_id "g60.t1"; gene_id "g60";

and so on. I would like a command that separates the transcripts whose FIRST CDS starts in codon phase 0 (the 7th column) from those whose FIRST CDS starts in phase 1 or 2. I would like to end up with 3 files, which for the sample above would be:

First file: with the first CDS of the transcript starting in phase 0.

scaffold_31 AUGUSTUS    CDS 18857   19210   0.63    +   0   transcript_id "g56.t1"; gene_id "g56";
scaffold_32 AUGUSTUS    CDS 8973    9290    0.82    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_33 AUGUSTUS    CDS 5303    5323    0.83    -   0   transcript_id "g58.t1"; gene_id "g58";
scaffold_33 AUGUSTUS    CDS 5810    6034    0.97    -   0   transcript_id "g58.t1"; gene_id "g58";

The second, with the first CDS of the transcript starting in phase 1:

scaffold_34 AUGUSTUS    CDS 1390    1805    0.87    +   1   transcript_id "g59.t1"; gene_id "g59";

And the third with the first CDS of the transcript starting in phase 2:

scaffold_37 AUGUSTUS    CDS 15299   15390   0.91    -   2   transcript_id "g60.t1"; gene_id "g60";
scaffold_37 AUGUSTUS    CDS 15622   15826   0.88    -   0   transcript_id "g60.t1"; gene_id "g60";

As you can see, because the transcript with transcript_id "g60.t1", for example, has its first CDS starting in phase 2, all the following CDS lines belonging to that transcript have to go into the same file.

Thanks for your help, I hope someone can find a solution :) I thought awk could help?

score 0 · Answer 1 · answered Apr 20 '18 at 17:02

awk to the rescue!

$ awk '!($1 in a){fn = "phase_"$8; a[$1]} {print > fn}' file

I think you meant the 8th column.
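
Note that the one-liner above keys on the scaffold name (`$1`), so it assumes one transcript per scaffold, as in the sample. If a scaffold can carry several transcripts, a variant keyed on the transcript ID may be closer to what was asked. This is only a sketch, assuming whitespace-separated fields (so the phase is `$8` and the quoted transcript ID is `$10`) and that the first CDS of each transcript is the first of its lines in the file. The `sample.gff` and `phase_N.gff` file names are made up for the example.

```shell
# Six lines copied from the question, used as a small test input.
cat > sample.gff <<'EOF'
scaffold_32 AUGUSTUS    CDS 8973    9290    0.82    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS    CDS 11374   11507   0.96    -   2   transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS    CDS 11586   11733   0.39    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_34 AUGUSTUS    CDS 1390    1805    0.87    +   1   transcript_id "g59.t1"; gene_id "g59";
scaffold_37 AUGUSTUS    CDS 15299   15390   0.91    -   2   transcript_id "g60.t1"; gene_id "g60";
scaffold_37 AUGUSTUS    CDS 15622   15826   0.88    -   0   transcript_id "g60.t1"; gene_id "g60";
EOF

awk '
  # First line seen for a transcript ID ($10): remember an output file
  # named after the phase ($8) of that first CDS.
  !($10 in out) { out[$10] = "phase_" $8 ".gff" }
  # Every line, first or later, follows its transcript into that file.
  { print > (out[$10]) }
' sample.gff
```

On these six lines it should leave the three g57 lines in `phase_0.gff`, the g59 line in `phase_1.gff`, and both g60 lines in `phase_2.gff`. If the real file is tab-separated with the attributes in a single ninth column, the default whitespace splitting still gives the same field numbers, but with `-F'\t'` you would have to key on `$9` instead.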

karakfa

Hi, thanks for your help, but when I run your code on my file it doesn't really do anything: I get one file that is exactly the same as the input, and another containing only the title line from the top of the first one... – Grendel Apr 20 '18 at 23:06
