0

I have a RNA-seq data that looks like this:

@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
@J00157:85:HNNJLBBXX:5:1101:12550:15574 1:N:0:ATTACTCG+TATAGCCT
GCTCTTCCGATCTGCTATTGATGACTGTCCTCTGTTCTTTCTTTCACAGTAGACGAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTC
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--

If we treat all content after @ as a section, you can see only the second line is the real sequencing information, 1,3,4,5 are logistic/quality information.

The goal is to extract sequences (second line information) that containing "GCTGCA" in the first N (N=35) characters each line, and at the same time output the surrounding lines (1 line ahead, 3 line behind the matched line).

An example answer is

@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--

What I have tried are

awk 'substr($0, 1, 35) ~ "GCTGCA"' filename.fastq > newfile.fastq
grep -B 1 -A 2 -E GCTGCA filename.fastq > newfile.fastq
awk '{a[++i]=$0;}{substr(a[++i], 1, 35) ~ "GCTGCA"}{for(j=NR-1;j<=NR+2;j++)print a[j];}' filename.fastq > newfile.fastq

The first one cannot output surrounding lines. The second one cannot limit pattern-matching in the first 35 letters of each line. The third line should work, but it gives me wired output (which obviously is not correct):

@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
+
+
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
--
--
Ziyi Li
  • 39
  • 3
  • Your formatting gets a little ugly at the end. Can you fix it? – glenn jackman Apr 24 '18 at 19:42
  • Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See [What topics can I ask about here](http://stackoverflow.com/help/on-topic) in the Help Center. Perhaps [Super User](http://superuser.com/) or [Unix & Linux Stack Exchange](http://unix.stackexchange.com/) would be a better place to ask. – jww Apr 24 '18 at 23:26

2 Answers2

2

with gawk multi-char RS support.

awk -v RS='\n--' -F'\n' 'substr($2,0,35)~"GCTGCA"{print $0 RS}' file

you define the record with the record separator.

karakfa
  • 66,216
  • 7
  • 41
  • 56
1

With awk using getline:

search.awk:

substr($0,0,35)~"GCTGCA"  {
    print p # Print the previous line ...
    print # ... , current line ...
    for(i=0;i<=2;i++) { # ... and the 3 lines following it
        getline
        print
    }
}

# Store the previous line
{ p = $0 }

Call it like this:

awk -f search.awk input_file

Or without regular expressions and with a parameter:

search.awk

index(substr($0,0,35), search)  {
    print l
    print
    for(i=0;i<=2;i++) {
        getline
        print
    }
}

{ l = $0 }

Call it like

awk -v search="GCTGCA" -f search.awk input_file
hek2mgl
  • 152,036
  • 28
  • 249
  • 266
  • Thank you for the answer. But I do not know how to build this search.awk. I copy and paste your code section to linux, it gives me error here: { buffer() } -bash: syntax error near unexpected token `}' Did I do this in the right way? – Ziyi Li Apr 24 '18 at 18:45
  • put it into a file called `search.awk` – hek2mgl Apr 24 '18 at 18:45