I have a RNA-seq data that looks like this:
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
@J00157:85:HNNJLBBXX:5:1101:12550:15574 1:N:0:ATTACTCG+TATAGCCT
GCTCTTCCGATCTGCTATTGATGACTGTCCTCTGTTCTTTCTTTCACAGTAGACGAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTC
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
If we treat all content after @ as a section, you can see only the second line is the real sequencing information, 1,3,4,5 are logistic/quality information.
The goal is to extract sequences (second line information) that containing "GCTGCA" in the first N (N=35) characters each line, and at the same time output the surrounding lines (1 line ahead, 3 line behind the matched line).
An example answer is
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
What I have tried are
awk 'substr($0, 1, 35) ~ "GCTGCA"' filename.fastq > newfile.fastq
grep -B 1 -A 2 -E GCTGCA filename.fastq > newfile.fastq
awk '{a[++i]=$0;}{substr(a[++i], 1, 35) ~ "GCTGCA"}{for(j=NR-1;j<=NR+2;j++)print a[j];}' filename.fastq > newfile.fastq
The first one cannot output surrounding lines. The second one cannot limit pattern-matching in the first 35 letters of each line. The third line should work, but it gives me wired output (which obviously is not correct):
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
+
+
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
--
--