I have DNA sequence data in FASTQ format, which takes a 4-line format per record:
@sequence-header-information
sequence
+
quality-scores
Each character in the sequence line has a corresponding character in the quality score line. All the sequences contain an 8-9 base barcode sequence, followed by a primer region that denotes the start of the biological sequence. In front of the barcode, there may or may not be some junk bases present. I need to separate the sequences (with their associated information) into separate files according to barcode, and remove everything before the primer. For the sequence itself, I can do this with grep and sed (a and b are variables set by looping over a file of barcode sequences with sample numbers; the regexp hits the primer):
a=AAAGCG
b=1
cat DNA.fastq | grep -A2 -B1 -E "${a}""CCTACGGG[ACGT]{1}GGC[AT]{1}GCAG" | grep -v "^--$" | sed -E 's/.+(CCTACGGG[ACGT]{1}GGC[AT]{1}GCAG)/\1/' > sample_${b}.fastq
This will work nicely on input data like this, where I want to remove AAAGCG from the start of line 2 and ATAAAGCG from the start of line 6):
@M00816:90:000000000-AE7TD:1:2116:19022:11483 1:N:0:1
AAAGCGCCTACGGGTGGCAGCAGTGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTGGGAAGAAGGCCTTCGGGTTGTAAACCACTTTTGTCAGGGAAGAAACGGTCTGAACTAATATTTCGGACTAATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCAGCTTTGCAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGCACTGCA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFCFGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGFGGGGGGCFFGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGCAGGGG?FGGGGGGGFGGGG7FG7CGGGGGGGGGGGGEGGGFFFG585C98DEEG?CFG*<<@>37;68CEFCFGF99EGGEEDGE54<9C<8CC>*854C<.
@M00816:90:000000000-AE7TD:1:2118:10209:10682 1:N:0:1
ATAAAGCGCCTACGGGGGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTATTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGCAAGCCTGGAGCTCAACTCCAGAACTGCCTTTGAACCTGGATAGCTTGAATCAT
+
--CCCCCGGGGGGGGGGGGGGGGGGFGGGGDEGGGGGGFGGGFGGGGGGGGFGGGG8AEA9DF@FGGGGGGEFGFGFFCFGGG,@FFFG<,<FFGGGGGGDGGGGGGGGFGG7BEE>EDGGGFGFGGFFGGGGGGGGGGGCFGGGC:CGFGGG8,<CEEGGGGFGGGFCE5CEEGGGGGGGGGGGG88C5@CEEE5CCFGGGGGGFGCFGGCG:>G>@FGDCG=F?*1+?E>EE@F@FFG9<9EGGGGG477AFFFDG69F=04C89?FGG7C6CFGG@:EECC8/;;EGC?EC>FF:DA74)
The problem is how to then remove the right number of characters from the start of the quality score. In the example above, I need to remove 6 characters from the start of line 4 (CCCCCG) and 8 from line 8 (--CCCCCG). This could be done by counting the number of bases before the barcode & primer in lines 2 & 4 and removing this number of characters from the score. Alternatively, a second command could be used on the output files to count the length of each sequence (lines 2 & 4 in sample_1.fastq) and trim the quality score to the same length. The problem with both is that they require storing information from one line and using it in a process on a subsequent line.
Idea 1: store sequence lengths in separate text file
cat sample_1.fastq | sed -n '2~4p' | awk 'BEGIN{print length($0)} > seq_lengths.txt'
The problem with this is I can't think how to loop over each line of the file simultaneously with each record in sample_1.fastq (as opposed to nested loops).
Idea 2:
cat sample_1.fastq | awk '$0 == "+" {
a=length $NR-1
b=gensub(/^.{a}/, 1, $NR+1)
print $(NR-2), $(NR-1), $0, b
}' > sample1test.fastq
Testing this revealed that length $NR-1 gives the length -1, not the length of the previous line! (I'm still an awk novice.)
Idea 3: Count the length of each sequence line in sample_1.fastq and print it in next line. Cat the file again, grep -B2 -A2 this number and store as a variable. Use sed to trim the quality score to the value of the variable and remove the extra line. This would require storing a variable without disrupting the flow through a pipe, which by my understanding isn't possible.
Idea 4: Set multiples of 2 as var1, and multiples of 4 as var2. Use awk to print the length of sample_1.fastq line number var1 and set that as var3. Trim line var2 to length var3. This would require being able to call lines by number, which I understand also isn't possible.
Any ideas or insights would be greatly appreciated! Many thanks.