Counting and removing characters in different lines

Question

I have DNA sequence data in FASTQ format, which takes a 4-line format per record:
@sequence-header-information
sequence
+
quality-scores

Each character in the sequence line has a corresponding character in the quality score line. All the sequences contain an 8-9 base barcode sequence, followed by a primer region that denotes the start of the biological sequence. In front of the barcode, there may or may not be some junk bases present. I need to separate the sequences (with their associated information) into separate files according to barcode, and remove everything before the primer. For the sequence itself, I can do this with grep and sed (a and b are variables set by looping over a file of barcode sequences with sample numbers; the regexp hits the primer):

a=AAAGCG
b=1 
cat DNA.fastq | grep -A2 -B1 -E "${a}""CCTACGGG[ACGT]{1}GGC[AT]{1}GCAG" | grep -v "^--$" | sed -E 's/.+(CCTACGGG[ACGT]{1}GGC[AT]{1}GCAG)/\1/' > sample_${b}.fastq

This will work nicely on input data like this, where I want to remove AAAGCG from the start of line 2 and ATAAAGCG from the start of line 6):

@M00816:90:000000000-AE7TD:1:2116:19022:11483 1:N:0:1
AAAGCGCCTACGGGTGGCAGCAGTGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTGGGAAGAAGGCCTTCGGGTTGTAAACCACTTTTGTCAGGGAAGAAACGGTCTGAACTAATATTTCGGACTAATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCAGCTTTGCAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGCACTGCA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFCFGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGFGGGGGGCFFGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGCAGGGG?FGGGGGGGFGGGG7FG7CGGGGGGGGGGGGEGGGFFFG585C98DEEG?CFG*<<@>37;68CEFCFGF99EGGEEDGE54<9C<8CC>*854C<.
@M00816:90:000000000-AE7TD:1:2118:10209:10682 1:N:0:1
ATAAAGCGCCTACGGGGGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTATTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGCAAGCCTGGAGCTCAACTCCAGAACTGCCTTTGAACCTGGATAGCTTGAATCAT
+
--CCCCCGGGGGGGGGGGGGGGGGGFGGGGDEGGGGGGFGGGFGGGGGGGGFGGGG8AEA9DF@FGGGGGGEFGFGFFCFGGG,@FFFG<,<FFGGGGGGDGGGGGGGGFGG7BEE>EDGGGFGFGGFFGGGGGGGGGGGCFGGGC:CGFGGG8,<CEEGGGGFGGGFCE5CEEGGGGGGGGGGGG88C5@CEEE5CCFGGGGGGFGCFGGCG:>G>@FGDCG=F?*1+?E>EE@F@FFG9<9EGGGGG477AFFFDG69F=04C89?FGG7C6CFGG@:EECC8/;;EGC?EC>FF:DA74)

The problem is how to then remove the right number of characters from the start of the quality score. In the example above, I need to remove 6 characters from the start of line 4 (CCCCCG) and 8 from line 8 (--CCCCCG). This could be done by counting the number of bases before the barcode & primer in lines 2 & 4 and removing this number of characters from the score. Alternatively, a second command could be used on the output files to count the length of each sequence (lines 2 & 4 in sample_1.fastq) and trim the quality score to the same length. The problem with both is that they require storing information from one line and using it in a process on a subsequent line.

Idea 1: store sequence lengths in separate text file

cat sample_1.fastq | sed -n '2~4p' | awk 'BEGIN{print length($0)} > seq_lengths.txt'

The problem with this is I can't think how to loop over each line of the file simultaneously with each record in sample_1.fastq (as opposed to nested loops).

Idea 2:

cat sample_1.fastq | awk '$0 == "+" {
a=length $NR-1
b=gensub(/^.{a}/, 1, $NR+1)
print $(NR-2), $(NR-1), $0, b
}' > sample1test.fastq

Testing this revealed that length $NR-1 gives the length -1, not the length of the previous line! (I'm still an awk novice.)

Idea 3: Count the length of each sequence line in sample_1.fastq and print it in next line. Cat the file again, grep -B2 -A2 this number and store as a variable. Use sed to trim the quality score to the value of the variable and remove the extra line. This would require storing a variable without disrupting the flow through a pipe, which by my understanding isn't possible.

Idea 4: Set multiples of 2 as var1, and multiples of 4 as var2. Use awk to print the length of sample_1.fastq line number var1 and set that as var3. Trim line var2 to length var3. This would require being able to call lines by number, which I understand also isn't possible.

Any ideas or insights would be greatly appreciated! Many thanks.

You can avoid the need to pipe with cat by placing the filename at the end of the awk statement before the re-direct. Also length needs to be used as a function i.e. a=length($NR-1) — Raman Sailopal, Jul 17 '17 at 09:12
It's not clear what you want to do. Try to give an example showing what you need to count and what you have to remove from the lines. — weibeld, Jul 17 '17 at 09:44
This format is sufficiently advanced that I would strongly recommend using a better language like Python, Ruby or Java for this job. — l0b0, Jul 17 '17 at 10:50
I have added more detail to the example to hopefully make it clearer. — ecologysarah, Jul 17 '17 at 11:17

score 0 · Answer 1 · answered Jul 17 '17 at 13:43

This seems like it would be a not uncommon need, so the first thing I'd do is look for a pre-written utility which does it.

A couple of top Google hits:

fastqutils barcode_split - http://ngsutils.org/modules/fastqutils/barcode_split/
FASTX barcode_splitter - http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage
other options from this 7 year old Biostars questions - https://www.biostars.org/p/1615/

BTW Biostars might be a better place for questions like this.

Counting and removing characters in different lines

1 Answers1