Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

Question

I want to filter my sequences which has more than 8 same consecutive nucleotides like "GGGGGGGG", "CCCCCCCC", etc in my fastq files.

How should I do that?

score 1 · Answer 1 · answered Nov 03 '19 at 13:31

The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq. This will miss blocks where the 8-mer is split across two lines (e.g. the first line ends with AAAA and the second starts with AAAA). It also assumes the output has blocks of 4 lines each.
The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence do not have an 8-mer as above. If that's the case, then output the buffered block.

score 1 · Answer 2 · edited Nov 08 '19 at 08:19

1

I ended up to use below codes in R and solved my problem.

library(ShortRead)

fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)


trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC", 
                       sread(reads_fq), invert = TRUE)]

writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)

edited Nov 08 '19 at 08:19

zx8754

52,746
12
114
209

answered Nov 05 '19 at 17:35

Dawud

51
3

Please accept as answer if this resolves the question. – zx8754 Nov 08 '19 at 08:20

score 0 · Answer 3 · answered Nov 04 '19 at 10:05

You can use the Python package biotite for it (https://www.biotite-python.org).

Let's say you have the following FASTQ file:

@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G@?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7@+6!FDC&
+4=4+,@2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,@'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?@2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'@",8:#5%18&+3>'8:28+:5F0);E9<=,+

This is a script, that should do the work:

import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq


# 'GGGGGGGG', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]

fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")
# Iterate over sequence entries in file
for header in fastq_file:
    sequence = fastq_file.get_sequence(header)
    # Iterative over each of the consecutive sequences
    for consecutive_nuc in consecutive_nucs:
        # Find all indices, where a match was found
        matches = seq.find_subsequence(sequence, consecutive_nuc)
        if len(matches) > 0:
            # If any match was found report it
            print(
                f"Found '{consecutive_nuc}' "
                f"in sequence '{header}' at position {matches[0]}"
            )

This is the output:

Found 'CCCCCCCC' in sequence 'Read:01' at pos 8

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

3 Answers3