4

I want to filter my sequences which has more than 8 same consecutive nucleotides like "GGGGGGGG", "CCCCCCCC", etc in my fastq files.

How should I do that?

zx8754
  • 52,746
  • 12
  • 114
  • 209
Dawud
  • 51
  • 3

3 Answers3

1
  1. The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq. This will miss blocks where the 8-mer is split across two lines (e.g. the first line ends with AAAA and the second starts with AAAA). It also assumes the output has blocks of 4 lines each.
  2. The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence do not have an 8-mer as above. If that's the case, then output the buffered block.
OronNavon
  • 1,293
  • 8
  • 18
1

I ended up to use below codes in R and solved my problem.

library(ShortRead)

fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)


trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC", 
                       sread(reads_fq), invert = TRUE)]

writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
zx8754
  • 52,746
  • 12
  • 114
  • 209
Dawud
  • 51
  • 3
0

You can use the Python package biotite for it (https://www.biotite-python.org).

Let's say you have the following FASTQ file:

@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G@?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7@+6!FDC&
+4=4+,@2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,@'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?@2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'@",8:#5%18&+3>'8:28+:5F0);E9<=,+

This is a script, that should do the work:

import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq


# 'GGGGGGGG', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]

fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")
# Iterate over sequence entries in file
for header in fastq_file:
    sequence = fastq_file.get_sequence(header)
    # Iterative over each of the consecutive sequences
    for consecutive_nuc in consecutive_nucs:
        # Find all indices, where a match was found
        matches = seq.find_subsequence(sequence, consecutive_nuc)
        if len(matches) > 0:
            # If any match was found report it
            print(
                f"Found '{consecutive_nuc}' "
                f"in sequence '{header}' at position {matches[0]}"
            )

This is the output:

Found 'CCCCCCCC' in sequence 'Read:01' at pos 8
Padix Key
  • 857
  • 6
  • 15