I have to check whether a file is FASTA, FASTQ, or neither. For the FASTA check I used the `SeqIO` module from `Bio`:

from Bio import SeqIO

def is_fasta(filename):
    with open(filename, "r") as handle:
        fasta = SeqIO.parse(handle, "fasta")
        return any(fasta)

This returns `True` if the file is FASTA and `False` if it isn't. But when I use the FASTQ version of this function:

def is_fastq(filename):
    with open(filename, "r") as handle:
        fastq = SeqIO.parse(handle, "fastq")
        return any(fastq)

I get an error message:

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in next
    return next(self.records)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 1085, in iterate
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 932, in FastqGeneralIterator
    "Records in Fastq files should start with '@' character"

ValueError: Records in Fastq files should start with '@' character

Can someone help me understand why it doesn't work the same way for FASTA and FASTQ? And how can I check whether a file is a real FASTQ file?

C insi
  • The Biopython FASTQ parser specifically aims to parse FASTQ records. If you pass it a file whose first record doesn't start with `@`, it raises an error. The FASTA parser won't raise an error if you pass it a FASTQ file. Instead, you should use `try`/`except`. – Alex Oct 17 '21 at 11:40
  • @alex does your comment above mean that the file submitted as FASTQ is not a FASTQ file? – pippo1980 Oct 17 '21 at 18:35
  • @pippo1980, I don’t fully understand what you’re asking. When `SeqIO.parse` is called a file handle and the file format name are passed. When the format is "fasta" a FASTA parser is used, it will iterate through an entire FASTQ file without raising an error or returning any records. When the format is "fastq" a FASTQ parser is used, it will raise an error when a FASTQ file isn’t provided. The file extension (`.{fa,fasta,fq,fastq}`) is not considered or used; only the file format name. – Alex Oct 17 '21 at 22:07
  • I have the feeling that you are asking about something you have already solved. You want to know whether a file is a valid FASTQ file, right? If you give the file to the FASTQ parser and it fails, then it is not a valid FASTQ file. If it works, then it is a valid FASTQ file. You already have that information in your code. In your example, you get a ValueError, which clearly tells you that a FASTQ file should start with `@`, and that is not the case with your file. – Poshi Nov 08 '21 at 16:30
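
To illustrate the difference described in the comments, here is a toy model of the two behaviours. It is not Biopython's actual code: `lenient_fasta_titles` and `strict_fastq_titles` are hypothetical stand-ins written for this example, and the FASTQ one assumes simple four-line records. The point is that a reader which skips everything until a `>` line simply yields nothing for a FASTQ file, while a reader which insists on `@` raises as soon as `any()` asks for the first record.

```python
def lenient_fasta_titles(lines):
    """Toy FASTA reader: silently ignores every line that is not a '>' title."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def strict_fastq_titles(lines):
    """Toy FASTQ reader: every four-line record must start with '@'."""
    lines = iter(lines)
    for title in lines:
        if not title.startswith("@"):
            raise ValueError("Records in Fastq files should start with '@' character")
        # Skip the sequence, '+' separator and quality lines of this record.
        for _ in range(3):
            next(lines, None)
        yield title[1:].strip()


fasta_text = [">seq1\n", "ACGT\n"]
fastq_text = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]

print(any(lenient_fasta_titles(fasta_text)))  # True
print(any(lenient_fasta_titles(fastq_text)))  # False: nothing starts with '>'
print(any(strict_fastq_titles(fastq_text)))   # True
try:
    any(strict_fastq_titles(fasta_text))
except ValueError as err:
    print("ValueError:", err)
```

Because both readers are generators, nothing is checked until the first record is requested, which is why the error surfaces inside `any()` rather than when the reader is created.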

1 Answer

As per @Alex's suggestion, here is my attempt:

from Bio import SeqIO

# Leave exactly one of these two assignments active:
# filename = 'fastq.fastq'
filename = 'fasta.fasta'


def is_fasta(filename):
    with open(filename, "r") as handle:
        fasta = SeqIO.parse(handle, "fasta")
        # any() is True as soon as the parser yields one record.
        return any(fasta)


def is_fastq(filename):
    with open(filename, "r") as handle:
        fastq = SeqIO.parse(handle, "fastq")
        try:
            return any(fastq)
        except ValueError as e:
            # The FASTQ parser raised ValueError, so treat the file as not FASTQ.
            print(e)
            return False


print(' is it fasta ? : ', is_fasta(filename))
print(' is it fastq ? : ', is_fastq(filename))

The script needs one of two test files, `'fastq.fastq'` or `'fasta.fasta'`. Leave just one of the two `filename` assignments uncommented.

Result with a valid FASTQ file:

is it fasta ? :  False
is it fastq ? :  True

Result with a valid FASTA file:

is it fasta ? :  True
Records in Fastq files should start with '@' character
is it fastq ? :  False

It seems to me that the FASTA parser won't raise any error if the file it reads isn't in the right format and just provides an empty iterator, while the FASTQ parser will tell you that the file is wrong. Please, @alex, correct me if I am wrong (I am learning too).
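
As a complement to the parser-based check, a quick pre-check can look only at the first non-blank line, since FASTA records start with `>` and FASTQ records start with `@`. The sketch below uses only the standard library; `sniff_format` is a hypothetical helper name, and the sample file names and contents are made up for the demonstration. It is a heuristic, not a validator: it does not confirm that the rest of the file is well formed.

```python
import os
import tempfile


def sniff_format(filename):
    """Guess 'fasta', 'fastq' or None from the first non-blank line only."""
    try:
        with open(filename, "r") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip leading blank lines
                if line.startswith(">"):
                    return "fasta"
                if line.startswith("@"):
                    return "fastq"
                return None  # first real line matches neither format
    except UnicodeDecodeError:
        return None  # not decodable as text, so neither format
    return None  # empty file


# Demonstration on throwaway files with made-up contents.
samples = {
    "reads.txt": "@read1\nACGT\n+\nIIII\n",
    "genes.txt": ">seq1\nACGT\n",
    "notes.txt": "hello world\n",
}
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as out:
            out.write(text)
        results[name] = sniff_format(path)

print(results)
# {'reads.txt': 'fastq', 'genes.txt': 'fasta', 'notes.txt': None}
```

A file that passes this pre-check can then be handed to the matching `SeqIO.parse` format for a full check.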

pippo1980