1

I have a parsed fastq file, and I am doing some operations with the reads. Concretely, I am trying to determine if my fastq files have reads that belong to a microorganism contamination, instead of my human sample. So, if my read is a contamination, I need to remove it from my fastq file in order to have only the reads that belong to the human sample. I tried this;

for record_seq in SeqIO.parse("file.fastq","fastq"):
    if condition==T:
        record_seq=""


But with this code i wasnt deleting my record, cause i had the same records counts in the final file.
So i thinked about the method pop , but i can not use it due to it is only for lists, and record_seq is a SeqRecord object... any ideas? Thanks!

2 Answers2

0

Test fasta file:

@seq1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@seq2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@seq3
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Code:

from Bio import SeqIO

if __name__ == "__main__":
  # Read sequence and store, based on condition
  seqs = [seq for seq in SeqIO.parse("test.fastq", "fastq") if seq.name != "seq2"]
  # Overwrite file
  with open("test.fastq", "w") as fh:
    SeqIO.write(seqs, fh, "fastq")

Final file:

@seq1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@seq3
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
777moneymaker
  • 697
  • 4
  • 15
0

Try this if this is what you want. Suppose that you have specific sequences belong to your microorganism, make a list as below and then run the code:

micro_organism_seqs=['seq1','seq2','seq3',...]
#open a new file
new = open("new_file.fastq", "w")

with open(fastq,'r') as f:
    for record in SeqIO.parse(f,'fastq'):
       #search for every seq belongs to microorganism
       for seq in micro_organism_seqs: 
          #save reads if do not have microorganism reads inside
          if seq not in record.seq:
              #save reads in the fastq format
              new_fastq = record.format('fastq')
              #write to the file
              new.write(new_fastq)

new.close()

Apex
  • 1,055
  • 4
  • 22