SeqIO.index() doesn't return a true dictionary, but a dictionary-like object, giving the SeqRecord objects as values:
Note that this pseudo dictionary will not support all the methods of a
true Python dictionary, for example values() is not defined since this
would require loading all of the records into memory at once.
This dictionary-like object is an _IndexedSeqFileDict instance. Its docstring mentions:
Note that this dictionary is essentially read only. You cannot
add or change values, pop values, nor clear the dictionary.
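For example (a minimal sketch; "example.fastq" is a placeholder file name, and the exact exception raised may differ between Biopython versions), attempting to delete a record from such an index fails:
from Bio import SeqIO
# "example.fastq" is a placeholder file name used only for illustration
reads = SeqIO.index("example.fastq", "fastq")
# looking up a record works like a normal dictionary lookup
first_id = next(iter(reads))
record = reads[first_id]
print(record.id, len(record.seq))
# ...but the index cannot be modified; the exact exception type
# may vary between Biopython versions
try:
    del reads[first_id]
except (TypeError, NotImplementedError) as err:
    print("index is read only:", err)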
So you will need to convert your FASTQ file to an in-memory Python dictionary using SeqIO.parse() and SeqIO.to_dict():
from Bio import SeqIO
ref_reads = SeqIO.parse("file1.fastq", "fastq")
spk_reads = SeqIO.parse("file2.fastq", "fastq")
ref_reads_dict = SeqIO.to_dict(ref_reads)
# remove every read from file1 that also occurs in file2
for spk in spk_reads:
    if spk.id in ref_reads_dict:
        del ref_reads_dict[spk.id]
If your files are so large that working with SeqIO.parse()
isn't feasible, then I would do something like this:
from Bio import SeqIO
ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
# note that ref_reads.keys() doesn't return a list but a dictionary-view-like object,
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
unique_ref_keys = ref_keys - spk_keys
# this step might take a long time if your files are large
unique_ref_reads = {key: ref_reads[key] for key in unique_ref_keys}
Edit, in answer to your comment:
how can I again solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")?
As I stated above, SeqIO.index("file1.fastq", "fastq") returns a read-only _IndexedSeqFileDict object, so you cannot, by design, delete items from it. The updated code below shows how to create a new FASTQ file with the overlapping reads removed. If you really want a new SeqIO.index() object, you can then read this file back in with SeqIO.index().
from Bio import SeqIO
ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
unique_ref_keys = ref_keys - spk_keys
# conserve memory by using a generator expression
unique_ref_records = (ref_reads[key] for key in unique_ref_keys)
# output a new file with the overlapping reads removed
fname_out = "unique_ref.fastq"  # placeholder output file name, choose your own
with open(fname_out, "w") as output_handle:
    SeqIO.write(unique_ref_records, output_handle, "fastq")
# optionally, create a new SeqIO.index() object
unique_ref_reads = SeqIO.index(fname_out, "fastq")