2

I'm working on trying to automate some BLAST searches. I need to pick up only the top three results from the BLAST results, however the parameter hitlist_size doesn't seem to be limiting my searches to only three results. No matter what size I specify I still end up getting > 3 hits for most samples (in others I get three, although I'm not sure if this is simply coincidence) . Does anyone have any insight to this?

import Bio
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
fasta_string = open("FASTA_Files/48_50.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string,hitlist_size=3)
    with open("XML_Files/48_50.xml", "w") as save_to:
        save_to.write(result_handle.read())
        result_handle.close()

The following FASTA sequence produces the expected number of hits for 46 but exceeds the expected number of hits for 47:

>46
NNNNNNNNNNGNNNNNNTGCAGTCGNANNNNNNNNNNNNNNNNNAGCTTGCTGCTTCGCTGACGAGTGGCGGACGGGTGA
GTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTGGCTAATACCGCATAACGTCGCAAGACCAA
AGAGGGGGACCTTCGGGCCTCTTGCCATCAGATGTGCCCAGATGGGATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGG
CGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCA
GTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTAC
TTTCAGCGGGGAGGAAGGNGTTGTGGTTAATAACCGCAGCAATTGACGTTACCCGCANAANAAGCACCGGCTAACTCCGT
GCCAGCAGCCGCGGTAATACGGAGGGTGCANGCGTTAATCGNAATTACTGGNCGTAAAGCGCACGCAGGCGGTCTGTCAA
GTCTGATGTGAAATCCCCGGGCTCAACCTGGNAACTGCATTCNAAACTGGCAGGCTTGAGTCTTGNAGANGGGGGGTAGA
ATTCCAGGTGTANCGNTGAAATGCGTANAGATCTGGANGNAATACCGGTGGCNAANGCGGCCCCCTGGACAAAGACTGAC
GCTCANGTGCGAAAGCGTGGNGGAGCAAANAGGATTAGATACCCTGGTAGNCCNNCGCCNNANACGATGTCTACNTGGNA
GGNTTGTGNCCTTGAGGCGTGNCTNNCCNGNAGCTNAACGCGTTTAAGTANANCNNCCTGGGGCGAGCNACGGGCNGCNN
GGNTNAAAACNNNNNNTGNNNTTNNACGGGNGNNCCCCGCANNANCCGGCTGNNAGCATGNTGGATTNAANTTCGATNNN
NNCGCGAAGAANCCNNANNNNNGNNNNNNNANNNNNNNNNNAANNNTNNNNNANANNNNNNNNNNNNNNNGNNNNCTNNN
AGNANNGNNNCTGCATGGNNNNCNNNCNNGNTCNNGNNNNNNNNNNNNNGNNNNNNNNCCNNNANNNNNNNNNNNNTNAT
CNNNNNNNNNNNNNTNNNNNNNNNNNNNAGNNNNNNNNCNNNNNNNNNNANNTNNNAANN
>47
NNNNNNNNNNNNNNNNNNNNNNNNGNNNNNCGGGTNNCCNNNNGCTGGNTGNNNTGCTGACGAGTGGCGGACGGGTGAGT
AATGTCTGGGAAACTGCCTGAAGGGGGGGGATTCCTACTGGCCAGGGTGGCTAATACCGGGTAACGTCGNNNGANCAAAG
AGGGGGACCTTCGGGCCTCTTGCCATCACATGTGCCCGGATGGGATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGGCG
ACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCNCACTGGAACTGAGACACGGTCCCGACTCCTACGGGAGGCANCAGT
GGGGAATATTGCTCTTGGGCGCAAGCCTGATGCAGCCATGCCGCGGGTATGAGGAAGGCCTTCGGTTTGTAAAGTACTTT
CTCCGGGGAGGAAGGNGTNGTGGTGAATAACCGCTACANTTGANNCTNCCCGCNNAANAACCACCNGNTAACTCCNTGCN
NNNNGCCGCGGTAATACGGANGGTGCAAGNGTTAATCGNANTTACTGNNTGTTGAGCGCACGNNGGCGGCCTGTCNNNTC
TNATGTGAGATCCCCGGGCTCNCCCTGNNACCTGCATTCGNNNNNTGNNANGCTNGANTCTTGNNNNGNNGNGGNAGNAA
TTCCNNGTGTNNCGNNGAAATGCNNANAGATCTGNANANANNACNGGNGNCCAANGNNGNCCCCTGNTCTCNGACTGACG
CNNGAGTGCTGAANNGTGNAGAGCGNACAGGATTANANNNCCNGNTAGNCCGNCNCCNCACACCNATGTCTACATGNGAG
GNTNNNGNNNNNTGNGGCNNNNCNNTCCNNNAGCTNANGNNGTTNAANTANATCNNNCTNNNNCNNGCNNGGGGNCANNA
NGGGNNNAAANNTNNNNATNAATNTGACGGANNNNNCNNNNNNNNCNNNNNNANCATGNGGATTNANNNTNNNTNNNNNC
NNNNNANAACCNNANNNNNNNNNNNNNNTNNNNNNNANNNTNNNNNNNTNNNNNNGNNNNCNNNNNNNNACTNNNNNNNC
NNNNNNNNCANNGNNNNNNNNNNNANGNTNNNNNTGNNNNAANNNTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTN
NNNNNNTNNNNNNNTNNNNTNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNTCNNNNNN
  • Welcome to StackOverflow! Can you provide a reproducible example of this (i.e. some fasta sequence that produces more than >3 hits?) I am trying to get the same problem but I seem to get the right number of hits (under ``) – cnluzon Apr 09 '19 at 13:03
  • Sure! Could you also share your fasta sequence so I could have a look? – AccidentalCube Apr 09 '19 at 13:10
  • @cnluzon Update on this: I think my problem is with my XML parsing. I'm in the process of actually figuring out what I did wrong, but I think it has something to do with my `for` loop calling `alignment.hsps` and looping around those, as opposed to my hits. `from Bio.Blast import NCBIXML` `with open("XML_Files/46_47.xml") as result_handle:` `blast_records = NCBIXML.parse(result_handle)` `x = 0` `for rec in blast_records:` `for alignment in rec.alignments:` `for hsp in alignment.hsps:` `#print(rec.query)#DNA ID` – AccidentalCube Apr 09 '19 at 13:49
  • Oh I see! Well if you are looking for alignments it may be the case that you should use the parameter `alignments=3` instead of `hitlist_size` as defined here: http://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html – cnluzon Apr 09 '19 at 13:52
  • @Cnluzon Do you think you could quickly define alignments vs hitlist? I'm not quite seeing their difference. – AccidentalCube Apr 09 '19 at 13:56
  • I am not 100% sure because I do not see more info in the documentation, but my guess would be that more than one alignment could correspond to a single hit, and that would be the reason why you get more than the specified number of elements in your lists, because you are accessing `rec.alignments`. – cnluzon Apr 09 '19 at 14:08

1 Answers1

1

The problem seems that a hit is something different than an alignment. A hit is a sequence match in the database, whereas an alignment is the actual position of the nucleotides. This could happen several times for the same sequence in the database.

In the case you tried, the xml that this saves has indeed three items under <Hit>, but the last hit you get has seven <Hsp> records, which seem to be what you are iterating through with the loop:

for rec in blast_records:
    for alignment in rec.alignments:
        for hsp in alignment.hsps:
            print(rec.query) #DNA ID

Specifying alignments=3 instead of hits will get you 3 alignments maximum per hit, I think. You can specify both, if you want to control number of hits and number of alignments.

cnluzon
  • 1,054
  • 1
  • 11
  • 22