According to the Whoosh docs (and a previous question on SO), you can search for an exact phrase in Whoosh by wrapping the phrase in double quotes. When I try to implement an exact phrase search, though, I get back what appear to be the results of the default search syntax. How can I change my search so that it matches only those portions of the queried document (Project Gutenberg's Gulliver's Travels) that contain the exact phrase "government of reason"? I would be grateful for any pointers.

from whoosh.index import create_in
from whoosh.fields import *
from whoosh import qparser
import os, codecs, nltk

def remove_non_ascii(s):
    return "".join(x for x in s if ord(x) < 128)

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

schema = Schema(content=TEXT(stored=True, analyzer=analysis.StandardAnalyzer(stoplist=None)))

ix = create_in("indexdir", schema)
writer = ix.writer()
# Read the whole book and strip the underscores Gutenberg uses for emphasis
with codecs.open("gulliver.txt", "r", "utf-8") as f:
    gulliver = f.read().replace("_", "")
writer.add_document(content=gulliver)
writer.commit()

searcher = ix.searcher()

parser = qparser.QueryParser("content", schema=ix.schema)
# The inner double quotes are what mark this as a phrase query
q = parser.parse(u'"government of reason"')
results = searcher.search(q)
results.fragmenter.charlimit = None

for hit in results:
    print(" ".join(remove_non_ascii(nltk.clean_html(hit.highlights("content", top=1000000))).split()))

EDIT

Matt Chaput offered some code that should return exact phrases in the hit highlights of a given query in this short post, but I can't get his method to work.
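
To make the goal concrete, here is a minimal plain-Python sketch of the output I'm after. It does not use Whoosh or Matt Chaput's method. The function name `phrase_snippets`, its `context` parameter, and the sample text are all made up for illustration (the sample is not a quotation from the book).

```python
import re


def phrase_snippets(text, phrase, context=40):
    """Return a snippet of `text` around each exact occurrence of `phrase`.

    Hypothetical helper, not part of Whoosh: it only illustrates the
    desired result (snippets that contain the whole phrase, in order).
    """
    # Allow any run of whitespace, including line breaks, between the words
    words = phrase.split()
    pattern = re.compile(
        r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b",
        re.IGNORECASE,
    )
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        # Collapse whitespace so wrapped lines come out as a single line
        snippets.append(" ".join(text[start:end].split()))
    return snippets


# Made-up sample text: the words appear once as a phrase and once scrambled
sample = (
    "Reason alone is not government. The government\n"
    "of reason differs from the reason of government."
)

for snippet in phrase_snippets(sample, "government of reason"):
    print(snippet)
```

Only the passage where the three words appear consecutively and in order should be returned. The scrambled "reason of government" should be skipped, which is the behaviour I would like from the Whoosh highlights.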
