4

I want to find a phrase in a document. I've used the code from the quick start:

>>> from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
>>> ix = create_in("indexdir", schema)
>>> writer = ix.writer()
>>> writer.add_document(title=u"First document", path=u"/a", content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Second document", path=u"/b",  content=u"The second one is even more interesting!")
>>> writer.commit()
>>> from whoosh.qparser import QueryParser
>>> with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("first")
        results = searcher.search(query)
        results[0]

    result: {"title": u"First document", "path": u"/a"}

But I find that it splits the keywords into single words and then searches the document for each of them. If I want to search for a phrase like "the first guy here in the document", what should I do?

The documentation says to use

"it is a phrase"

if I want to search for:

it is a phrase.

That confuses me.

Also, here is a class that seems like it could help me, but I don't know how to use it.

class whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
 Matches documents containing a given phrase.

Update: I used it this way, but there are no matches.

from whoosh.index import create_in
from whoosh.fields import *
schema = Schema(title=TEXT(stored=True), path=ID(stored=True),   content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a",
                 content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", path=u"/b",
               content=u"The second one is even more interesting!")
writer.commit()
from whoosh.query import Phrase

a = Phrase("content", u"the first")

results = ix.searcher().search(a)
print results

result:

<Top 0 Results for Phrase('content', u'the first', slop=1, boost=1.000000) runtime=0.0>

Update according to theOther

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse('"first x document"')
    # Search and read the hit while the searcher is still open
    results = searcher.search(query)
    print results[0]

result: <Hit {'content': u"This is the first document we've added!", 'path': u'/a', 'title': u'First document'}>

I think there should be no match, since "first x document" does not appear in the document. Otherwise, it is not an exact match.

daniel_zeng

2 Answers

3

You should pass Phrase a list of words, not a string, as its second argument, and also drop "the" because it is a stop word:

a = Phrase("content", [u"first",u"document"])

instead of

a = Phrase("content", u"the first")

From the documentation:

class whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
Matches documents containing a given phrase.

Parameters:

fieldname – the field to search.

words – a list of words (unicode strings) in the phrase.
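
To see why a plain string gives no results, here is a toy illustration. It is not Whoosh's internal code: `indexed_terms` and `all_terms_indexed` are hypothetical names, and the assumption is only that a phrase query iterates over whatever is passed as `words` and looks up each element as a term.

```python
# Toy illustration (not Whoosh internals): a phrase is treated as a
# sequence of terms, so whatever is passed as `words` gets iterated.
indexed_terms = {"first", "document", "added"}  # hypothetical terms in the index


def all_terms_indexed(words):
    """Return True only if every element of `words` is an indexed term."""
    return all(word in indexed_terms for word in words)


# A string iterates character by character.
print(list("first document")[:5])
# Single characters are not indexed terms, so nothing can match.
print(all_terms_indexed("first document"))
# A list iterates word by word, which is what the phrase needs.
print(all_terms_indexed(["first", "document"]))
```

Under that assumption, Phrase("content", u"the first") would be looking for the terms "t", "h", "e" and so on, none of which are in the index.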

The natural way to run a phrase search in Whoosh is to put the phrase in double quotes inside the QueryParser query string:

>>> with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse('"first document"')
        results = searcher.search(query)
        results[0]

Update: "first x document" matches because x, like all one-character words, is treated as a stop word and filtered out, so the query that is actually searched is "first document".
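
The filtering effect can be sketched with a small toy model. This is not Whoosh's analyzer: the stop list, the `MIN_SIZE` rule and the function names are assumptions chosen to mirror the behaviour described above.

```python
# Toy model only: this is NOT Whoosh's code. The stop list and the
# "drop tokens shorter than two characters" rule are assumptions.
import re

STOP_WORDS = frozenset(["a", "an", "is", "the", "this", "we"])  # illustrative subset
MIN_SIZE = 2  # assumed minimum token length


def analyze(text):
    """Lowercase, tokenize, and drop stop words and very short tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= MIN_SIZE]


def phrase_matches(document, phrase):
    """Return True if the analyzed phrase occurs contiguously in the analyzed document."""
    doc, words = analyze(document), analyze(phrase)
    n = len(words)
    return n > 0 and any(doc[i:i + n] == words for i in range(len(doc) - n + 1))


doc = "This is the first document we've added!"
print(analyze("first x document"))                   # "x" is dropped before matching
print(phrase_matches(doc, "first x document"))       # compares "first document"
print(phrase_matches(doc, "first second document"))  # "second" survives filtering
```

In this model the one-character token disappears from the query before any comparison happens, so the match is exact with respect to the analyzed terms rather than the raw text.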

Assem
  • Hi, first of all, thanks for your response. But there are still some issues. 1. For the list, how does Whoosh know the exact phrase formed by the words in the list? Are they combined one by one, in sequence? I changed the document content to "This is the first x document we've added!" and used the phrase Phrase("content", [u"first", u"document"]), but there is still a match. I want an exact match with the phrase. – daniel_zeng Oct 22 '15 at 02:39
  • 2. Then I tried the second way: the document content is still "This is the first document we've added!" and the query is query = QueryParser("content", ix.schema).parse('"first x document"'), and there is also a match. That does not seem to be an exact match with the query. – daniel_zeng Oct 22 '15 at 02:42
  • It matches because `x`, like all one-character words, is a stop word. "first x document" after filtering the stop words would be "first document". – Assem Oct 22 '15 at 07:46
  • @Assem this is not correct. see a MWE https://stackoverflow.com/questions/72688735/whoosh-phrase-search-return-empty-result-with-upper-case-character – LearnToGrow Jun 20 '22 at 14:44
  • @LearnToGrow maybe it's different in the newer versions – Assem Jul 25 '22 at 20:31
1

To find a phrase in the content field, use phrase=True when defining the Schema, as follows:

schema = Schema(title=TEXT(stored=True), content=TEXT(phrase=True))

Then simply put the phrase in double quotes inside a single-quoted string, as follows:

query = QueryParser("content", schema=ix.schema).parse('"exact phrase"')
Max