I have an index schema like the following:
from whoosh.fields import Schema, TEXT, ID, NUMERIC

schema = Schema(
    title=TEXT(stored=True),
    content=TEXT,
    id=ID,
    topicID=NUMERIC,
)
I first get the documents for topic t using searcher.documents(topicID=t). This returns hits. I'd like to get the bag-of-words representation of the hits' content field, for instance [(u'This',1),(u'is',1),(u'a',1),(u'document',1)] when content=u'This is a document'.
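To make the target output concrete, here is a minimal plain-Python sketch of the representation I mean. It does not use Whoosh at all; the helper name bag_of_words is hypothetical, and it assumes simple whitespace tokenization with no lowercasing or stop-word removal. As far as I understand, Whoosh's default analyzer for TEXT fields lowercases and drops stop words, so the terms it actually indexes would probably differ from this.

```python
from collections import Counter


def bag_of_words(text):
    """Return (token, count) pairs for a string.

    Hypothetical helper for illustration only: tokens are split on
    whitespace, with no lowercasing, stemming, or stop-word removal.
    """
    counts = Counter(text.split())
    # Counter keeps first-seen order on Python 3.7+, so the pairs come
    # back in the order the tokens first appear in the text.
    return list(counts.items())


pairs = bag_of_words(u'This is a document')
```

What I'm after is the same kind of list of (term, count) pairs, but computed from the index for each hit rather than by re-tokenizing the raw text.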
If there is a way to get the bag-of-words representation (or TF-IDF) more generally in Whoosh - perhaps of documents rather than hits - that is acceptable as well.
EDIT: I'd like a solution that precomputes the bag-of-words/TF-IDF at indexing time, so that getting the representation is a one-liner function call or variable lookup, instead of doing the processing live each time I want the representation.