How do I get the bag-of-words representation of document content with Whoosh?

Question

I have an index schema like the following:

schema = Schema(
    title=TEXT(stored=True),
    content=TEXT,
    id=ID,
    topicID=NUMERIC,
)

I first get the documents for topic t using searcher.documents(topicID=t), which returns hits. I'd like to get the bag-of-words representation of each hit's content field, for instance [(u'This', 1), (u'is', 1), (u'a', 1), (u'document', 1)] when content=u'This is a document'.

If there is a way to get the bag-of-words representation (or TF-IDF) more generally in Whoosh - perhaps of documents rather than hits - that is acceptable as well.

EDIT: I'd like a solution that precomputes the bag-of-words/TF-IDF at indexing time, so that getting the representation afterwards is a one-liner function or variable, instead of redoing the processing live each time I want the representation.

score 2 · Answer 1 · answered Feb 29 '16 at 13:21

whoosh.reading.IndexReader already implements methods for this:

whoosh.reading.IndexReader.frequency(fieldname, text)

Returns the total number of instances of the given term in the collection.

whoosh.reading.IndexReader.doc_frequency(fieldname, text)

Returns how many documents the given term appears in.

To iterate through all indexed terms, use:

whoosh.reading.IndexReader.all_terms()

Yields (fieldname, text) tuples for every term in the index.

answered Feb 29 '16 at 13:21

Assem

Thanks! While this is close, this doesn't seem to directly return the representation that I specified above. Again, what is desirable is a solution where the TF-IDF or BOW is stored in preprocessing or is retrievable in a one-liner. Not obtained through iteration. – Matt Mar 01 '16 at 22:59
You should write a function that iterates over your words and uses `Your_reader.frequency('content', 'document')` to get their frequencies. – Assem Mar 02 '16 at 07:55
This answer is missing two things, which I combined to make the TF-IDF work: `content=TEXT(stored=True, vector=Frequency())` in the schema, and `docnum = searcher.document_number(id=docid); print [doccount for doccount in searcher.vector(docnum, "content").items_as("frequency")]`. This gives the BOW for a single document. – Matt Mar 03 '16 at 21:09
@Matt, feel free to add your example to the answer – Assem Mar 03 '16 at 21:14

score 0 · Answer 2 · answered Feb 24 '16 at 01:34

You could use a Counter for that:

from collections import Counter

# Example content from the question
content = 'This is a document'
bow = Counter(content.split())

gives

Counter({'This': 1, 'is': 1, 'a': 1, 'document': 1})

See the Python documentation for collections.Counter for details.

Edit: Forgot some brackets
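To move this Counter idea towards the "compute once, look up later" behaviour asked for in the question, one option is to build all the counters up front and keep them in a dictionary. The sketch below uses only the standard library; the docs dictionary, the document ids, and the tfidf helper are hypothetical, and the TF-IDF weighting (raw term count times log(N / df)) is just one common variant.

```python
import math
from collections import Counter

# Hypothetical corpus: document id -> content.
docs = {
    "d1": "This is a document",
    "d2": "This is another document about a document",
}

# Precompute one bag of words per document, once.
bows = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Document frequency: in how many documents each term occurs.
df = Counter()
for bow in bows.values():
    df.update(bow.keys())


def tfidf(doc_id):
    """TF-IDF weights for one document, using the precomputed counts."""
    n_docs = len(bows)
    return {
        term: tf * math.log(n_docs / df[term])
        for term, tf in bows[doc_id].items()
    }


# Afterwards, fetching a representation is a plain lookup.
bow_d1 = list(bows["d1"].items())
weights_d2 = tfidf("d2")
```

Terms that occur in every document get a weight of zero under this weighting, which is the usual effect of the IDF factor.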

answered Feb 24 '16 at 01:34

soultice

Nice. :) Though it's unfortunately not what I want. Sorry, I should've clarified that I'm looking for a solution that does all the processing at the time of indexing, so that little computing time is spent on getting the representation when I ask for it after indexing (e.g. saying something like `content._bow` vs. `Counter(content.split())`) – Matt Feb 24 '16 at 22:26
Sorry it took me so long to respond. What you could do is edit your local Whoosh files and change the return value of the function in question. Or you could subclass the relevant class in your own code and override the method's return value there. – soultice Feb 25 '16 at 14:06
Sorry, I'm not 100% clear on how to do this. Do you have code that demonstrates? – Matt Feb 25 '16 at 20:24
