According to the Whoosh documentation, giving the StemmingAnalyzer an unbounded cache makes batch indexing faster:
writer = myindex.writer()
# Get the analyzer object from a text field
stem_ana = writer.schema["content"].format.analyzer
# Set the cachesize to -1 to indicate unbounded caching
stem_ana.cachesize = -1
# Reset the analyzer to pick up the changed attribute
stem_ana.clear()
# Use the writer to index documents...
The only problem is that documents are no longer being indexed after I do that. Here's my schema:
schema = Schema(
    title=TEXT(stored=True, analyzer=StemmingAnalyzer(), field_boost=2.0),
    content=TEXT(stored=True, analyzer=StemmingAnalyzer()),
    owner=NUMERIC(stored=True),
    id=ID(stored=True, unique=True),
    date=DATETIME(stored=True, sortable=True),
    author=TEXT(stored=True),
    system=TEXT(stored=True),
    url=TEXT(stored=True),
    type=TEXT(stored=True),
    service=TEXT(stored=True),
    last_updated=fields.DATETIME)
How I index (from XML):
docs = xmlObj.findall('document')
for d in docs:
    ...
    writer.update_document(...)
writer.commit()
After I changed the stemmer caching, nothing shows up when I do:
for doc in ix.reader().iter_docs():
    # doc should be a tuple of (docnum, document)
    print("docnum: {}".format(doc[0]))