Per the Whoosh documentation, giving the StemmingAnalyzer an unbounded cache makes batch indexing faster:

writer = myindex.writer()
# Get the analyzer object from a text field
stem_ana = writer.schema["content"].format.analyzer
# Set the cachesize to -1 to indicate unbounded caching
stem_ana.cachesize = -1
# Reset the analyzer to pick up the changed attribute
stem_ana.clear()

# Use the writer to index documents...
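
To illustrate why an unbounded cache can help here, below is a minimal sketch in plain Python 3. It does not use Whoosh and is not how Whoosh implements its cache: `toy_stem` is a hypothetical stand-in for a real stemmer, and `functools.lru_cache(maxsize=None)` stands in for an unbounded cache. The point is only that repeated words are stemmed once and then served from the cache.

```python
from functools import lru_cache

calls = 0  # counts how often the stemming step actually runs


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    global calls
    calls += 1
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# maxsize=None means entries are never evicted (an unbounded cache)
cached_stem = lru_cache(maxsize=None)(toy_stem)

tokens = "indexing indexed indexes indexing indexed indexes".split()
stems = [cached_stem(t) for t in tokens]

print(stems)                     # every token reduces to the same stem
print(calls)                     # number of real stemming calls
print(cached_stem.cache_info())  # hits/misses recorded by the cache
```

The trade-off is memory: an unbounded cache keeps every distinct word it has seen, which is usually acceptable for a one-off batch job but may not be for a long-running process.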

The only problem is that documents are not being indexed after I do that. Here's my schema:

schema = Schema(
    title=TEXT(stored=True, analyzer=StemmingAnalyzer(), field_boost=2.0),
    content=TEXT(stored=True, analyzer=StemmingAnalyzer()),
    owner=NUMERIC(stored=True),
    id=ID(stored=True, unique=True),
    date=DATETIME(stored=True, sortable=True),
    author=TEXT(stored=True),
    system=TEXT(stored=True),
    url=TEXT(stored=True),
    type=TEXT(stored=True),
    service=TEXT(stored=True),
    last_updated=fields.DATETIME)

How I index (from XML):

docs = xmlObj.findall('document')
for d in docs:
    ...

    writer.update_document(...)

writer.commit()

After I changed the stemmer caching, nothing shows up when I do:

for doc in ix.reader().iter_docs():
    # doc should be a tuple of (docnum, document)
    print "docnum: {}".format(doc[0])
Hakim

1 Answer

It looks like you only call update_document when you index.

If that call is not adding documents that are not already in the index, nothing will get indexed.

Try writer.add_document(...) instead and check whether the documents show up.

mmadvert