
Here is an example in which I try to index a large collection with Whoosh:

from whoosh.fields import Schema, TEXT, ID, KEYWORD
from whoosh.index import create_in
from whoosh.writing import BufferedWriter
from multiprocessing import Pool

schema = Schema(name=TEXT(stored=True), m=ID(stored=True), content=KEYWORD(stored=True))
ix = create_in("indexdir", schema)

writer = BufferedWriter(ix, period=15, limit=512, writerargs={"limitmb": 512})

jobs = []
for item in cursor:  # cursor iterates over the source documents
    if len(jobs) < 1024:
        jobs.append(item)
    else:
        # hand the accumulated batch to 8 worker processes
        p = Pool(8)
        p.map(create_barrel, jobs)
        p.close()
        p.join()
        jobs = []
        writer.commit()

The create_barrel function, at the end, does the following:

writer.add_document(name=name, m=item['_id'], content=" ".join(some_processed_data))
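
In full, create_barrel is shaped roughly like this (process_item is a stand-in for the real preprocessing, which I've elided):

def create_barrel(item):
    # `writer` is the module-level BufferedWriter defined above; with a
    # fork-based Pool, each worker process gets its own copy of it
    name = item['name']                       # assumption: name comes from the item
    some_processed_data = process_item(item)  # stand-in for the real processing
    writer.add_document(name=name, m=item['_id'],
                        content=" ".join(some_processed_data))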

Yet after a few hours of running, the index is empty and the only file in indexdir is the lock file _MAIN_0.toc.

The code above kind of works when I switch to AsyncWriter, but for some reason AsyncWriter misses around 90% of the commits, and the standard writer is too slow for me.
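
For reference, the AsyncWriter variant is just a different writer construction; roughly:

from whoosh.writing import AsyncWriter

# AsyncWriter wraps the index; if the index is locked, operations are
# buffered and committed from a background thread once the lock is acquired
writer = AsyncWriter(ix)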

Why does BufferedWriter miss commits?

Moonwalker

1 Answer


The code looks a little problematic for cases where the cursor iterator does not yield an exact multiple of 1024 items.

At the end, fewer than 1024 items will be left in the jobs list when the for-loop exits. Do you handle this remainder after the for-loop?
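
Something like this after the loop would flush the leftover batch (a sketch reusing the Pool/create_barrel setup from your code):

# after the for-loop: process whatever is left over in jobs
if jobs:
    p = Pool(8)
    p.map(create_barrel, jobs)
    p.close()
    p.join()
    writer.commit()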

Besides that: which Whoosh version are you using?

Did you try the latest 2.4.x branch and the default branch code from the repository?

Thomas Waldmann
I've tried only 2.4.1; it is installed by pip. Which other branch should I try? Also, you are right about the other problems with the code, yet I have 3.5 million items and do not really care if some of them go missing. – Moonwalker Apr 12 '13 at 10:32