I'm trying to load data into a Whoosh search index. There are ~7,000,000 rows to insert. First I fetch all the rows from a PostgreSQL database with psycopg2, then I insert them into the Whoosh index with writer.add_document(some_data_here). My writer object is created as follows:
    writer = index.writer(limitmb=1024, procs=5, multisegment=True)
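In sketch form, the whole loading pass looks like this (the schema, table and connection details here are simplified placeholders, not my real ones):

    import os
    import psycopg2
    from whoosh import index
    from whoosh.fields import Schema, ID, TEXT

    # Placeholder schema, table and connection details, for illustration only.
    schema = Schema(id=ID(stored=True, unique=True), title=TEXT(stored=True), body=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = index.create_in("indexdir", schema)

    writer = ix.writer(limitmb=1024, procs=5, multisegment=True)

    conn = psycopg2.connect("dbname=mydb user=me")
    cur = conn.cursor()
    cur.execute("SELECT id, title, body FROM records")

    for row_id, title, body in cur:        # every row goes through add_document once
        writer.add_document(id=str(row_id), title=title, body=body)

    writer.commit()                        # single commit at the end
    cur.close()
    conn.close()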
The problem is that executing index.searcher().documents() (which is supposed to return all the documents in the index) yields a significantly smaller number of rows: around 5,000,000. I can confirm this with another query, simply searching for a Term that matches every record, and I get the same result (around 5 million).
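This is roughly how I count what actually ended up in the index (doc_count() is an extra cross-check; the Term field and value are placeholders for one that every record is known to carry):

    from whoosh import index
    from whoosh.query import Term

    ix = index.open_dir("indexdir")
    with ix.searcher() as searcher:
        print("doc_count():", searcher.doc_count())                  # undeleted docs in the index
        print("documents():", sum(1 for _ in searcher.documents()))  # what documents() yields
        # "lang"/"en" stand in for a field/value present on every record
        print("Term query:", len(searcher.search(Term("lang", "en"), limit=None)))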
I thought this might be a Python concurrency or memory issue, so I tried loading the data in batches: I divided it into equal blocks of 500,000 records, but with no luck, I still end up with fewer documents than I inserted. I also tried playing with the writer's parameters, again without success.
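The batched variant looks roughly like this (placeholder names again; in this sketch each block gets its own writer and commit, and the rows stream through a server-side cursor):

    import psycopg2
    from whoosh import index

    BATCH_SIZE = 500_000

    ix = index.open_dir("indexdir")
    conn = psycopg2.connect("dbname=mydb user=me")
    cur = conn.cursor(name="whoosh_load")   # named (server-side) cursor, fetches rows in chunks
    cur.itersize = BATCH_SIZE
    cur.execute("SELECT id, title, body FROM records")

    loaded = 0
    writer = ix.writer(limitmb=1024, procs=5, multisegment=True)
    for row_id, title, body in cur:
        writer.add_document(id=str(row_id), title=title, body=body)
        loaded += 1
        if loaded % BATCH_SIZE == 0:
            writer.commit()                 # flush this block to disk
            writer = ix.writer(limitmb=1024, procs=5, multisegment=True)
    writer.commit()

    print("rows read from PostgreSQL:", loaded)   # compare against what the searcher reports
    cur.close()
    conn.close()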
I discovered the issue when searching for a record that I knew for certain had to exist, and it wasn't there. I'm running this on a server with 16 GB of RAM and 6 CPUs, so resources shouldn't be an issue.