
I'm trying to load data into the Whoosh search engine. There are ~7 000 000 rows to insert. First I fetch all the rows from a PostgreSQL database with psycopg2, then I insert them into the Whoosh index with writer.add_document(some_data_here). My writer object is created as follows:

writer = index.writer(limitmb=1024, procs=5, multisegment=True)
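
For reference, the loading code is roughly the following sketch (connection string, table, field names and index directory are placeholders, not the real schema; the index is assumed to already exist with matching fields):

import psycopg2
from whoosh import index as whoosh_index

# Placeholder connection and query - the real schema is different
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()
cur.execute("SELECT id, title, body FROM my_table")

ix = whoosh_index.open_dir("indexdir")  # placeholder index directory
writer = ix.writer(limitmb=1024, procs=5, multisegment=True)

for row_id, title, body in cur:
    writer.add_document(id=str(row_id), title=title, body=body)

writer.commit()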

The problem is that executing index.searcher().documents() (which is supposed to return all the documents in the index) returns a significantly smaller number of documents - around 5 000 000. I can confirm this with another query: searching for a Term that matches every record gives an identical result (around 5 million).
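
This is roughly how I count the documents (a minimal sketch; the index directory name is a placeholder):

from whoosh import index

ix = index.open_dir("indexdir")  # placeholder index directory

# doc_count() excludes deleted documents, doc_count_all() includes them;
# comparing the two shows whether documents were silently deleted
print(ix.doc_count(), ix.doc_count_all())

with ix.searcher() as searcher:
    # Iterating over all stored documents gives the same ~5 million
    print(sum(1 for _ in searcher.documents()))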

I thought this might be some Python concurrency or memory issue, so I tried loading the data in batches - I divided it into equal blocks of 500 000 records, but with no luck: I still end up with fewer documents than expected. I also tried playing with the writer's parameters, again without success.
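
The batched variant looked roughly like this (again a sketch with placeholder names; here the rows are streamed via a server-side cursor rather than fetched up front, and a fresh writer is opened for every batch because commit() finalizes the previous one):

import psycopg2
from whoosh import index

BATCH_SIZE = 500_000

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
# Named (server-side) cursor, so the ~7M rows are streamed instead of
# being loaded into memory at once
cur = conn.cursor(name="whoosh_export")
cur.execute("SELECT id, title, body FROM my_table")  # placeholder schema

ix = index.open_dir("indexdir")  # placeholder index directory

while True:
    rows = cur.fetchmany(BATCH_SIZE)
    if not rows:
        break
    writer = ix.writer(limitmb=1024, procs=5, multisegment=True)
    for row_id, title, body in rows:
        writer.add_document(id=str(row_id), title=title, body=body)
    writer.commit()
    print("documents in index so far:", ix.doc_count())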

I discovered the issue when trying to search for a record that I knew for certain had to exist - it didn't. I'm running this on a server with 16 GB of RAM and 6 CPUs, so resources shouldn't be an issue.

adamczi
  • Not a direct answer to your Q, but if you've got 7 million rows in postgres, why are you using whoosh at all? You'd be better off applying an FTI in postgres than resorting to whoosh... I don't think it's really meant to be used at that scale. – Jon Clements Nov 25 '18 at 22:31
  • Hi, thanks for your opinion - I'm dealing with existing software that needed minor updates and encountered this "bug". About the amount - I saw [others doing even more](https://groups.google.com/forum/#!topic/whoosh/lB51uBBbkeg) and not reporting issues, so I believe there is a way. – adamczi Nov 25 '18 at 22:42
  • Yeah sure... I just personally wouldn't use Whoosh for this if you've already got a perfectly good DB backend that can do it (and not in Python). You have forced a write to the index I take it and checked that you are indeed requesting the entire 7m to be written to it? – Jon Clements Nov 25 '18 at 22:46
  • That is correct, I'm taking the entire postgres table and inserting it into Whoosh, either at once or in batches. – adamczi Nov 25 '18 at 22:48
  • Umm okies... I've not got any ideas I'm afraid... never had what you're describing myself... good luck in solving it. – Jon Clements Nov 25 '18 at 22:52
  • Generally you could try to reproduce the effect with a smaller dataset (e.g. 10 000 rows) to find out if it happens always or only if some size limit is exceeded. – Michael Butscher Nov 25 '18 at 23:15
  • I managed to load all ~7 mln rows by splitting them into batches of 400 000. Then I ran each of those (about 20) manually, checking the number of documents in the index after each one. I had to do this manually, because every scripted approach failed to load 100% of the data. Maybe Whoosh needs some timeout after `commit`? Question unresolved. – adamczi Nov 28 '18 at 10:04

0 Answers