1

Given a large Whoosh index, how can I efficiently retrieve n random documents from it?

I can do this horribly inefficiently just by pulling all the documents into memory and using random.sample...

random.sample(list(some_index.searcher().documents()), n)

but that will be horribly inefficient (in terms of memory usage and disk IO) if the index contains a large number of documents.

Mark Amery

2 Answers

0

Just create a new numeric field, ID, that is unique and preferably auto-incrementing. Whoosh has no auto-increment, so you have to do it yourself.

Then, to get your random list, generate a list of random integers using random.randint(1, MAX_ID), then build a search query such as "ID:2 OR ID:16 OR ID:43 OR ..." and run it; the matching documents are your random list.

You can query an interval without knowing the max limit or the min limit. For example:

  • ID:[ 10 to ]
  • ID:[ to 10]
  • ID:[ 1 to 10]
  • ID:2
  • ID:2 | ID:3
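
A minimal sketch of the query-building step described above. The helper name random_id_query and its parameters are made up for illustration, and it assumes you track MAX_ID yourself. It uses random.sample instead of repeated random.randint calls so that the same ID is never drawn twice. If IDs have gaps (for example from deleted documents), the search can return fewer than n hits.

```python
import random


def random_id_query(n, max_id, field="ID", rng=random):
    """Build a query string matching n distinct random IDs in 1..max_id.

    Hypothetical helper: `field` is whatever you named the numeric ID field.
    Raises ValueError if n > max_id, because random.sample cannot draw
    more distinct values than the range contains.
    """
    ids = rng.sample(range(1, max_id + 1), n)
    return " OR ".join(f"{field}:{i}" for i in ids)


if __name__ == "__main__":
    # Prints something like "ID:17 OR ID:4 OR ID:92" (values vary per run).
    print(random_id_query(3, 100))
```

The resulting string would then be parsed with the index's query parser and passed to the searcher with a limit of at least n.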
Assem
  • Hmm... Whoosh has auto-incrementing IDs, and the ability to query the maximum value of an ID property? I'm not too familiar with Whoosh, but I've never come across either such feature; I don't find anything Googling for `whoosh autoincrement` and I don't see anything like a `Max()` class in the [`query`](https://whoosh.readthedocs.org/en/latest/api/query.html) docs. Could you add some more detail or links to this answer? – Mark Amery Feb 10 '16 at 19:11
  • Whoosh doesn't have auto-increment, so you have to do it yourself. And yes, you can query something without knowing the max limit or the min limit, for example: `ID:[ 10 to ]` or `ID:[ to 10]` or `ID:[ 1 to 10]` or `ID:2` or `ID:2 | ID:3` – Assem Feb 28 '16 at 22:03
0

There might be a better way, but what worked for me in similar situations was assigning a random number to every document while indexing. Every document gets a field named rand_id with a random number. You can then generate another random number x at the time of searching and search for rand_id > x. You can then limit the search to n items. If the search didn't yield enough results, search again for rand_id < x and take the rest.
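
A rough sketch of the selection logic, using a plain list of dicts as a stand-in for the index so that it runs without Whoosh. The function name pick_random is made up, and it assumes each document got a rand_id float in [0, 1) at indexing time. Each list comprehension stands in for one range query on rand_id, sorted by rand_id and limited to the number of items still needed.

```python
import random


def pick_random(docs, n, rng=random):
    """Return up to n docs, starting at a random point in rand_id order.

    `docs` is an in-memory stand-in for the index: a list of dicts that
    each carry a 'rand_id' value assigned when the document was indexed.
    """
    x = rng.random()
    ordered = sorted(docs, key=lambda d: d["rand_id"])

    # First "search": documents at or above the random threshold, limited to n.
    picked = [d for d in ordered if d["rand_id"] >= x][:n]

    # Second "search", only if needed: wrap around and take the rest
    # from the documents below the threshold.
    if len(picked) < n:
        missing = n - len(picked)
        picked += [d for d in ordered if d["rand_id"] < x][:missing]
    return picked


if __name__ == "__main__":
    index = [{"title": f"doc{i}", "rand_id": random.random()} for i in range(1000)]
    for doc in pick_random(index, 5):
        print(doc["title"], round(doc["rand_id"], 4))
```

Note that the documents returned by one call are neighbours in rand_id order, so the same neighbours tend to show up together across calls unless rand_id values are reassigned. Whether that matters depends on how uniform the sample needs to be.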

kichik