
I am designing an application that will rely heavily on searching against a Lucene.NET repository. The repository will be built from data in an operational database that is constantly changing. I'm trying to figure out the best strategy for keeping the Lucene repository in sync with the source database. Should I have a service that wakes up every few minutes, queries the database for updated records, and adds to/removes from the Lucene index? Or should I rebuild the Lucene repository every night and tolerate some latency in the data?

What are the best practices for keeping the data in a Lucene repository fresh? How do the different strategies affect latency, performance, etc.?

RationalGeek

1 Answer


Lucene is capable of so-called near-real-time (NRT) search, which means that updates to the index become visible in query results almost instantly. So you can freely push updates as soon as they are saved in the database -- Lucene has no problem handling quite frequent updates; Twitter's search, for example, is built on it (though to sustain a load of that size, you would need to distribute your index).
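To make that concrete, here is a minimal NRT sketch, assuming Lucene.Net 4.8; the field names (`id`, `body`) and the index path are made up for illustration. `SearcherManager` hands out searchers that can see the writer's changes after a cheap refresh, without waiting for a full commit:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var dir = FSDirectory.Open("index");
var writer = new IndexWriter(dir,
    new IndexWriterConfig(LuceneVersion.LUCENE_48, new StandardAnalyzer(LuceneVersion.LUCENE_48)));

// SearcherManager hands out NRT searchers backed by the live writer.
var manager = new SearcherManager(writer, applyAllDeletes: true, new SearcherFactory());

// UpdateDocument = delete-by-term + add, so re-sending the same row is an upsert.
var doc = new Document();
doc.Add(new StringField("id", "42", Field.Store.YES));
doc.Add(new TextField("body", "updated row contents", Field.Store.YES));
writer.UpdateDocument(new Term("id", "42"), doc);

// Make the change searchable without a (slow) full commit.
manager.MaybeRefresh();
var searcher = manager.Acquire();
try
{
    var hits = searcher.Search(new TermQuery(new Term("id", "42")), 1);
    // hits.TotalHits == 1 -- the update is visible almost immediately.
}
finally
{
    manager.Release(searcher);
}
```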

So preferably, you would send your updates from some code that triggers after the database transaction is committed. It is hard to say anything more specific without knowing which database or queuing system you are using. Some general thoughts on the matter, as well as examples using CouchDB or RabbitMQ, can be found in the elasticsearch river documentation.
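A sketch of that shape is below; the class and the `OnRowCommitted` hook are hypothetical -- wire it to whatever after-commit event or queue consumer your data layer exposes:

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class LuceneIndexSync
{
    private readonly IndexWriter _writer;
    public LuceneIndexSync(IndexWriter writer) => _writer = writer;

    // Call this after the database transaction has committed.
    public void OnRowCommitted(string id, string body, bool deleted)
    {
        if (deleted)
        {
            _writer.DeleteDocuments(new Term("id", id));
        }
        else
        {
            var doc = new Document();
            doc.Add(new StringField("id", id, Field.Store.YES));
            doc.Add(new TextField("body", body, Field.Store.YES));
            _writer.UpdateDocument(new Term("id", id), doc); // upsert
        }
        // Note: Commit() is expensive -- for durability, commit in batches
        // (e.g. on a timer) rather than once per row; NRT visibility does
        // not require a commit at all.
    }
}
```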

Artur Nowak
  • So the intent is that you *never* rebuild the Lucene repository from scratch? You build it once at the birth of the application and then just keep feeding it updates? Or is it a good practice to rebuild it now and again? – RationalGeek Nov 23 '11 at 12:42
  • You should never rebuild the index unless you are forced to do so (e.g. by changes in the data structure). The index is kept in good shape by the so-called 'merge policy' (you can find a lot about it on the Web). If you expect many deletions, you may consider running `optimize()` once in a while to physically remove the deleted documents and shrink the index (a delete only marks documents as deleted). – Artur Nowak Nov 23 '11 at 16:03