16

What is the best practice to backup a lucene index without taking the index offline (hot backup)?

yannisf
  • 6,016
  • 9
  • 39
  • 61
  • Lucene in Action prescribes a way using SnapshotDeletionPolicy but does not explain much more. That is, while an IndexWriter has been opened with SnapshotDeletionPolicy, can other IndexWriters write to the index? If not, can "this" IndexWriter write to the index? – yannisf May 05 '11 at 12:39

4 Answers

22

You don't have to stop your IndexWriter in order to take a backup of the index.

Just use the SnapshotDeletionPolicy, which lets you "protect" a given commit point (and all files it includes) from being deleted. Then, copy the files in that commit point to your backup, and finally release the commit.

It's fine if the backup takes a while to run -- as long as you don't release the commit point with SnapshotDeletionPolicy, the IndexWriter will not delete the files (even if, eg, they have since been merged together).

This gives you a consistent backup which is a point-in-time image of the index without blocking ongoing indexing.

I wrote about this in Lucene in Action (2nd edition), and there's a paper excerpted from the book, "Hot Backups with Lucene", available free from http://www.manning.com/hatcher3, that describes this in more detail.
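In code, the approach looks roughly like this — a sketch against the Lucene 3.0-era API that the book covers (the index path is a placeholder, and the copy step is left as a comment):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class HotBackup {

    public static void backup(SnapshotDeletionPolicy snapshotter) throws Exception {
        try {
            // Protect the current commit point; the IndexWriter will not
            // delete any file it references until release() is called,
            // even if those files have since been merged away.
            IndexCommit commit = snapshotter.snapshot();
            for (String fileName : commit.getFileNames()) {
                // copy fileName from the index directory to the backup location
            }
        } finally {
            snapshotter.release();  // the files become deletable again
        }
    }

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));  // placeholder
        SnapshotDeletionPolicy snapshotter =
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30), snapshotter,
            IndexWriter.MaxFieldLength.UNLIMITED);
        backup(snapshotter);  // indexing may continue on other threads meanwhile
        writer.close();
    }
}
```

The key point is that the snapshot only pins files against deletion; it does not block the writer, so indexing proceeds normally while the copy runs.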

Michael McCandless
  • 1,176
  • 7
  • 5
  • Thank you for replying! In the book it is not clear though whether I can always use the SnapshotDeletionPolicy decorator for my IndexWriters. Since I will use just one IndexWriter for my application that will not be closing frequently, can it always use SnapshotDeletionPolicy or this will have impact on the performance? – yannisf May 13 '11 at 07:10
  • 1
    You can just use it always; there will be no real change to perf. – Michael McCandless May 13 '11 at 19:13
2

This answer depends upon (a) how big your index is and (b) what OS you are using. It is suitable for large indexes hosted on Unix operating systems, and is based upon the Solr 1.3 replication strategy.

Once a file has been created, Lucene will never change it; it will only delete it. Therefore, you can use a hard-link strategy to make a backup. The approach would be:

  • stop indexing (and do a commit?), so that you can be sure you won't snapshot mid-write
  • create a hard link copy of your index files (using cp -lr)
  • restart indexing

The cp -lr will only copy the directory structure and not the files, so even a 100Gb index should copy in less than a second.
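The hard-link copy can also be done programmatically; here is a sketch using only the Java standard library (the class name is illustrative, and it assumes a flat index directory on a filesystem that supports hard links):

```java
import java.io.IOException;
import java.nio.file.*;

public class HardLinkBackup {

    // Hard-link every regular file in srcDir into dstDir -- the moral
    // equivalent of `cp -lr` for a flat Lucene index directory.
    public static void backup(Path srcDir, Path dstDir) throws IOException {
        Files.createDirectories(dstDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path src : files) {
                if (Files.isRegularFile(src)) {
                    // A hard link shares the file's data blocks, so no
                    // bytes are copied regardless of the index size.
                    Files.createLink(dstDir.resolve(src.getFileName()), src);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        backup(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because only directory entries are created, the "copy" completes in roughly constant time, which is why even a very large index snapshots almost instantly.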

Upayavira
  • 21
  • 1
  • Your answer seems to be the most performant, but: 1. it works only on filesystems where hard links are supported, and 2. it requires that all indexing operations are stopped. Could there be another way? – yannisf May 05 '11 at 13:17
  • If your filesystem does not support hard links, then you're going to have to copy the files, which is slower, but still works. Actually, it doesn't require that indexing operations are stopped, it just requires that no commits/writes are done while the copy is happening. – Upayavira May 05 '11 at 13:46
  • Which I am afraid means that indexing should be stopped (searching is allowed). Anyway, your answer is probably the best one might try. – yannisf May 05 '11 at 14:55
1

In my opinion it would typically be enough to stop any ongoing indexing operation and simply take a file copy of your index files. Also look at the snapshooter script from Solr, found in apache-solr-1.4.1/src/scripts, which essentially does:

cp -lr indexLocation backupLocation

Another option might be to have a look at the Directory.copy(..) routine for a programmatic approach (e.g., using the same Directory instance given as a constructor parameter to the IndexWriter). You might also be interested in Snapshooter.java, which does the equivalent of the script.
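For the programmatic route, a minimal sketch against the Lucene 3.x API, where Directory exposed a static copy(src, dest, closeDirSrc) helper (both paths are placeholders):

```java
import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CopyIndex {
    public static void main(String[] args) throws Exception {
        Directory source = FSDirectory.open(new File("/path/to/index"));   // placeholder
        Directory backup = FSDirectory.open(new File("/path/to/backup"));  // placeholder
        // Copy every file of the source index; pass false so the source
        // directory stays open for ongoing searches.
        Directory.copy(source, backup, false);
        backup.close();
        source.close();
    }
}
```

As with the script, this should only run while no commits are in flight, otherwise the copy may see a half-written commit point.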

Johan Sjöberg
  • 47,929
  • 21
  • 130
  • 148
0

Create a new index with a separate IndexWriter and use addIndexesNoOptimize() to merge the running index into the new one. This is very slow, but it allows you to keep the original index operational while doing the backup.

However, you cannot write to the index while merging. So even if it is online and you can query the index, you cannot write to it during the backup.
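A sketch of that approach against the Lucene 3.0-era API (the paths and the analyzer choice are placeholders):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeBackup {
    public static void main(String[] args) throws Exception {
        Directory liveDir = FSDirectory.open(new File("/path/to/index"));    // placeholder
        Directory backupDir = FSDirectory.open(new File("/path/to/backup")); // placeholder
        IndexWriter backupWriter = new IndexWriter(backupDir,
            new StandardAnalyzer(Version.LUCENE_30), true,  // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);
        // Merge the live index's segments into the backup index; searches
        // against the live index keep working, but writes must wait.
        backupWriter.addIndexesNoOptimize(new Directory[] { liveDir });
        backupWriter.close();
    }
}
```

The merge rewrites every segment into the backup directory, which is why this is far slower than a file-level or hard-link copy.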

Karl-Bjørnar Øie
  • 5,554
  • 1
  • 24
  • 30
  • This is what I used to do, but as you said, it is not hot and it is slow. Thus, the best solution is the cp -lr method, since it only takes millis. – yannisf May 06 '11 at 08:43
  • 1
    yeah, being pragmatic is better, taking the index offline for a little while is preferable to a slow, slow semi-hot copy :) – Karl-Bjørnar Øie May 09 '11 at 14:41