6

I am implementing a search engine with Solr that imports at least 2 million documents per day. Users must be able to search the imported documents as soon as possible (near real-time).

I am using two dedicated Windows x64 servers with Tomcat 6 (Solr in shard mode). Each server indexes about 120 million documents, roughly 220 GB each (about 500 GB total).

I want to take incremental backups of the Solr index files while updates or searches are running.
After searching for a solution, I found the rsync tool for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get a "file has vanished" error during updates.

How can I solve this problem?

Note 1: Plain file copy is really slow when the files are very large, so I can't use that approach.

Note 2: Can I prevent the index files from being corrupted during an update if Windows crashes, the hardware resets, or some other problem occurs?

tshepang
  • 12,111
  • 21
  • 91
  • 136
Hamid
  • 1,099
  • 3
  • 22
  • 37

2 Answers

16

You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:

http://host:8080/solr/replication?command=backup&location=/home/jboss/backup

Obviously you could script that with wget+cron.
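As a minimal sketch of that wget+cron approach (the host, port, and backup location are the placeholders from the URL above; adjust them for your installation, and replace `echo` with the `wget` call to actually fire the request):

```shell
#!/bin/sh
# Trigger a hot backup via Solr's ReplicationHandler.
# Host/port and target directory are placeholders from the answer above.
BACKUP_URL="http://host:8080/solr/replication?command=backup&location=/home/jboss/backup"

# Dry run: print the URL. To run it for real, use:  wget -q -O - "$BACKUP_URL"
echo "$BACKUP_URL"

# Hypothetical crontab entry to take a backup nightly at 02:00:
# 0 2 * * * /path/to/solr-backup.sh
```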

More details can be found here:

http://wiki.apache.org/solr/SolrReplication

The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.

Community
  • 1
  • 1
Paul A Jungwirth
  • 23,504
  • 14
  • 74
  • 93
8

Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.

Some ideas to work around it:

  • Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
  • Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.
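The second idea above could be scripted roughly like this, assuming Solr's legacy master/slave replication (which supports `disablepoll`/`enablepoll` on a slave); the host, port, and the core name `passive` are placeholders, and `echo` stands in for the real `wget` call:

```shell
#!/bin/sh
# Sketch: back up the passive core with replication paused.
# Placeholders: host, port, and core name "passive".
RUN="echo"   # dry run; change to:  RUN="wget -q -O -"
PASSIVE="http://host:8080/solr/passive/replication"

$RUN "$PASSIVE?command=disablepoll"                          # stop pulling updates from the active core
$RUN "$PASSIVE?command=backup&location=/home/jboss/backup"   # snapshot the quiesced index
# ... wait for the backup to finish (e.g. poll command=details) ...
$RUN "$PASSIVE?command=enablepoll"                           # resume replication
```

Keeping the pause as short as possible matters, since the passive core falls behind the active one while polling is disabled.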
Mauricio Scheffer
  • 98,863
  • 23
  • 192
  • 275
  • 2
    @Karussell: it's just a copy and not a proper backup by itself since you can't apply backup policies like off-site storage, incremental/differential/full backup, etc. There's a lot more to backup than just copying stuff. – Mauricio Scheffer Jun 22 '10 at 04:24
  • I am not very familiar with backup stuff. But what is off-site storage? (You could place the passive index on another server) – Karussell Jun 22 '10 at 08:13
  • 1
@Karussell: off-site storage: placing copies of the backup in other buildings/cities/states/countries. The passive index should be as close as possible to the main index to make replication fast. The backup should also be done close to the passive index so replication stays disabled for as little time as possible. Only once you have that backup can you choose to store it off-site. – Mauricio Scheffer Jun 22 '10 at 13:29