14

We are designing the search architecture for a corporate web application, using Lucene.net. The indexes will not be big (about 100,000 documents), but the search service must always be up and always up to date. New documents will be added to the index all the time, alongside concurrent searches. Since we must have high availability for the search system, we have 2 application servers, each exposing a WCF service to perform searches and indexing (a copy of the service runs on each server). The service then uses the Lucene.net API to access the indexes.

The problem is, what would be the best solution to keep the indexes synced all the time? We have considered several options:

  • Using one server for indexing and having the 2nd server access the indexes via SMB: no can do, because that leaves us with a single point of failure;

  • Indexing to both servers, essentially writing every document twice: probably lousy performance, plus the possibility of desync if, e.g., server 1 indexes OK and server 2 runs out of disk space or whatever;

  • Using Solr or Katta to wrap access to the indexes: nope, we cannot have Tomcat or similar running on the servers; we only have IIS.

  • Storing the index in a database: I found this can be done with the Java version of Lucene (the JdbcDirectory module), but I couldn't find anything similar for Lucene.net. Even if it meant a small performance hit, we'd go for this option, because it would cleanly solve the concurrency and syncing problems with minimum development.

  • Using the Lucene.net DistributedSearch contrib module: I couldn't find a single link with documentation about it. Even looking at the code, I can't tell exactly what it does, but it seems to split the index across multiple machines, which is not what we want.

  • rsync and friends, copying the indexes back and forth between the 2 servers: this feels hackish and error-prone to us. If the indexes grow big, the copy might take a while, and during that period we would be returning either corrupt or inconsistent data to clients, so we'd have to develop some ad hoc locking policy, which we don't want to do.

I understand this is a complex problem, but I'm sure lots of people have faced it before. Any help is welcome!

axel_c

5 Answers

7

It seems that the best solution would be to index the documents on both servers into their own copy of the index.

If you are worried about the indexing succeeding on one server and failing on the other, then you'll need to keep track of the success/failure for each server so that you can re-try the failed documents once the problem is resolved. This tracking would be done outside of Lucene in whatever system you are using to present the documents to be indexed to Lucene. Depending on how critical the completeness of the index is to you, you may also have to remove the failed server from whatever load balancer you are using until the problem has been fixed and indexing has reprocessed any outstanding documents.
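The success/failure tracking described above can be sketched roughly like this. This is a hypothetical dispatcher in Python, not any real Lucene.net API; the server names and the `index_fn` callback are placeholders for whatever system feeds documents to Lucene.

```python
# Hypothetical sketch: index every document on every server and track
# failures per server, outside Lucene, so they can be retried later.
# All names here are invented for illustration.

class DualIndexDispatcher:
    def __init__(self, servers, index_fn):
        self.servers = servers                  # e.g. ["app1", "app2"]
        self.index_fn = index_fn                # callable(server, doc) -> bool
        self.failed = {s: [] for s in servers}  # docs to retry, per server

    def index(self, doc):
        for server in self.servers:
            try:
                ok = self.index_fn(server, doc)
            except Exception:
                ok = False
            if not ok:
                # Record outside Lucene so the doc can be re-sent later.
                self.failed[server].append(doc)

    def retry_failed(self, server):
        """Re-send outstanding docs once the server is healthy again."""
        pending, self.failed[server] = self.failed[server], []
        for doc in pending:
            try:
                ok = self.index_fn(server, doc)
            except Exception:
                ok = False
            if not ok:
                self.failed[server].append(doc)
```

A load balancer check could then drop any server whose `failed` backlog is non-empty until `retry_failed` drains it.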

Sean Carpenter
  • Sean, this is currently our candidate option. I agree with you and itsadok that it seems the sanest choice. I'm also trying to find the sources for JdbcDirectory to see if a port to .NET + SQL Server would be feasible. I'll keep the question open for a while to see if new approaches come up, and will accept this answer otherwise. – axel_c Jun 03 '09 at 12:44
  • I checked the same thing once. It didn't seem worth the effort, as there is a bunch of DB-transaction-related stuff that is not trivial to port to .NET. There were also complaints of reduced speed using the JdbcDirectory stuff. The source is in the Compass project: http://svn.compass-project.org/svn/compass/trunk/src/main/src/org/apache/lucene/store/jdbc/ – Sean Carpenter Jun 03 '09 at 22:39
  • After some thinking, this is what I see as the most viable solution: when an indexing/deindexing request is received, insert a row in a shared DB table that works as a queue. Implement a simple Win32 service that runs on both app servers and polls the queue every X seconds, indexing the content locally. When the content is successfully indexed, the service marks the item as processed; otherwise it keeps trying. – axel_c Jun 04 '09 at 08:29
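A rough sketch of that queue-table scheme, with an in-memory list standing in for the shared DB table and invented names throughout (the real thing would be a SQL table polled by a Windows service):

```python
# Sketch of the shared-queue idea: one row per document, with a
# per-server "done" flag, so each app server indexes independently
# and failed attempts are simply retried on the next poll.

class IndexQueue:
    """Stand-in for the shared DB table."""
    def __init__(self, servers):
        self.rows = []          # each row: {"doc": ..., "done": {server: bool}}
        self.servers = servers

    def enqueue(self, doc):
        self.rows.append({"doc": doc, "done": {s: False for s in self.servers}})

    def pending_for(self, server):
        return [r for r in self.rows if not r["done"][server]]

def poll_once(queue, server, index_locally):
    """One polling pass of the per-server service loop."""
    for row in queue.pending_for(server):
        try:
            index_locally(row["doc"])
            row["done"][server] = True   # mark processed for this server only
        except Exception:
            pass                         # leave unmarked; retried next poll
```

Because each server tracks its own "done" flag, a server that was down simply catches up the next time its service polls.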
2

I know that this is an old question, but I just came across it and wanted to give my 2 cents for anyone else looking for advice on a multi-server implementation.

Why not keep the index files on a shared NAS folder? How is that different from storing the index in a database, which you were contemplating? A database can be replicated for high availability, and so can a NAS!

I would configure the two app servers that you have behind a load balancer. Any index request that comes in will index documents into a machine-specific folder on the NAS. That is, there will be as many indexes on the NAS as you have app servers. When a search request comes in, you do a multi-index search using Lucene. Lucene has constructs (MultiSearcher) built in to do this, and the performance is still excellent.
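Conceptually, the multi-index search just merges hits from each per-server index by score, which is what Lucene's MultiSearcher does over several IndexSearcher instances. A minimal language-agnostic sketch, where the `indexes` callables are stand-ins for real searchers:

```python
# Sketch of what a multi-index search amounts to: query every
# per-server index and merge the hits by relevance score.

import heapq

def multi_search(indexes, query, top_n=10):
    """indexes: list of callables(query) -> [(score, doc_id), ...]"""
    all_hits = []
    for search in indexes:
        all_hits.extend(search(query))
    # Highest score first, matching Lucene's default relevance ordering.
    return heapq.nlargest(top_n, all_hits, key=lambda hit: hit[0])
```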

  • I haven't verified whether this is true or not, but the following answer says "one of the major Lucene recommendations is to not use networked file systems": http://stackoverflow.com/a/8562566/1145177 The Lucene FAQ says: "Use a local filesystem. Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try to mount the remote filesystem as a readonly mount": http://wiki.apache.org/lucene-java/ImproveSearchingSpeed – Doug S Oct 20 '13 at 00:44
1

+1 for Sean Carpenter's answer. Indexing on both servers seems like the sanest and safest choice.

If the documents you're indexing are complex (Word/PDF and the like), you could perform some preprocessing on a single server and then hand the result to the indexing servers, to save some processing time.

A solution I've used before involves creating an index chunk on one server, then rsyncing it over to the search servers and merging the chunk into each index, using IndexWriter.AddIndexesNoOptimize. You can create a new chunk every 5 minutes or whenever it gets to a certain size. If you don't have to have absolutely up-to-date indexes, this might be a solution for you.
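The chunk rotation described above might look roughly like this; `flush` stands in for "close the chunk, rsync it to each search server, and merge it with IndexWriter.AddIndexesNoOptimize", and the thresholds and names are illustrative:

```python
# Sketch of chunked indexing: accumulate documents into a chunk and
# ship it (via the flush callback) whenever it reaches a size or age
# threshold, e.g. every 5 minutes.

import time

class ChunkIndexer:
    def __init__(self, flush, max_docs=1000, max_age_s=300):
        self.flush = flush           # callable(list_of_docs): ship + merge
        self.max_docs = max_docs
        self.max_age_s = max_age_s   # 300 s = the 5-minute cadence above
        self.chunk = []
        self.started = time.monotonic()

    def add(self, doc):
        self.chunk.append(doc)
        if (len(self.chunk) >= self.max_docs
                or time.monotonic() - self.started >= self.max_age_s):
            self._rotate()

    def _rotate(self):
        if self.chunk:
            self.flush(self.chunk)   # rsync chunk out, merge on each server
        self.chunk = []
        self.started = time.monotonic()
```

The trade-off is exactly the one the answer names: searches see the new documents only after the next chunk is merged, so the indexes lag by up to one rotation interval.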

itsadok
1

In the Java world, we solved this problem by putting an MQ in front of the index(es). The insert was only complete when the bean that pulled from the queue succeeded; otherwise it rolled back any action it took, marked the doc as pending, and tried again later.
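A simplified sketch of that queue-fronted flow: a message only leaves the queue when indexing succeeds; otherwise it stays pending and is retried later. Real MQ acknowledge/rollback semantics are reduced here to a plain deque with a retry counter.

```python
# Sketch: consume indexing messages from a queue; on failure the doc
# is re-queued as pending and retried on a later pass, up to a limit.

from collections import deque

def consume(queue, index_doc, max_attempts=3):
    """Drain the queue once; failed docs are re-queued with a retry count."""
    for _ in range(len(queue)):
        doc, attempts = queue.popleft()
        try:
            index_doc(doc)               # "insert complete" only on success
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((doc, attempts + 1))  # pending; try again later
```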

Aaron Saunders
0

The way we keep our load-balanced servers in sync, each with its own copy of Lucene, is to have a task on another server that runs every 5 minutes, commanding each load-balanced server to update its index up to a certain timestamp.

For instance, the task sends a timestamp of '12/1/2013 12:35:02.423' to all the load-balanced servers (the task submits the timestamp via query string to a web page on each load-balanced site); each server then uses that timestamp to query the database for all updates that have occurred since its last update through to that timestamp, and updates its local Lucene index.

Each server also stores its timestamp in the DB, so it knows when each server was last updated. If a server goes offline, then when it comes back online, the next time it receives a timestamp command it'll grab all the updates it missed while it was offline.
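A sketch of that timestamp-driven catch-up, with a dict standing in for the shared database and invented names throughout:

```python
# Sketch: each server remembers in the shared DB how far it has synced;
# a timestamp command pulls every update in (last_update, ts] into the
# local index, so a server that was offline catches up automatically.

class SyncedServer:
    def __init__(self, name, db):
        self.name = name
        self.db = db                      # stand-in for the shared database
        self.local_index = []
        self.db["last_update"].setdefault(name, 0.0)

    def on_timestamp(self, ts):
        """Called when the coordinator task sends a timestamp command."""
        since = self.db["last_update"][self.name]
        # All updates after the last sync, up to and including ts.
        missed = [u for t, u in self.db["updates"] if since < t <= ts]
        self.local_index.extend(missed)
        self.db["last_update"][self.name] = ts   # record how far we got
```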

Doug S