
I am interested in running Lucene.NET for an application that runs in Windows clusters. The search problem itself is reasonably small, but the stateless/cluster problem still has to be handled.

I understand that Solr handles my scenario (and more), but requiring a servlet container (and Java) poses some problems for me. Depending on the complexity of a Lucene.NET-based approach, it may still be a viable option, though.

My question now is what options I have for handling the problem of running on multiple hosts:

  • Persist on shared storage, common to all nodes? Would Lucene.NET handle concurrency transparently? Would servers use RAM for caching, and if so, does Lucene.NET transparently invalidate it when the underlying files change? (See the sketch after this list.)

  • Replication? Each server has its own copy of everything it needs. On any update, all servers get a new replica (or diff if this is reasonably simple). Existing tools for this, or up to me to handle?

  • Workload partitioning/sharding? Each server handles only its own data, both for reads and updates. Tools for handling this, joining partial results etc?

  • Other options I may have missed in my initial investigation?
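
For context on the first option: Lucene.NET coordinates writers itself, not via any cluster layer. It places a write lock in the index directory, so only one IndexWriter can be open at a time (a second one throws LockObtainFailedException), and open readers are point-in-time snapshots that must be reopened to see new commits. A minimal sketch of both behaviors, assuming Lucene.NET 4.8 and a hypothetical UNC share path:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    class SharedIndexSketch
    {
        const LuceneVersion Version = LuceneVersion.LUCENE_48;

        static void Main()
        {
            // Hypothetical UNC path; every node opens the same share.
            var dir = FSDirectory.Open(@"\\fileserver\search\index");

            // Only one IndexWriter may hold write.lock; a second writer on
            // another node throws LockObtainFailedException instead.
            var config = new IndexWriterConfig(Version, new StandardAnalyzer(Version));
            using (var writer = new IndexWriter(dir, config))
            {
                var doc = new Document
                {
                    new StringField("id", "1", Field.Store.YES),
                    new TextField("body", "hello cluster", Field.Store.YES)
                };
                writer.UpdateDocument(new Term("id", "1"), doc); // upsert by id
                writer.Commit();
            }

            // Readers are snapshots: an open reader never sees later commits.
            // OpenIfChanged returns a fresh reader, or null if nothing changed.
            var reader = DirectoryReader.Open(dir);
            var newer = DirectoryReader.OpenIfChanged(reader);
            if (newer != null)
            {
                reader.Dispose();
                reader = newer; // swap in the refreshed snapshot
            }
            reader.Dispose();
        }
    }

Whether the lock file and directory semantics are reliable over a particular shared storage (SMB, SAN, etc.) is a separate question that Lucene.NET does not answer for you.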

When experimenting with a local version, my Lucene directory was on the order of a couple hundred megabytes. Longer term I can see 1-5 GB, perhaps. If the frequency of updates is a difficulty, I can control it fairly flexibly. Concurrent read/search loads are expected to be very moderate.

Amro
    Not a direct answer, but take a look at elasticsearch (http://www.elasticsearch.org/) - handles most of your needs quite easily. – Mikos Feb 16 '12 at 01:33
  • What, if any, requirements do you have for keeping your data in sync between cluster members? We're in the middle of a fairly large scale cluster deployment of Lucene.NET and I might be able to provide some guidance if I understood your situation better. – M.Babcock Sep 29 '13 at 04:12

1 Answer


You can use Lucene.NET with multiple servers, but you have to implement an indexing server.

All changes should be queued, with the pending documents indexed every now and again. You should also index immediately once x items are in the queue (x depends on your merge-docs setting; it was 25,000 for me).
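
A minimal sketch of that queue-and-flush pattern, assuming Lucene.NET 4.8; the path, the threshold, and the idea of a timer calling Flush are placeholders to adapt:

    using System.Collections.Concurrent;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    // Single-writer indexing service: callers enqueue, one thread flushes.
    class IndexingService
    {
        const LuceneVersion Version = LuceneVersion.LUCENE_48;
        const int FlushThreshold = 25000; // tune to your merge settings

        readonly ConcurrentQueue<Document> _pending = new ConcurrentQueue<Document>();
        readonly IndexWriter _writer;

        public IndexingService(string indexPath)
        {
            var config = new IndexWriterConfig(Version, new StandardAnalyzer(Version));
            _writer = new IndexWriter(FSDirectory.Open(indexPath), config);
        }

        public void Enqueue(Document doc)
        {
            _pending.Enqueue(doc);
            if (_pending.Count >= FlushThreshold)
                Flush(); // index immediately once the queue is big enough
        }

        // Also invoke this from a timer for the "every now and again" part.
        public void Flush()
        {
            Document doc;
            while (_pending.TryDequeue(out doc))
                _writer.AddDocument(doc);
            _writer.Commit(); // one commit per batch, not per document
        }
    }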

The reasoning behind this is that you need to avoid making many small changes to the index, as this degrades performance over time due to the many small files being created. You can run two indexing servers, but only one will index at a time because of the lock on the index; the only reason to do this is failover if the first goes down, depending on your needs.
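
The write lock also makes that failover arrangement easy to detect: the standby indexer simply retries until it can obtain the lock. A rough sketch, again assuming Lucene.NET 4.8:

    using System;
    using System.Threading;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    class StandbyIndexer
    {
        // Block until the active indexer releases (or loses) the write lock.
        static IndexWriter WaitForWriteLock(Lucene.Net.Store.Directory dir)
        {
            var version = LuceneVersion.LUCENE_48;
            while (true)
            {
                try
                {
                    // A fresh config each attempt: an IndexWriterConfig
                    // instance must not be reused across writers.
                    var config = new IndexWriterConfig(version, new StandardAnalyzer(version));
                    return new IndexWriter(dir, config); // throws while locked
                }
                catch (LockObtainFailedException)
                {
                    Thread.Sleep(TimeSpan.FromSeconds(30)); // stay on standby
                }
            }
        }
    }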

I have used an index of 15 GB with 30 million records. The scenario I had with this was on Azure:

  • 1 worker role to index changes

  • 2-20 web roles serving content, each holding a copy of the index.

Changes were pushed every 15 minutes, the index was merged every 25,000 changes, and each combined index contained 250,000 documents. Each web server checked blob storage for changes every 15 minutes and locked the index reader, which was then invalidated if changes were downloaded. The maximum-documents-per-file setting is basically there to stop the web servers downloading lots of previous changes.
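
On the serving side, the lock-and-invalidate cycle described above maps onto Lucene.NET's SearcherManager, which handles the reader swap and the locking for you. A sketch, assuming Lucene.NET 4.8 and a local index directory that a separate sync step keeps up to date:

    using System;
    using System.Threading;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    class SearchNode
    {
        static void Main()
        {
            var dir = FSDirectory.Open(@"D:\local\index"); // hypothetical local copy
            var manager = new SearcherManager(dir, null);

            // Poll for changes; in this answer's setup that happened every
            // 15 minutes, after new segments were pulled from blob storage.
            var refresh = new Timer(_ => manager.MaybeRefresh(), null,
                                    TimeSpan.Zero, TimeSpan.FromMinutes(15));

            // Per search request: acquire a stable snapshot, always release it.
            IndexSearcher searcher = manager.Acquire();
            try
            {
                // ... run queries against 'searcher' here ...
            }
            finally
            {
                manager.Release(searcher);
            }
        }
    }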

I did use Lucene.AzureDirectory to begin with, but it wasn't reliable at detecting changed blobs in blob storage, so I ended up iterating the blobs, comparing them against the local copies, and downloading as necessary.
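
This answer predates the current Azure SDKs; with today's Azure.Storage.Blobs package, that compare-and-download loop might look roughly like the following (the container name and local path are placeholders, and ETag comparison stands in for whatever change detection you choose):

    using System.Collections.Generic;
    using System.IO;
    using Azure.Storage.Blobs;

    class IndexDownloader
    {
        // Download any blob whose ETag differs from the one seen last sync.
        static void SyncIndex(string connectionString, string localDir,
                              Dictionary<string, string> knownETags)
        {
            var container = new BlobContainerClient(connectionString, "search-index");
            foreach (var blob in container.GetBlobs())
            {
                string etag = blob.Properties.ETag.ToString();
                string known;
                if (knownETags.TryGetValue(blob.Name, out known) && known == etag)
                    continue; // unchanged since the last sync

                container.GetBlobClient(blob.Name)
                         .DownloadTo(Path.Combine(localDir, blob.Name));
                knownETags[blob.Name] = etag;
            }
        }
    }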

Now, would I implement something like this again? The answer is a big no. I would use Elasticsearch or Solr instead, as you are otherwise reinventing the wheel.

Dreamwalker