
We are on Sitecore 6.4 and are using the shared source Advanced Search module. We are seeing a big degradation in site search performance when the Sitecore re-index process kicks in and applies the changes to the web database.

When we kick off a full site publish, the indexing manager picks up the changes and processes the history records, which in turn re-indexes each item that has been affected. Because this happens per item, you can see the Lucene index on disk changing whilst watching the directory (the number of files grows and changes as you watch).

If you try to search on the public website while this is happening, the search can take noticeably longer to complete, and under heavy load it can take up to 15 seconds longer until the re-index process has finished.

I can see this process is controlled by the IndexingProvider class. Is there any way in which to override this class and implement our own?

We have looked at the searching logic and can see that an IndexSearchContext object is created each time a search is requested, which in turn creates a new IndexSearcher. We have changed some of the logic so that the IndexSearchContext is preserved as a singleton, which of course means that multiple requests can be served by the same Lucene IndexSearcher. This has drastically reduced memory consumption, as reusing the same searcher is recommended to increase performance.

However, in doing this, changes to the index will not be picked up until a new IndexSearcher is created. We need a way in which to notify our code that the indexing process has finished and then we can reset our singleton IndexSearchContext object. How might we integrate this logic into the Sitecore configured code?
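Roughly, what we have in mind is something like this (a sketch only; the class and method names are illustrative, and the open question is which Sitecore event the Reset call should be wired to):

    // Sketch (illustrative names): cache one IndexSearchContext across
    // requests and reset it once we know the indexing process has finished.
    public static class SearchContextCache
    {
        private static readonly object Sync = new object();
        private static Sitecore.Search.IndexSearchContext _context;

        public static Sitecore.Search.IndexSearchContext Get(Sitecore.Search.Index index)
        {
            lock (Sync)
            {
                // Reuse the same underlying IndexSearcher for every search.
                return _context ?? (_context = index.CreateSearchContext());
            }
        }

        // The missing piece: something needs to call this when indexing ends,
        // e.g. a handler registered against a suitable event in web.config.
        public static void Reset()
        {
            lock (Sync)
            {
                if (_context != null)
                {
                    _context.Dispose(); // releases the cached IndexSearcher
                    _context = null;    // next search creates a fresh context
                }
            }
        }
    }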

When rebuilding the index manually, it only takes about 5 seconds to complete. Obviously this effectively deletes the index and then recreates it all, but why does the item-by-item update take so long? Is there a better way to achieve the update without going item by item and without affecting the public website?

I would have expected others to be affected by this problem so I'm keen to hear how people have tackled the problem.

EDIT - additional info from Sitecore forum

The Sitecore.Search code does seem to make heavy use of creating/disposing new Lucene objects for a single operation. It does not seem overly scalable for large environments, which is why I was surprised when I saw the code. Especially if the indexes are large and there are a lot of content updates/publishes each day.

Looking at the classes via dotPeek I cannot see how we would override the IndexUpdateContext as it's created in a non virtual method. A custom DatabaseCrawler could get some access but only to the context object already created.

I notice that we can define our own Index implementation in the web.config for each index. We can also re-implement the crawler (we already have the advanced crawler in place from the shared module) and maybe get some control of the indexing process. I would be reluctant to pull out too much of the Sitecore code into our own implementation as it may affect future updates.

I have one question though regarding the IndexingProvider. In the following method:

private void UpdateItem(HistoryEntry entry, Database database)
{
    int count = database.Indexes.Count;
    if (count != 0 || this.OnUpdateItem != null)
    {
        Item obj = database.GetItem(entry.ItemId, entry.ItemLanguage, entry.ItemVersion);
        if (obj != null)
        {
            if (this.OnUpdateItem != null)
                this.OnUpdateItem((object) this, (EventArgs) new SitecoreEventArgs("index:updateitem", new object[2]
                {
                    (object) database,
                    (object) obj
                }, new EventResult()));
            for (int index = 0; index < count; ++index)
                database.Indexes[index].UpdateItem(obj);
        }
    }
}

It fires the update event, which is handled by the DatabaseCrawler as it is attached to the IndexingProvider.OnUpdateItem event; but why does the method above also call the Sitecore.Data.Indexing.Index.UpdateItem method? I thought that namespace was being deprecated in version 6.5, so I'm surprised to see a link between the new and the old namespaces.

So it looks like the DatabaseCrawler is handling the update, which deletes the item and then adds it again to the index, and then the old Sitecore.Data.Indexing.Index also tries to update it. Surely something is wrong here? I don't know for certain, so please correct me if I am wrong; this is just what it looks like when I trace through the decompiled code without any debugging.

Tim Peel
  • This seems like advanced (and pretty neat stuff) you're wanting to do (and have done so far). Maybe you should try to talk to Sitecore about it... if you have improvements, I'm sure they love to hear about them :) – Holger Oct 09 '11 at 09:56
Hi Tim, I could only imagine this if the index is either completely rebuilt on every publish or if you have a huge number of users doing searches. Do you have a large number of visitors, or something special set up for the rebuild? Could you try disabling the update of the index and then do the same testing? It might be related to publishing in general and not just the indexing. – Jens Mikkelsen Oct 10 '11 at 06:59
  • @Jens, thanks, we'll try disabling the index process. I don't feel like we are doing anything out of the ordinary in terms of content size or number of visitors. The site is relatively small, but we need to know the performance will scale going forward. If the index is completely rebuilt it works a lot quicker, around 5 seconds for a full rebuild. Updating each item one by one can take much longer, as you would imagine. – Tim Peel Oct 10 '11 at 11:05
  • Is there some reason you are doing a Full Site Publish instead of a Smart Publish? (I normally hide the Publish Site button) Additionally, I think this is another reason to use a workflow with an auto-publish as the last step, where the items can get indexed as necessary. In your case the dataset is small, but I think there is a good argument against reindexing everything when the customer has potentially millions of records. To follow on to @JensMikkelsen's comment, I have used indexes extensively and found them to be quite performant if you understand how Lucene works. – Patrick Jones Nov 26 '12 at 21:07

2 Answers


I would recommend two things:

  1. Use the Advanced Database Crawler (v2 is the latest version), which wraps the Sitecore.Search namespace. This makes it super easy to use Lucene.NET with Sitecore.

  2. Rebuild the indexes fully each day. This defragments the indexes, as fragmentation over time can reduce performance (which might be your issue here).
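A nightly rebuild can be wired up with a simple scheduled agent, along these lines (a sketch only: the agent class and index names are hypothetical, and the Rebuild() call assumes the Sitecore.Search.Index API in Sitecore 6.x):

    // Sketch of a nightly rebuild agent (hypothetical class name).
    public class RebuildIndexesAgent
    {
        public void Run()
        {
            // "web_content" is an assumed index name; use the names
            // defined under <search><configuration> in your web.config.
            Sitecore.Search.Index index = Sitecore.Search.SearchManager.GetIndex("web_content");
            if (index != null)
            {
                // Drops and recreates the whole index, which also
                // defragments it (equivalent to the fast manual rebuild).
                index.Rebuild();
            }
        }
    }

The agent would then be registered under the <scheduling> section of web.config with a 24-hour interval, like the other Sitecore agents.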

Mark Ursino

I've come across similar problems before. When I analysed what was going on, all of the time was being spent opening the index for every search.

The way we ended up solving it was by bypassing Sitecore's index classes and going direct to Lucene. Lucene provides a "Reopen" method which only opens the modified segment files, as opposed to all of the segment files like Sitecore does.

So what we did was:

  1. Open an Index Reader if we didn't have one already
  2. Create an application level reference to it, so that we can re-use it
  3. On each search call "Reopen" on the application index reader
  4. Search

Have a look at the documentation for the Lucene.Net.Index.IndexReader.Reopen method.

You can create an Index Reader from Sitecore.Search.Index.CreateReader()
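The four steps above can be sketched roughly as follows (illustrative names; Reopen returns a new reader only when the index has changed since the current one was opened):

    // Sketch: an application-level IndexReader reused across searches
    // and refreshed cheaply via Reopen() before each search.
    public static class SharedReader
    {
        private static readonly object Sync = new object();
        private static Lucene.Net.Index.IndexReader _reader;

        public static Lucene.Net.Search.IndexSearcher GetSearcher(Sitecore.Search.Index index)
        {
            lock (Sync)
            {
                if (_reader == null)
                {
                    _reader = index.CreateReader(); // step 1: open once
                }
                else
                {
                    // Step 3: Reopen only loads the modified segment files.
                    var reopened = _reader.Reopen();
                    if (reopened != _reader)
                    {
                        _reader.Close(); // old snapshot is no longer needed
                        _reader = reopened;
                    }
                }
                return new Lucene.Net.Search.IndexSearcher(_reader); // step 4
            }
        }
    }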