
I am writing very large documents (both in size and count) to a Solr index (hundreds of fields, mostly numeric with some text fields). I am using Tomcat 7 on Windows 7 x64.

Based on @Mauricio's suggestion for indexing millions of documents, I parallelize the write operation (see code sample below).

The write-to-Solr method is spun off as a Task from the main loop (note: I task it out because the write operation takes too long and would otherwise hold up the main app).

The problem is that memory consumption grows uncontrollably; the culprit is the Solr write operations (when I comment them out, the run works fine). How do I handle this issue? Via Tomcat? Or SolrNet?

Thanks for your suggestions.

    // main loop:
    {
        // ...

        // indexDocsList is the list I build in the main loop and "chunk" out to the task.
        List<IndexDocument> indexDocsList = new List<IndexDocument>();
        for (int n = 0; n < N; n++)
        {
            indexDocsList.Add(new IndexDocument { X = 1, Y = 2 /* ... */ });
            if (n % 5 == 0) // every 5th iteration we write to Solr
            {
                // hand the current chunk to a background task so the main loop is not blocked
                var chunk = new List<IndexDocument>(indexDocsList);
                indexDocsList.Clear();
                Task.Factory.StartNew(() => WriteToSolr(chunk)).ContinueWith(task => chunk.Clear());
                GC.Collect();
            }
        }
    }

    private void WriteToSolr(List<IndexDocument> indexDocsList)
    {
        try
        {
            if (indexDocsList == null) return;
            if (indexDocsList.Count <= 0) return;

            int fromInclusive = 0;
            int toExclusive = indexDocsList.Count;
            int subRangeSize = 25;

            // TO DO: This is still leaking some serious memory, need to fix this
            ParallelLoopResult results = Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, subRangeSize), range =>
            {
                // send each sub-range of 25 documents to Solr and commit it
                _solr.AddRange(indexDocsList.GetRange(range.Item1, range.Item2 - range.Item1));
                _solr.Commit();
            });

            indexDocsList.Clear();
            GC.Collect();
        }
        catch (Exception ex)
        {
            logger.ErrorException("WriteToSolr()", ex);
        }
        finally
        {
            GC.Collect();
        }
    }
Mikos
  • IMHO this code is overly complicated... why not just use the code I posted on my blog? – Mauricio Scheffer Dec 02 '12 at 14:21
  • @Maurico - how would that make any difference? I am only using a different parallelization routine. – Mikos Dec 02 '12 at 14:28
  • I guess my concern is Tomcat seems to be chewing up a lot of memory, am I doing something fundamentally wrong? – Mikos Dec 02 '12 at 14:48
  • sorry, I understood that it was the .NET process eating up too much memory. About the code, simpler code makes it easier to reason about it. – Mauricio Scheffer Dec 02 '12 at 14:56
  • Thanks @Maurico, it is the _solr.AddRange(indexDocsList.GetRange(range.Item1, range.Item2 - range.Item1)); that seems to be gobbling up memory. I can see Tomcat memory grow... not sure how to handle it. Any ideas? – Mikos Dec 02 '12 at 14:59
  • Is there any noticeable difference when you change the range size (up or down)? Does Tomcat actually run out of memory/crash? – Brendan Hannemann Dec 03 '12 at 18:20
  • @unicron, changing the range size up or down does make a difference, but I cannot discern a pattern. Also, the system quickly OOMs out. – Mikos Dec 03 '12 at 23:14

1 Answer


You are manually committing after each batch, which is the most expensive operation for Solr. In your case, I would recommend configuring autoCommit to run every x seconds and enabling the soft auto commit feature (Solr 4.0). That should take care of Solr's side of things. You will also have to tune your JVM garbage collection options so that you don't get stop-the-world GC pauses.
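For reference, a minimal sketch of what the auto commit settings might look like in solrconfig.xml on Solr 4.x; the interval values below are illustrative assumptions, not recommendations:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- hard commit: flush to stable storage periodically, without opening a new searcher -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- soft commit: make new documents visible to searches more frequently -->
      <autoSoftCommit>
        <maxTime>5000</maxTime>
      </autoSoftCommit>
    </updateHandler>

With auto commits enabled on the server, the explicit _solr.Commit() call inside the Parallel.ForEach body of WriteToSolr can be dropped, so the client only sends documents and lets Solr decide when to commit.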

zbugs