I am working with SolrCloud and have hit a problem that can cause the indexing process to hang.
My deployment is a single collection with 5 shards running on 5 machines. Every day we run a full index of about 50M documents using the DataImportHandler; we trigger the import on one of the 5 machines and rely on SolrCloud's distributed indexing.
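For a bulk import of this size, the autoCommit settings in solrconfig.xml largely determine how often in-memory buffers get flushed to new segments. The fragment below is only illustrative (the values are assumptions, not a recommendation for this deployment):

```xml
<!-- solrconfig.xml: illustrative values only -->
<autoCommit>
  <maxDocs>100000</maxDocs>       <!-- hard-commit after this many docs... -->
  <maxTime>60000</maxTime>        <!-- ...or after 60s, whichever comes first -->
  <openSearcher>false</openSearcher> <!-- flush without reopening a searcher -->
</autoCommit>
```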
I have found that sometimes one of the 5 machines dies with:
2013-01-08 10:43:35,879 ERROR core.SolrCore - java.io.FileNotFoundException: /home/admin/index/core_p_shard2/index/_31xu.fnm (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
at org.apache.lucene.codecs.lucene40.Lucene40FieldInfosReader.read(Lucene40FieldInfosReader.java:52)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:101)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:57)
at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:120)
at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:267)
at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:3010)
at org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:180)
at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:310)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:386)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1445)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:448)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:230)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
I have checked the index directory, and it indeed does not contain _31xu.fnm. I am wondering whether there is a concurrency bug in distributed indexing.
As far as I know, distributed indexing works like this: you can send documents to any shard, and each document is forwarded to the correct shard according to a hash of its id. The DataImportHandler forwards documents to the correct shard through the update handler, and finally documents are flushed to disk via DocumentsWriterPerThread. I am wondering whether too many update requests sent from the shard that triggered indexing caused the problem. My guess is based on the fact that the machine that died had a lot of index segments, each of them very small.
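The routing step described above can be sketched roughly as follows. This is not Solr's actual CompositeIdRouter (which uses MurmurHash3 and maps ids onto ranges of a 32-bit hash ring); it is just a simplified, hypothetical illustration of deterministic id-to-shard mapping:

```python
# Simplified sketch of hash-based document routing, for illustration only.
# SolrCloud's real router is more involved; here we just take a stable hash
# of the document id modulo the shard count to show the principle.
import hashlib

NUM_SHARDS = 5  # matches the 5-shard deployment described above

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document id deterministically to one of num_shards shards."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")
    return bucket % num_shards

# Any node can receive a document; it forwards it to the leader of the
# shard that owns that id, which then sends it on to the shard's replicas.
for d in ["doc-1", "doc-2", "doc-3"]:
    print(d, "->", "shard" + str(shard_for(d) + 1))
```

Because the mapping depends only on the id, every node agrees on which shard owns a given document, regardless of which node first received it.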
I am not very familiar with Solr internals, so maybe my guess makes no sense at all. Does anyone have any ideas? Thanks.