I have a Solr index with approximately 20 million items. The items are added to the index in batches.
Approximately 5% of these items end up indexed two or more times, causing a duplicates problem.
The log confirms that these items really are added twice (or more), often with an interval of 2-3 minutes between the additions, and with other items indexed in between.
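My understanding is that Solr overwrites a document when a new document with the same uniqueKey value is added, so genuine duplicates would suggest the key is missing or its value differs between the two additions. For reference, a uniqueKey declaration in schema.xml looks roughly like this (the field name `id` is just an example, not my actual schema):

```xml
<!-- schema.xml: adding a document whose uniqueKey value already exists
     should replace the old document rather than duplicate it -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
</fields>
<uniqueKey>id</uniqueKey>
```

If the batches were generating different key values for the same logical item, that could explain what I'm seeing, but I haven't been able to confirm this.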
The web server that triggers the indexing sits behind a load balancer (two web servers); however, the indexing itself is performed by a single web server.
Here are the relevant elements of solrconfig.xml:
<indexDefaults>
  ...
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>10000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">1024.0</double>
  </mergePolicy>
  ...
</indexDefaults>

<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  ...
</mainIndex>
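As a possible workaround, I know Solr 1.4 ships a deduplication update processor that computes a signature from selected fields and can overwrite documents with matching signatures. A sketch of the solrconfig.xml wiring (the `signatureField` and the `fields` list are placeholders, not my actual schema):

```xml
<!-- Deduplication chain: documents producing the same signature
     overwrite each other instead of accumulating as duplicates -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,description</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

This would mask the symptom, though, and I'd rather understand the root cause first.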
I'm using Solr 1.4.1 and Tomcat 7.0.16, together with the latest SolrNet library.
What might be causing this duplicates problem? Thanks for any input!