3

I have a Solr index with approximately 20 million items. When these items are indexed, they are added to the index in batches.

Approximately 5% of these items end up indexed two or more times, causing a duplicates problem.

Checking the log confirms that these items are indeed added twice (or more), often 2-3 minutes apart, with other items indexed in between.

The web server that triggers the indexing is in a load-balanced environment (two web servers). However, the indexing itself is performed by a single web server.

Here are some of the config elements in solrconfig.xml:

<indexDefaults>
  .....
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>10000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>

  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">1024.0</double>
  </mergePolicy>

<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>

I'm using Solr 1.4.1 and Tomcat 7.0.16. Also I'm using the latest SolrNET library.

What might cause this duplicates problem? Thanks for all input!

Martin S Ek
  • 2,043
  • 2
  • 16
  • 22

5 Answers

6

To answer your question completely I would need to see your schema. There is a unique id field in the schema that works much like a unique key in a database: make sure the document's unique identifier is configured as the unique key, and duplicates will simply overwrite each other, keeping just one document.
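As a sketch, assuming the document's business key field is called `id` (the field name is an assumption, not taken from the original post), the relevant part of schema.xml would look like this:

```xml
<!-- schema.xml (sketch; the field name "id" is assumed) -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>

<!-- Declaring this field as the uniqueKey makes Solr overwrite any
     document that is re-added with the same id, preventing duplicates. -->
<uniqueKey>id</uniqueKey>
```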

Umar
  • 2,819
  • 20
  • 17
  • The unique identifier thing is what this was all about. Cheers! – Martin S Ek Jul 07 '11 at 12:58
  • I have two different tables from my DB which I put into my Solr index. Both have a unique key named "id". Now my two tables merge together in Solr and entries disappear because of overwriting. How do I solve this? – Rubinum Jun 11 '15 at 11:43
  • Do you really need to merge these records into a single namespace? If the records represent different entities you should consider using two separate cores. Otherwise, what you can do is prefix the table name to the id to make the ids unique across tables. – Umar Jun 12 '15 at 04:59
4

It is not possible to have two documents with identical values in the field marked as the unique id in the schema. Adding two documents with the same value will simply result in the latter overwriting (replacing) the former.

So it sounds like a mistake on your side: the documents are probably not actually identical.

Make sure your schema and id fields are correct.
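As a sketch of this behavior (field names assumed, not from the original post): posting two documents with the same uniqueKey value to Solr's `/update` handler leaves only the latest version in the index after a commit.

```xml
<!-- Posting both of these (then committing) ends with a single
     document for id "item-123": the second replaces the first. -->
<add>
  <doc>
    <field name="id">item-123</field>
    <field name="title">First version</field>
  </doc>
</add>
<add>
  <doc>
    <field name="id">item-123</field>
    <field name="title">Second version, replaces the first</field>
  </doc>
</add>
```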

Yuriy
  • 1,964
  • 16
  • 23
1

To complete what was said above: one solution in this case is to generate a unique ID for the document in code (or to designate one of the existing fields as the unique ID) before sending it to Solr.

That way you ensure the document you want to update is overwritten rather than recreated.
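A minimal sketch of generating such an ID in code (the helper names and the `table:primaryKey` scheme are assumptions for illustration, not part of the original post). The key property is that the ID is deterministic: re-indexing the same source row always yields the same Solr id, so the document is overwritten instead of duplicated.

```java
import java.util.UUID;

public class SolrDocId {
    // Derive a stable document id from the source table and the row's
    // primary key. Re-indexing the same row produces the same id, so
    // Solr overwrites the existing document instead of adding a new one.
    public static String makeDocId(String table, String primaryKey) {
        return table + ":" + primaryKey;
    }

    // Variant: a deterministic (name-based) UUID from the same composite
    // key, useful if the id field is expected to look like a GUID.
    public static UUID makeDocUuid(String table, String primaryKey) {
        return UUID.nameUUIDFromBytes((table + ":" + primaryKey).getBytes());
    }

    public static void main(String[] args) {
        System.out.println(makeDocId("products", "42"));
        System.out.println(makeDocUuid("products", "42"));
    }
}
```

Prefixing the table name also keeps ids unique across multiple source tables sharing one index, as mentioned in the comments above.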

Dorin
  • 2,482
  • 2
  • 22
  • 38
0

Actually, all added documents get an auto-generated unique key, through Solr's own uuid type:

<field name="uid" type="uuid" indexed="true" stored="true" default="NEW"/>

So any document added to the index is considered a new one, since it gets a fresh GUID. However, I think we've got a problem with some other code here: code that adds items to the index when they are updated, instead of just updating them.
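The interaction that defeats deduplication can be sketched like this (assuming `uid` is also the schema's uniqueKey, which the original post does not show):

```xml
<!-- Every <add> without an explicit uid gets a fresh GUID
     (default="NEW"), so the uniqueKey never matches an existing
     document and re-indexed items always create new ones. -->
<field name="uid" type="uuid" indexed="true" stored="true" default="NEW"/>
<uniqueKey>uid</uniqueKey>
```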

I'll be back! Thanks so far!

Martin S Ek
  • 2,043
  • 2
  • 16
  • 22
0

OK, it turned out there were a couple of bugs in the code updating the index. Instead of updating, a new document was always added to the index, even though the item already existed.

The existing document wasn't overwritten because every document in our Solr index gets its own auto-generated GUID.

Thank you for your answers and time!

Martin S Ek
  • 2,043
  • 2
  • 16
  • 22
  • 1
    Can you provide some more details, like the curl command you used to index the doc as well as the schema.xml and solrconfig.xml? It would make this page a lot more useful. – Sanjay Rao May 10 '13 at 17:29