I am using Solr 5.2.0 with a collection split into 2 shards, each served by 2 nodes. Each shard is assigned a hash range in clusterstate.json, which divides records among the shards like this:
"shard1": {"range": "0-7fffffff"}, "shard2": {"range": "80000000-ffffffff"}
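For context on how those ranges are used: with the default compositeId router, Solr picks the shard by taking a MurmurHash3 (x86, 32-bit) of the document id and checking which range the hash falls in. Assuming plain ids (no "!" composite routing prefixes) and ASCII id values, the routing decision can be recomputed outside Solr with a sketch like this (the function names are mine, and the range bounds are taken from the clusterstate.json above; verify against your cluster before relying on it):

```python
# Sketch: recompute Solr's compositeId routing decision for a plain doc id.
# Assumes no "!" composite keys and ASCII ids (so hashing the UTF-8 bytes
# matches Solr's string hash). Range bounds come from clusterstate.json.

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, the hash used by Solr's compositeId router."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    tail = n & ~3
    # Body: process 4-byte little-endian blocks.
    for i in range(0, tail, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: up to 3 trailing bytes.
    k = 0
    rem = n & 3
    if rem == 3:
        k ^= data[tail + 2] << 16
    if rem >= 2:
        k ^= data[tail + 1] << 8
    if rem >= 1:
        k ^= data[tail]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def owning_shard(doc_id: str) -> str:
    """Map an id to shard1 (0-7fffffff) or shard2 (80000000-ffffffff)."""
    h = murmur3_x86_32(doc_id.encode("utf-8"))
    return "shard1" if h < 0x80000000 else "shard2"
```

With this, any id found on a shard2 node for which `owning_shard(doc_id)` returns `"shard1"` is a misplaced duplicate.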
Due to an issue with shard assignment at the node level, some data was replicated from a shard1 node to a shard2 node, so each core now contains duplicate records. When an update is sent to Solr, a record that belongs in shard1 is properly updated on the shard1 nodes, but a stale copy of some shard1 records remains on the shard2 nodes, so Solr will (depending on latency, it seems) return the older version of the record whenever a shard2 node answers the request.
I'm trying to find these duplicate records that should not exist on the shard2 nodes and remove them.
I've tried some facet searches, but had no success finding the duplicates that way. The key requirement is not just to find the duplicates, but to find the copies that should not be on a given shard (based on the range definitions) and delete only those records.
Alternatively, a query that tells me whether a record on a node actually belongs on that node (again, based on the range) would suffice, since I could then simply delete by query.
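One workable approach, sketched below under assumptions: query each suspect core directly with `distrib=false` (so only that core answers, without fanning out to the other shard), collect the ids, test each id with some routing predicate, and then send a delete-by-id for the misplaced ones back to that same core. The core URL, the `id` field name, and the `belongs_here` predicate are placeholders for your setup, and `distrib=false` on the /update request is reported to keep the delete local to the receiving core rather than re-routing it to the correct shard's leader; please verify that behavior on a test node before running this against production:

```python
# Sketch: scan a single core (bypassing distributed search) and delete the
# documents whose ids fail a routing predicate. Core URL, the "id" field,
# and belongs_here are assumptions to adapt to your cluster.
import json
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

def fetch_ids(core_url, rows=500):
    """Page through every doc id stored on one core with distrib=false."""
    start = 0
    while True:
        params = urllib.parse.urlencode({
            "q": "*:*", "fl": "id", "sort": "id asc",
            "start": start, "rows": rows,
            "distrib": "false", "wt": "json",
        })
        with urllib.request.urlopen(f"{core_url}/select?{params}") as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            return
        for doc in docs:
            yield doc["id"]
        start += rows

def delete_payload(ids):
    """Build an XML delete-by-id body for Solr's /update handler."""
    body = "".join(f"<id>{escape(i)}</id>" for i in ids)
    return f"<delete>{body}</delete>"

def purge_misplaced(core_url, belongs_here):
    """Delete every doc on this core whose id fails the routing predicate."""
    wrong = [i for i in fetch_ids(core_url) if not belongs_here(i)]
    if wrong:
        req = urllib.request.Request(
            f"{core_url}/update?distrib=false&commit=true",
            data=delete_payload(wrong).encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        urllib.request.urlopen(req).close()
    return wrong
```

Usage would look like `purge_misplaced("http://host:8983/solr/collection1_shard2_replica1", belongs_here)`, where `belongs_here` decides whether an id hashes into shard2's range (for example, by recomputing the MurmurHash3 routing hash Solr uses). Collecting all misplaced ids before deleting avoids paging over an index that is changing underneath the scan.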