What are the best practices for creating a Solr-based de-duplication system?

Question

I am setting up a Solr-based de-duplication system that should return the records matching the search criteria. I used the DataImportHandler to pull data from a database and index the documents on the Solr server.

My Solr schema is below:

<field name="customer_id" type="int" indexed="true" stored="true" required="true" />
<field name="fname" type="phonetic" indexed="true" stored="true" />
<field name="lname" type="phonetic" indexed="true" stored="true"/>
<field name="address" type="text_en" indexed="true" stored="true" />
<field name="city" type="string" indexed="true" stored="true"  />
<field name="state" type="string" indexed="true" stored="true"  />
<field name="zipcode" type="string" indexed="true" stored="true"  />
<field name="telephone" type="string" indexed="true" stored="true"  />

As seen above, I have set the type of the first name (fname) and last name (lname) fields to phonetic, for phonetic search using DoubleMetaphoneFilterFactory. The phonetic field type is defined as follows:

<fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"/>
  </analyzer>
</fieldtype>

I want my searches to return only the documents that match all the specified query fields, not documents that match just one of them.

My problem is that if I search on fname, lname, or address alone, the results are quite relevant, but when I use a filter query along with the primary search query, the results contain the union of the results for both criteria.

Can somebody please point out what I am doing wrong? Also, are there any best practices for designing a Solr schema for a de-duplication system that a bank could use to identify duplicate customer records?

Thanks in advance!!

What are your queries and their responses? — Jayendra, Sep 04 '12 at 09:58

score 9 · Answer 1 · answered Sep 26 '12 at 09:25

If what you want is a customer deduplication system based on Lucene, you may want to just use Duke instead. It's a general deduplication engine that uses Lucene to index the records, and then does detailed comparisons using more sophisticated comparators such as Levenshtein, Weighted Levenshtein, Jaro-Winkler, etc. It has standard connectors for JDBC databases and the like, but you can also write your own, or even just supply the engine with data directly. Comparisons are based on combining probabilities with Bayes' theorem.
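To make the idea of comparators plus Bayesian combination concrete, here is a minimal Python sketch. It is not Duke's code or API. The field weights, the linear mapping from similarity to probability, and all function names are assumptions made for illustration; only the Levenshtein distance and the two-hypothesis Bayes combination are standard.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to 0..1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def field_probability(sim: float, low: float, high: float) -> float:
    """Map a similarity to a match probability between low and high.

    Linear interpolation is an assumption of this sketch.
    """
    return low + (high - low) * sim


def combine_bayes(p1: float, p2: float) -> float:
    """Combine two independent match probabilities (both strictly in 0..1)."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))


def match_probability(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Fold the per-field probabilities together, starting from 0.5."""
    prob = 0.5
    for field, (low, high) in weights.items():
        sim = similarity(rec_a[field], rec_b[field])
        prob = combine_bayes(prob, field_probability(sim, low, high))
    return prob


# Hypothetical (low, high) weights per field; tune them on real data.
WEIGHTS = {"fname": (0.3, 0.7), "lname": (0.2, 0.8), "telephone": (0.1, 0.9)}

a = {"fname": "Jon", "lname": "Smith", "telephone": "5551234"}
b = {"fname": "John", "lname": "Smith", "telephone": "5551234"}
c = {"fname": "Mary", "lname": "Jones", "telephone": "1112222"}

likely_duplicate = match_probability(a, b, WEIGHTS)
likely_distinct = match_probability(a, c, WEIGHTS)
```

A pair would then be flagged as a duplicate when its combined probability exceeds a threshold chosen for the data set.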

From my experience with writing Duke I'd say that you're going to have a hard time getting Lucene to do all the work for you. As you say, the search results are pretty good, but precision is not going to be anything like as good as what you get from an algorithm that's designed specifically for this.

So my recommendation to you would be to get a tool that's built for deduplication. I mentioned Duke because it's based on Lucene and so close to what you're trying to build, but you could really use any record linkage engine. Duke uses Lucene for performance (so we don't have to compare all record pairs), but other engines have other ways of achieving similar performance without using search, and I guess to you it doesn't really matter whether Lucene is inside or not. So any of the tools listed on the record linkage page linked above could work for you.
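The point about not comparing all record pairs can be illustrated with a small Python sketch. It uses simple key-based blocking as a stand-in for a Lucene lookup, so it is not how Duke retrieves candidates. The blocking key (zipcode plus the first letter of lname) and the function names are assumptions chosen to match the schema in the question.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    """Hypothetical key: zipcode plus the first letter of the last name."""
    return (record["zipcode"], record["lname"][:1].lower())


def candidate_pairs(records: list) -> set:
    """Return only the id pairs that share a blocking key.

    Records in different blocks are never compared, which avoids the
    quadratic cost of comparing every record with every other record.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["customer_id"])
    pairs = set()
    for ids in blocks.values():
        for first, second in combinations(sorted(ids), 2):
            pairs.add((first, second))
    return pairs


records = [
    {"customer_id": 1, "lname": "Smith", "zipcode": "10001"},
    {"customer_id": 2, "lname": "Smyth", "zipcode": "10001"},
    {"customer_id": 3, "lname": "Jones", "zipcode": "10001"},
    {"customer_id": 4, "lname": "Smith", "zipcode": "94105"},
]

pairs = candidate_pairs(records)
```

Only the pairs returned here would go on to the detailed, more expensive comparison. A key that is too strict misses real duplicates (a mistyped zipcode, for instance), which is one reason engines use fuzzier candidate retrieval.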

Note that this has been a huge research field for a couple of decades now, and people have made good progress on solving this. So the ready-made tools really are good. There's also a bunch of commercial tools for this, but since you've started building your own I've assumed those are not relevant.

Full disclosure: I'm the author of Duke. I guess we're not supposed to promote ourselves here, but, really, to me it sounds much better to use a ready-made package than to build your own. YMMV.

Have you considered contributing Duke to the Solr community? — Oskar Austegard, Sep 08 '14 at 19:35
I'm not sure it's a good idea, since Duke has many uses outside of Solr. There have been various attempts to build Solr plugins for it, which I think is a better way to go. — larsga, Sep 10 '14 at 08:42

score 0 · Answer 2 · answered Sep 04 '12 at 10:48

It seems the query you are building is something like:

customer_id OR fname OR someOther

If you need the other fields to be involved, change the query to something like the following (customer_id and fname combined, joined to the remaining clause with a should operator):

(customer_id AND fname) OR someOther
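As a rough illustration of requiring every field, the Python sketch below builds the request parameters for a Solr select query in which all clauses are joined with AND. The helper names and the sample values are assumptions. `q`, `q.op`, `fl` and `wt` are standard Solr request parameters, and the sketch assumes the standard Lucene query parser.

```python
import re
from urllib.parse import urlencode

# Characters with special meaning in Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape(value: str) -> str:
    """Backslash-escape query syntax characters in a user-supplied value."""
    return _SPECIAL.sub(r"\\\1", value)


def build_select_params(required: dict) -> dict:
    """Build parameters for a query in which every field must match."""
    clauses = [f"{field}:({escape(value)})" for field, value in required.items()]
    return {
        "q": " AND ".join(clauses),
        "q.op": "AND",  # also require every term inside each clause
        "fl": "customer_id,fname,lname,score",  # return the score for inspection
        "wt": "json",
    }


params = build_select_params({"fname": "john", "lname": "smith"})
query_string = urlencode(params)  # append to the core's /select URL
```

Returning `score` in `fl` makes it easier to see why a given document was matched when the results look like a union.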

You can search the following sites for more info

Ruwantha


No, that's not the case. I tried searching through the Solr admin interface, using both the 'Query String' and 'Filter Query' fields, which are combined with 'AND'. — Tushu, Sep 04 '12 at 10:52
Did you find a way to check the scores of the query results? If some results have a high score and others a low score, this can still happen. — Ruwantha, Sep 04 '12 at 10:55
