
I am doing record matching to find possible duplicate records. A record is said to be a duplicate of another record if (firstname AND lastname) match AND (phone OR email) matches. The name fields are compared either exactly or fuzzily (edit distance, phonetic), while phone and email are compared exactly.

My approach is

  1. Indexed the records (100,000) in Solr.
  2. Fetched all records from Solr.
  3. Split the records into multiple batches.
  4. Assigned a thread to each batch using an ExecutorService:

executor.submit(new CompareTask(batch1));
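
A minimal sketch of steps 2-4, assuming a Record POJO, a fetchAllFromSolr() helper, and a CompareTask that implements Runnable (all placeholders for the classes in the question; the batch size is a tuning guess, not from the question):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of steps 2-4. fetchAllFromSolr(), Record and CompareTask stand in
// for the question's own classes; batchSize is an illustrative value.
List<Record> all = fetchAllFromSolr();                       // step 2
int batchSize = 1000;
ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());

for (int i = 0; i < all.size(); i += batchSize) {            // step 3
    List<Record> batch = all.subList(i, Math.min(i + batchSize, all.size()));
    executor.submit(new CompareTask(batch));                 // step 4
}
executor.shutdown();    // no new tasks; submitted batches run to completion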

  5. For each record in the batch, performed a Solr fuzzy search on firstname and lastname to get candidate records:

for (Record r1 : batch1) getCandidates(r1);

Solr query to get candidates: firstname:smith~0.5 OR lastname:james~0.5
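
In SolrJ that lookup could look roughly like this (the field names and the rows cap are assumptions, and the SolrClient would be built once up front, e.g. with new HttpSolrClient.Builder(url).build() in recent SolrJ versions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocumentList;

// Candidate lookup for one record (step 5). escapeQueryChars keeps
// unusual names from breaking the query syntax.
SolrDocumentList getCandidates(SolrClient solr, Record r1) throws Exception {
    String fn = ClientUtils.escapeQueryChars(r1.getFirstName());
    String ln = ClientUtils.escapeQueryChars(r1.getLastName());
    SolrQuery q = new SolrQuery("firstname:" + fn + "~0.5 OR lastname:" + ln + "~0.5");
    q.setRows(50);    // cap candidates per record (tuning assumption)
    return solr.query(q).getResults();
}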

  6. For each record in the candidate set, compared the fields and checked for duplication: match(r1, r2).

The respective fields of both records are compared against each other to generate a score; if the score is greater than or equal to a predefined threshold, r2 is considered a duplicate of r1.
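
A sketch of match(r1, r2) under the stated rule, using a normalized Levenshtein similarity from commons-lang3 as one possible fuzzy name comparison (the 0.85 threshold, the getters, and the choice of Levenshtein are all placeholders, not the question's actual scoring):

import org.apache.commons.lang3.StringUtils;

// Names fuzzy, phone/email exact, per the rule
// (firstname AND lastname) AND (phone OR email).
static final double THRESHOLD = 0.85;   // illustrative cutoff

static boolean match(Record r1, Record r2) {
    boolean namesMatch = nameSimilarity(r1.getFirstName(), r2.getFirstName()) >= THRESHOLD
            && nameSimilarity(r1.getLastName(), r2.getLastName()) >= THRESHOLD;
    boolean contactMatch = equalsNonNull(r1.getPhone(), r2.getPhone())
            || equalsNonNull(r1.getEmail(), r2.getEmail());
    return namesMatch && contactMatch;
}

// Normalized Levenshtein similarity in [0,1]; 1.0 means identical.
static double nameSimilarity(String a, String b) {
    if (a == null || b == null) return 0.0;
    int dist = StringUtils.getLevenshteinDistance(a.toLowerCase(), b.toLowerCase());
    int maxLen = Math.max(a.length(), b.length());
    return maxLen == 0 ? 1.0 : 1.0 - (double) dist / maxLen;
}

static boolean equalsNonNull(String a, String b) {
    return a != null && a.equalsIgnoreCase(b);
}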

FYI: each record is less than 200 bytes.

I need help with the following:

  1. How can I make this more performance oriented? The match step could run on a separate executor.
  2. I need to generate a report, but I do not want to compromise performance. How and where can I write the matched records? (One possible decoupling is sketched below.)
  3. What else can be done to make it better?
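
One way to decouple report writing from matching (question 2) is a bounded BlockingQueue drained by a single writer thread, so match threads only enqueue results and never block on file I/O. The file name, CSV format, and POISON sentinel below are placeholders, not a definitive design:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Match threads offer result lines; one writer thread drains the queue
// to disk. The POISON sentinel tells the writer to stop.
BlockingQueue<String> reportQueue = new LinkedBlockingQueue<>(10_000);
final String POISON = "__END__";

Thread writer = new Thread(() -> {
    try (BufferedWriter out = new BufferedWriter(new FileWriter("duplicates.csv"))) {
        String line;
        while (!(line = reportQueue.take()).equals(POISON)) {
            out.write(line);
            out.newLine();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
});
writer.start();

// Inside a CompareTask, after a positive match(r1, r2):
//     reportQueue.put(r1.getId() + "," + r2.getId());
// After all match tasks have completed:
//     reportQueue.put(POISON);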

Thanks

VirtualLogic
  • Premature optimisation is the root of all evil - Prof Don Knuth. Have you benchmarked your algorithm above first? –  Feb 04 '15 at 18:42
  • Yes, I have done the benchmarking... do you need specific data points regarding the benchmark? – VirtualLogic Feb 04 '15 at 18:44
  • It would be good to present your results and also identify where you think the performance bottleneck is, or even why you think this is not the most optimal solution for your task. –  Feb 04 '15 at 18:46
  • 1. Loading all data in one go from Solr. 2. Hitting Solr for every record to get candidate matches. 3. Increasing the pool size results in more threads in a monitor/waiting state. – VirtualLogic Feb 04 '15 at 18:50
  • What about using Java streams? That way you don't need to load all the data first; just hold a data structure that keeps all unique values as they stream in, based on your duplicate-checking predicate. You can use parallelStream to optimise that as well. Also, have you explored creating hash values for your objects and using those hash values to compare for duplicates? –  Feb 04 '15 at 18:53
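
To illustrate the last comment's suggestion: a parallel stream with a concurrent set of composite keys finds exact duplicates in a single pass, but it cannot express the fuzzy name comparison, so it could only replace the Solr lookup for the exact-match part of the rule. The key format is an assumption; the phone-based key shown would need a parallel email-based key to cover the (phone OR email) condition:

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Exact-duplicate pass only: a record is flagged when its composite key
// (firstname|lastname|phone) was already seen by another thread. Fuzzy
// matching still needs a candidate search such as the Solr query above.
Set<String> seen = ConcurrentHashMap.newKeySet();
List<Record> exactDupes = records.parallelStream()
        .filter(r -> !seen.add((r.getFirstName() + "|" + r.getLastName()
                + "|" + r.getPhone()).toLowerCase()))
        .collect(Collectors.toList());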

0 Answers