I am doing record matching to find possible duplicate records. A record is considered a duplicate of another record if (firstname and lastname) and (phone or email) match. Name fields are compared either exactly or fuzzily (edit distance, phonetic); phone and email are compared exactly.
My approach is:
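To make the rule concrete, here is a minimal sketch of it in Java. The `Rec` class, its field names, and the `maxEdits` parameter are assumptions for illustration, and a plain Levenshtein edit distance stands in for whatever fuzzy comparison (distance or phonetic) is actually used.

```java
final class DuplicateRule {
    // Hypothetical record shape; the real field names may differ.
    static final class Rec {
        final String firstname, lastname, phone, email;
        Rec(String firstname, String lastname, String phone, String email) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.phone = phone;
            this.email = email;
        }
    }

    // Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Fuzzy name comparison: case-insensitive, within maxEdits edits (0 means exact).
    static boolean nameMatches(String a, String b, int maxEdits) {
        if (a == null || b == null) return false;
        return editDistance(a.toLowerCase(), b.toLowerCase()) <= maxEdits;
    }

    // Exact comparison; missing or empty values never match.
    static boolean exactMatches(String a, String b) {
        return a != null && !a.isEmpty() && a.equals(b);
    }

    // (firstname AND lastname) AND (phone OR email)
    static boolean isDuplicate(Rec r1, Rec r2, int maxEdits) {
        boolean names = nameMatches(r1.firstname, r2.firstname, maxEdits)
                && nameMatches(r1.lastname, r2.lastname, maxEdits);
        boolean contact = exactMatches(r1.phone, r2.phone)
                || exactMatches(r1.email, r2.email);
        return names && contact;
    }
}
```

The cheap exact checks on phone and email could also be evaluated first, so the edit distance is computed only for pairs that already share a contact field.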
- Indexed the records (100,000) in Solr
- Fetched all records from Solr
- Split the records into multiple batches
- Assigned a thread to each batch using ExecutorService
executor.submit(new CompareTask(batch1));
- For each record in a batch, performed a Solr fuzzy search on firstname and lastname to get candidate records.
for (Record r1 : batch1) getCandidates(r1);
Solr query to get candidates: firstname:smith~0.5 OR lastname:james~0.5
- For each candidate record, matched its fields against the current record to check for duplication: match(r1, r2)
The corresponding fields of both records are compared against each other to generate a score. If the score is greater than or equal to a predefined threshold, then r2 is a duplicate of r1.
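The steps above can be sketched roughly as follows. This is only an illustration under assumptions: `BatchMatcher`, `split`, and `findDuplicates` are hypothetical names, `getCandidates` is a placeholder function standing in for the Solr fuzzy query, and `match` stands in for the score-versus-threshold check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiPredicate;
import java.util.function.Function;

final class BatchMatcher {
    // Splits a list into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> split(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    // Submits one task per batch; each task returns the (r1, r2) pairs it matched.
    static <T> List<List<T>> findDuplicates(List<T> records, int batchSize, int threads,
            Function<T, List<T>> getCandidates, BiPredicate<T, T> match) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<List<T>>>> futures = new ArrayList<>();
            for (List<T> batch : split(records, batchSize)) {
                Callable<List<List<T>>> task = () -> {
                    List<List<T>> pairs = new ArrayList<>();
                    for (T r1 : batch) {
                        // Placeholder for the Solr candidate lookup.
                        for (T r2 : getCandidates.apply(r1)) {
                            // Skip comparing a record with itself (same instance).
                            if (r1 != r2 && match.test(r1, r2)) {
                                pairs.add(Arrays.asList(r1, r2));
                            }
                        }
                    }
                    return pairs;
                };
                futures.add(executor.submit(task));
            }
            // Collect results in submission order.
            List<List<T>> all = new ArrayList<>();
            for (Future<List<List<T>>> f : futures) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

In this form each task keeps its matches in a local list and returns them through its `Future`, so the worker threads share no mutable state. Note that a symmetric `match` reports every duplicate pair twice, once as (r1, r2) and once as (r2, r1).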
FYI: each record is less than 200 bytes.
I need help with:
- How can I make this perform better? For example, match could be run on a separate executor.
- I need to generate a report without compromising performance. How and where should I write the matched records?
- What else can be done to improve this?
Thanks