Aggregating and deduplicating information extracted from multiple web sites

Question

I am building a database of timing and address information for restaurants, extracted from multiple web sites. Since information for the same restaurant may be present on several web sites, the database will contain some near-duplicate copies.

The number of restaurants is large, say 100000. Then for each new entry I have to do order of 100000^2 comparison to check whether a restaurant with a nearly similar name is already present. Is there a more efficient approach than that? Thank you.

Suppose there are n entries to be inserted in the database, so for second entry you need 1 comparison, for third you need 2 comparison, for fourth 3 comparison and so on. So total number of comparisons=1+2+3+...+(n-1) i.e. O(n^2). — Joy, Jan 30 '13 at 10:49
You are looking at the effort of filling your whole database, while in the text you wrote: _for each new entry I have to do order of 100000^2 comparison_. No, for each new entry you have O(n) with a naive approach. Databases also have indexes which can help reduce the effort. If you think you can't use an index for checking similarity, I'd say you can. Just define your similarities good enough and fill an indexed column with the "reduced for similarity check" name. — class stacker, Jan 30 '13 at 11:04
Sorry I was wrong. In total I need O(n^2) comparison but not for each. — Joy, Jan 30 '13 at 14:08
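
The indexed "reduced for similarity check" column suggested in the comment above could look roughly like the sketch below. It is only an illustration: the function name `similarity_key` and the `STOPWORDS` list are assumptions, not something from the thread, and the normalization rules would need tuning for real restaurant data.

```python
import re
import unicodedata

# Hypothetical filler words; adjust this list to your own data.
STOPWORDS = {"the", "restaurant", "cafe", "and"}

def similarity_key(name: str) -> str:
    """Reduce a name to a canonical key suitable for an indexed column."""
    # Strip accents: "Café" becomes "Cafe".
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop apostrophes so "Joe's" and "Joes" produce the same token.
    text = text.replace("'", "")
    # Lowercase and keep only runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove filler words and sort so that word order does not matter.
    # Fall back to all tokens if the name consists only of filler words.
    kept = sorted(t for t in tokens if t not in STOPWORDS) or sorted(tokens)
    return " ".join(kept)
```

Storing this key in an indexed column turns the exact-key part of the duplicate check into a single index lookup per new entry instead of a scan over all n rows. It only catches names that normalize to the same key, so misspellings still need a fuzzier comparison.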

score 1 · Answer 1 · answered Mar 02 '13 at 09:43

Basically, you're looking for a record linkage tool. These tools can index records, then for each record quickly locate a small set of potential candidates, then do more detailed comparison on those. That avoids the O(n^2) problem. They also have support for cleaning your data before comparison, and more sophisticated comparators like Levenshtein and q-grams.
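
For illustration, the two comparators named above can be sketched in plain Python. This is a minimal sketch and not how Duke or any other record linkage tool implements them; the function names and the `#` padding character are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character edits from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def qgrams(s: str, q: int = 2) -> set[str]:
    """Set of overlapping substrings of length q, padded at both ends."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets, between 0.0 and 1.0."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

These detailed comparisons are relatively expensive, which is why they are run only on the small candidate set returned by the index and not on every pair of records.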

The record linkage page on Wikipedia used to have a list of tools on it, but it was deleted. It's still there in the version history if you want to go look for it.

I wrote my own tool for this, called Duke, which uses Lucene for the indexing, and has the detailed comparators built in. I've successfully used it to deduplicate 220,000 hotels. I can run that deduplication in a few minutes using four threads on my laptop.

score 0 · Accepted Answer · answered Jan 30 '13 at 13:04

One approach is to structure your similarity function such that you can look up a small set of existing restaurants to compare your new restaurant against. This lookup would use an index in your database and should be quick.

How to define the similarity function is the tricky part :) Usually you can translate each record to a series of tokens, each of which is looked up in the database to find the potentially similar records.
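
A rough sketch of that token lookup, using an in-memory dictionary in place of a database index. The class `TokenIndex`, its method names, and the `min_overlap` threshold are all assumptions for illustration; this is not the system described in the blog post.

```python
import re
from collections import defaultdict

class TokenIndex:
    """Inverted index from token to the ids of records containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of record ids
        self._tokens = {}                  # record id -> set of tokens

    @staticmethod
    def tokenize(text):
        # Lowercase and split into runs of letters and digits.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, record_id, text):
        tokens = self.tokenize(text)
        self._tokens[record_id] = tokens
        for token in tokens:
            self._postings[token].add(record_id)

    def candidates(self, text, min_overlap=0.5):
        """Return (record_id, jaccard) pairs sharing enough tokens with text."""
        tokens = self.tokenize(text)
        seen = set()
        for token in tokens:
            seen |= self._postings.get(token, set())
        result = []
        for record_id in seen:
            other = self._tokens[record_id]
            jaccard = len(tokens & other) / len(tokens | other)
            if jaccard >= min_overlap:
                result.append((record_id, jaccard))
        # Best matches first.
        return sorted(result, key=lambda pair: pair[1], reverse=True)
```

For example, after `index.add(1, "Blue Lagoon Restaurant, 12 Main St")`, calling `index.candidates("Blue Lagoon Restaurant 12 Main Street")` only touches records that share at least one token with the new entry, so unrelated restaurants are never compared at all.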

Please see this blog post, which I wrote to describe a system I built to find near duplicates in crawled data. It sounds very similar to what you want to do and since your use case is smaller, I think your implementation should be simpler.
