
I am trying to figure out the best way to locate duplicates in a six-column CSV file. The real data has more than a million rows in it.

The six columns are as follows:

Name, address, city, post-code, phone number, machine number

The data does not have a fixed length, and values in certain columns may be missing in some instances.

I am thinking of using Perl to first normalize all the short forms used in names, cities, and addresses. Fellow Perl enthusiasts from Stack Overflow have helped me a lot.
But there would still be a lot of data that would be difficult to match. So I am wondering whether it is possible to match content based on "LIKENESS / SIMILARITY" (e.g. "google" similar to "gugl"). The likeness matching is needed to overcome errors that crept in while the data was collected.

I have two tasks in hand with respect to the data.

  1. Flag duplicate rows with a certain identifier.
  2. Report the percentage match between similar rows.

I would really appreciate suggestions as to which methods could be employed and which would probably be best, given their respective merits.

Hynek -Pichi- Vychodil
Tanumoy
  • What do you mean by duplicates? You then mention similar rows. – Hynek -Pichi- Vychodil Mar 23 '15 at 12:28
  • What I am trying to say is that the data is about certain entities. While collecting information, somebody might have used the exact data that was entered earlier, which would make it a duplicate. In other cases the information collector might have used short forms or missed information here and there, which would make the data similar rather than duplicate. – Tanumoy Mar 23 '15 at 12:31
  • That's a very vague definition for a computer program. – Hynek -Pichi- Vychodil Mar 23 '15 at 13:13

1 Answer


You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.

Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto-incremented ID column if your CSV data doesn't already have unique IDs.
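
For example, here is a minimal sketch of the import in psql, assuming a table named whatever (to match the queries below); the column names and the data.csv path are just placeholders:

CREATE TABLE whatever (
    id             bigserial PRIMARY KEY,  -- auto-incremented surrogate key
    name           text,
    address        text,
    city           text,
    post_code      text,
    phone_number   text,
    machine_number text
);

-- In psql, \copy reads the CSV from the client side; drop HEADER if your file has no header row.
\copy whatever (name, address, city, post_code, phone_number, machine_number) FROM 'data.csv' WITH CSV HEADER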

Once the import is done, add indexes on the columns you want to check for duplicates.

CREATE INDEX name ON whatever (name);

You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.

SELECT id
FROM   whatever t1
JOIN   whatever t2 ON t1.id < t2.id
WHERE  t1.name = t2.name

PostgreSQL has powerful string functions including regexes to do the comparisons.

Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you want to search case-insensitively, you can add an index on the lower-cased name. (Thanks @asjo for pointing that out.)

CREATE INDEX ON whatever ((lower(name)));

-- This will be muuuuuch faster
SELECT id
FROM   whatever t1
JOIN   whatever t2 ON t1.id < t2.id
WHERE  lower(t1.name) = lower(t2.name)

A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.

Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.

UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ', 'g');

Finally, you can use the Postgres Trigram module to add fuzzy indexing to your table (thanks again to @asjo).
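
A rough sketch of what that could look like, again assuming the whatever table from above (the % operator uses pg_trgm's default similarity threshold):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index supports the % similarity operator.
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);

-- Pairs of similar names, with a 0-to-1 similarity score for the "percentage match".
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS score
FROM   whatever t1
JOIN   whatever t2 ON t1.id < t2.id
WHERE  t1.name % t2.name;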

Schwern
  • Are you sure you can run this select on a table which _has more than a million rows in it_ and expect a result in less than 3 days? – Hynek -Pichi- Vychodil Mar 23 '15 at 17:32
  • @Hynek-Pichi-Vychodil Yes, a million rows isn't much for a modern computer. Might have to throw some indexes on the columns you're checking for duplicates. It's not the most performant thing, but fortunately the OP only needs to do it once. – Schwern Mar 23 '15 at 19:48
  • But in your case you will do `0.5*n*(n-1)` comparisons. Still no problem? – Hynek -Pichi- Vychodil Mar 23 '15 at 19:56
  • @Hynek-Pichi-Vychodil The indexed version returns in 2 seconds for 500,000 rows. Once you go off indexes, the trouble starts. Got a better algorithm? Let's hear your answer. – Schwern Mar 23 '15 at 19:57
  • If I knew the answer I would write it. But your answer is no answer. Anybody can write a wrong answer. That is why I asked the question in the comments below the OP. – Hynek -Pichi- Vychodil Mar 23 '15 at 20:00
  • @Hynek-Pichi-Vychodil The OP can run the slow queries while they're waiting around for someone to post a better algorithm. It's not always about raw performance; with a one-shot task like this it's about just getting it done. – Schwern Mar 23 '15 at 20:01
  • For fuzzy matching, the trigram contrib module pg_trgm might be of use: http://www.postgresql.org/docs/9.4/static/pgtrgm.html – asjo Mar 23 '15 at 20:07
  • It is clearly stated in the OP that _the real data has more than a million rows in it_, and also _so I am wondering is it possible to match content based on "LIKENESS / SIMILARITY"_. I'm a DWH expert with 12 years of experience. I even write special DB engines for special purposes, and I can tell you it is a nearly impossible task without further information. Especially if someone suggests an `O(N^2)` algorithm for a one-million-row task, they don't know what they're talking about. – Hynek -Pichi- Vychodil Mar 23 '15 at 20:08
  • Note that you can make indexes on functions in Postgres, so you don't have to create a column with LOWER(name) in it; you can just make an index of LOWER(name). – asjo Mar 23 '15 at 20:11
  • @asjo Thanks, I've incorporated them into my answer. I've never used pg_trgm, could you post an answer featuring that module? – Schwern Mar 23 '15 at 20:25
  • @Hynek-Pichi-Vychodil How do you get to the conclusion that this will be O(n²)? The database uses indexes to find matches - it doesn't have to compare with each and every other record. – asjo Mar 23 '15 at 20:43
  • @asjo: The original answer didn't contain an index. See the edits. – Hynek -Pichi- Vychodil Mar 23 '15 at 20:47
  • @Hynek-Pichi-Vychodil As a database expert with many years of experience, you certainly know that indexes are commonly used? – asjo Mar 23 '15 at 20:50
  • @Hynek-Pichi-Vychodil By taking a shot at an answer, and improving it with user feedback, we came to a better answer. How about that? Next time, relax and work the problem a bit before declaring it impossible. I suggest we clean up this comment thread. I'll delete my half of the conversation once yours is gone. – Schwern Mar 23 '15 at 20:52
  • @Schwern: You are making excuses. Without knowing what similarity means, you can't answer this question. If the similarity is an edit distance (Levenshtein, Damerau-Levenshtein, Wagner-Fischer and so on) your solution still doesn't work, because you can't use an index anyway. So I see very little improvement. You now know that without an index you can't solve the task, and for some tasks there is no proper index. – Hynek -Pichi- Vychodil Mar 23 '15 at 21:02
  • @Hynek-Pichi-Vychodil That's exactly what the [pg_trgm indexes](http://www.postgresql.org/docs/9.4/static/pgtrgm.html) do: index similarity by distance. Your objections are valid, but you keep telling us it's impossible and to stop trying; we keep working the problem and get a better answer. This isn't a competition. Try not talking and listening instead; I've learned a lot in this answer. – Schwern Mar 23 '15 at 21:05
  • @Schwern You should read more carefully. The pg_trgm index uses trigram similarity. If you look at my comment, I was writing about different similarities and even enumerated some of them. I see you learned a lot, to add an index for example. I would like to be your customer and have you learn even basic stuff on me. – Hynek -Pichi- Vychodil Mar 23 '15 at 21:11
  • @Hynek-Pichi-Vychodil You are not being constructive. When a question is unclear, providing a workable answer for a subset of the question, or a clearly defined variation of the question, is valuable. – asjo Mar 23 '15 at 21:13
  • @asjo Look, I have seen so-called professionals milling around a server with a price tag of hundreds of thousands of bucks, running a query again and again, day after day, waiting to see whether the 15-hour timeout would kill it. Sorry, maybe I am overreacting. – Hynek -Pichi- Vychodil Mar 23 '15 at 21:18
  • BTW, in the example used in the OP, `google` and `gugl` have a similarity of 0.083, which is considered almost zero by any measure. So even using pg_trgm does not solve the task described in the OP. – Hynek -Pichi- Vychodil Mar 23 '15 at 22:47
  • @Hynek-Pichi-Vychodil `google` and `gugl` have the same [metaphone codes](https://en.wikipedia.org/wiki/Metaphone), `KKL`, which is one of the [fuzzy comparison algorithms](http://www.postgresql.org/docs/9.4/static/fuzzystrmatch.html) mentioned in my answer. If you don't like the parameters of the question, ask the OP to clarify them. I'm dealing with the uncertainty by offering the OP options. – Schwern Mar 23 '15 at 22:58