Fast data comparing between two tables

Question

I'm using Oracle11g and i would compare two tables finding records that match between them.

Example:

Table 1        Table 2

George         Micheal
Michael        Paul

The record "Micheal" and "Michael" match between them, so they are good record.

To see if two records match, i use the Oracle function utl_match.edit_distance_similarity.

I tried with the code below, but i have a performance problem (it is too slow):

SELECT * 
FROM table1
JOIN table2
ON utl_match.edit_distance_similarity(table1.name, table2.name) > 75;

Is there a better solution?

Thank you

Join on coulmn names. `SELECT t1.* FROM table1 t1 JOIN table2 t2 ON t1.col1 = t2.col1` — XING, Feb 02 '17 at 10:54
This not cover all possible matches (see Michael and Micheal) — user2520969, Feb 02 '17 at 11:03
Yes but they both are not same. So you are looking or possible match. In tat case you can match first 3 or 4 charatcters and check. — XING, Feb 02 '17 at 11:06
I believe that the first step would be deciding whether you need exactly the logic that `edit_distance_similarity` implements or not. If you need exactly what the function does, I believe it would be hard to implement that function in a more efficient way; if something different could be right, please post the logic you may need — Aleksej, Feb 02 '17 at 11:13
How many different names do you have in Table1 and Table2? And how many entries at all? — Frank Ockenfuss, Feb 02 '17 at 11:42
The query needs to make about 1 trillion comparisons with a slow function, so it's not surprising that it will take a very long time. Your best chance is if there are a lot of duplicate names (see Frank's answer); it is not likely that you have 1 million distinct names. — , Feb 02 '17 at 20:32

score 1 · Answer 1 · answered Feb 02 '17 at 11:49

1

This is a hard problem. In general, it is going to result in nested loop joins and slowness. It might be possible to use SOUNDEX() to get "closish" matches and then the character distance function for final filtering. This may not work for your problem, but it might.

Although I am not a big fan of the function, you might find that soundex() works for your purposes (see here).

The idea would be to add an index on this value:

create index idx_table1_soundexname on table1(soundex(name));
create index idx_table2_soundexname on table2(soundex(name));

Then you would query this as:

SELECT * 
FROM table1 t1 JOIN
     table2 t2
     ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;

The idea is that Oracle will use the indexes to get names that are "close" and then the edit distance to get the better matches. This may not work for your problem. It is just an idea that might work.

answered Feb 02 '17 at 11:49

Gordon Linoff

1,242,037
58
646
786

A similar approach to Gordon's might be to join on the first character of each string to get "closish", such as substr( t1.name,1,1) = substr(t2.name,1,1). This will probably do a sort merge join. If you then have horsepower and bandwidth, you can throw parallelism at it. – BobC Feb 02 '17 at 13:46
@BobC - but that will miss good matches if the first character is the only one that's different, right? as in Heller and Geller. – Feb 02 '17 at 20:28
The `soundex()` idea is good, but you should mention explicitly that it is English-centric. Two names may sound very similar in English but not in Czech or Korean, and vice versa. Also, names that are spelled very similarly may sound very different in English, resulting in missed "close matches." – Feb 02 '17 at 20:31
@mathguy Yup, it certainly will. So it's a compromise. However, Gordon's soundex() solution suffers from the same problem as it's heavily leading character biased. Otherwise you have Cartesian join. – BobC Feb 02 '17 at 20:35
@BobC . . . I'm not sure what you mean by "Cartesian join". The indexes are designed to speed up the processing. – Gordon Linoff Feb 03 '17 at 02:34
@GordonLinoff My point was that the question as originally posted would require a cartesian join, as you have to compare every name in table1 with every name in table2. Using soundex() or the first letter as a join condition as a means to reduce the "join space" (ie it will no longer be a cartesian product) both suffer from the same issue in that they will eliminate potentially "similar" names that differ in the first letter, as Mathguy points out. Whether or not an index is a good idea performance wise is a different question, and that is going to depend on selectivity of the soundex. – BobC Feb 03 '17 at 04:53

Frank Ockenfuss · Answer 2 · 2017-02-06T14:58:35.907

In case you have a lot of redundancy with respect to name values in your tables table1 and table2, this could be a solution

-- Test data set

select count(*) from table1;
--> 10.000

select count(*) from table2;
--> 10.000

select count(distinct(name)) from table1;
--> ~ 2500

select count(distinct(name)) from table2;
--> ~ 2500

/* a) Join with function compare */

select table1.name, table2.name
  from table1, table2
 where utl_match.edit_distance_similarity(table1.name, table2.name) > 35

/*

--------------------------------------------------------------------------------
| Id  | Operation            | Name   | Rows    | Bytes     | Cost  | Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |        | 5000000 | 270000000 | 37364 | 00:09:21 |
|   1 |   NESTED LOOPS       |        | 5000000 | 270000000 | 37364 | 00:09:21 |
|   2 |    TABLE ACCESS FULL | TABLE1 |   10000 |    270000 |     5 | 00:00:01 |
| * 3 |    TABLE ACCESS FULL | TABLE2 |     500 |     13500 |     4 | 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 3 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("TABLE1"."NAME","TABLE2"."NAME")>35)


Note
-----
- dynamic sampling used for this statement

*/

/* b) Join with function, only distinct values */

-- A Set of all existing names (in table1 and table2)
 with names as
 (select name from table1 union select name from table2),

-- Compare only once because utl_match.edit_distance_similarity(name1, name2) = utl_match.edit_distance_similarity(name2, name1)
 table_cmp(name1, name2) as
 (select n1.name, n2.name
          from names n1
          join names n2
            on n1.name <= n2.name
           and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)

  select t1.*, t2.*
          from table_cmp c
          join table1 t1
            on t1.name = c.name1
          join table2 t2
            on t2.name = c.name2
        union all
        select t1.*, t2.*
          from table_cmp c
          join table1 t1
            on t1.name = c.name2
          join table2 t2
            on t2.name = c.name1;


/*

--------------------------------------------------------------------------------------------------------------
| Id   | Operation                   | Name                        | Rows     | Bytes      | Cost | Time     |
--------------------------------------------------------------------------------------------------------------
|    0 | SELECT STATEMENT            |                             | 30469950 | 3290754600 | 2495 | 00:00:38 |
|    1 |   TEMP TABLE TRANSFORMATION |                             |          |            |      |          |
|    2 |    LOAD AS SELECT           | SYS_TEMP_0FD9D663E_B39FC2B6 |          |            |      |          |
|    3 |     SORT UNIQUE             |                             |    20000 |     540000 |   12 | 00:00:01 |
|    4 |      UNION-ALL              |                             |          |            |      |          |
|    5 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|    6 |       TABLE ACCESS FULL     | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
|    7 |    LOAD AS SELECT           | SYS_TEMP_0FD9D663F_B39FC2B6 |          |            |      |          |
|    8 |     MERGE JOIN              |                             |  1000000 |   54000000 |   62 | 00:00:01 |
|    9 |      SORT JOIN              |                             |    20000 |     540000 |    3 | 00:00:01 |
|   10 |       VIEW                  |                             |    20000 |     540000 |    2 | 00:00:01 |
|   11 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663E_B39FC2B6 |    20000 |     540000 |    2 | 00:00:01 |
| * 12 |      FILTER                 |                             |          |            |      |          |
| * 13 |       SORT JOIN             |                             |    20000 |     540000 |    3 | 00:00:01 |
|   14 |        VIEW                 |                             |    20000 |     540000 |    2 | 00:00:01 |
|   15 |         TABLE ACCESS FULL   | SYS_TEMP_0FD9D663E_B39FC2B6 |    20000 |     540000 |    2 | 00:00:01 |
|   16 |    UNION-ALL                |                             |          |            |      |          |
| * 17 |     HASH JOIN               |                             | 15234975 | 1645377300 | 1248 | 00:00:19 |
|   18 |      TABLE ACCESS FULL      | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
| * 19 |      HASH JOIN              |                             |  3903201 |  316159281 | 1200 | 00:00:18 |
|   20 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|   21 |       VIEW                  |                             |  1000000 |   54000000 | 1183 | 00:00:18 |
|   22 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663F_B39FC2B6 |  1000000 |   54000000 | 1183 | 00:00:18 |
| * 23 |     HASH JOIN               |                             | 15234975 | 1645377300 | 1248 | 00:00:19 |
|   24 |      TABLE ACCESS FULL      | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
| * 25 |      HASH JOIN              |                             |  3903201 |  316159281 | 1200 | 00:00:18 |
|   26 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|   27 |       VIEW                  |                             |  1000000 |   54000000 | 1183 | 00:00:18 |
|   28 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663F_B39FC2B6 |  1000000 |   54000000 | 1183 | 00:00:18 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 12 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("N1"."NAME","N2"."NAME")>35)
* 13 - access("N1"."NAME"<="N2"."NAME")
* 13 - filter("N1"."NAME"<="N2"."NAME")
* 17 - access("T2"."NAME"="C"."NAME2")
* 19 - access("T1"."NAME"="C"."NAME1")
* 23 - access("T2"."NAME"="C"."NAME1")
* 25 - access("T1"."NAME"="C"."NAME2")


Note
-----
- dynamic sampling used for this statement

*/

Are you not just trading two sorts (SELECT DISTINCT) in you CTE to what would probable be a SORT MERGE join anyway? — BobC, Feb 03 '17 at 09:53

Fast data comparing between two tables

2 Answers2