-1

I have two columns and I would like to make a comparison to find out if they are equal, the problem is when same keyword is written in different way.

For example if column_1 = 123 Maryland Ave and column_2 = 123 Maryland Avenue these two column should be equal and I would like to create a third column in the query to show if they are equal or not, thank you for your help!

Column_1                   Column_2                  Equal?
-----------------------------------------------------------
 123 Maryland Ave           123 Maryland Avenue       Yes
 456 Maryland Ave           123 Maryland Ave          No
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Nari
  • 30
  • 1
  • 3
  • SoundEx is another alternative to littlefoot's response. Or you could use both and have a a threshold for both. But I've used this before and it's ok. The best option in my opinion is a to use an API to the USPS mail address Validation/normalization system for both columns and then compare. – xQbert Dec 02 '21 at 20:56
  • Please, describe **the rule** to consider some values as equal. Example is not a rule, computer cannot understand examples without implementation of some learning algorithm beforehand. You didn't explicitly mention if columns contain US address, then it's hard to provide any generic way to solve the problem. – astentx Dec 02 '21 at 21:07
  • Possible duplicate of https://stackoverflow.com/q/47529018/1509264 or https://stackoverflow.com/q/38154986/1509264 – MT0 Dec 02 '21 at 21:09

1 Answers1

0

One option is to check similarity between those values:

SQL> with test (id, col1, col2) as
  2    (select 1, '123 Maryland Ave', '123 Maryland Avenue' from dual union all
  3     select 2, '456 Maryland Ave', '123 Maryland Ave'    from dual
  4    )
  5  select id, col1, col2,
  6    utl_match.jaro_winkler_similarity(col1, col2) as sim
  7  from test;

        ID COL1             COL2                       SIM
---------- ---------------- ------------------- ----------
         1 123 Maryland Ave 123 Maryland Avenue         96
         2 456 Maryland Ave 123 Maryland Ave            87

SQL>

Now, you have to decide the threshold which satisfies your needs. Is it 90%? Let's suppose it is. Then you'd use CASE expression:

SQL> with test (id, col1, col2) as
  2    (select 1, '123 Maryland Ave', '123 Maryland Avenue' from dual union all
  3     select 2, '456 Maryland Ave', '123 Maryland Ave'    from dual
  4    )
  5  select id, col1, col2,
  6    case when utl_match.jaro_winkler_similarity(col1, col2) > 90 then 'Yes'
  7         else 'No'
  8    end as equal
  9  from test;

        ID COL1             COL2                EQUAL
---------- ---------------- ------------------- -------
         1 123 Maryland Ave 123 Maryland Avenue Yes
         2 456 Maryland Ave 123 Maryland Ave    No

SQL>
Littlefoot
  • 131,892
  • 15
  • 35
  • 57
  • Thank you very much, this worked and helped a lot! I have one quick question. I have a situation on having two addresses: 123 East Johnson Street, Apt A ----- and----- 123 E Johnson StAptB. These two should be the same but the percentage returns below 90% accuracy. Is there away to search for words like Street and treat it like St, also do not count , as well. In python these seemed to be a little bit easier. Thank you again! – Nari Dec 07 '21 at 20:21
  • You could use REPLACE and do e.g. `replace(address, 'street', 'st')` and then compare such values. Mind letter case! – Littlefoot Dec 07 '21 at 20:22