1

I have two columns which i want to compare, and if the strings are the same, with a maximum difference or LACK of one character, i want to make a flag to it. So for example:

select
    ,name1
    ,name2
    ,case when "name1 is like name2 except only 1 different character, or 
               lack of 1 character compared to the other" then 1 
          else 0 
     end same_flag
from example

Example output:

name1    -     name 2   -  sameflag   

john     -     jon      -        1    
sara     -     sarah    -        1    
filip    -     filis    -        1    
phillip  -     philis   -        0

I want it to work vica-versa. So name1 can be different from name2, but in an other row name2 can be different from name1.

Looz
  • 377
  • 2
  • 14

3 Answers3

2

You could pick one of the functions from the utl_match package:

with data (name1, name2) as (
  select'john','jon' from dual union all    
  select'sara','sarah' from dual union all    
  select'filip','filis' from dual union all    
  select'phillip','philis' from dual 
)
select name1, name2, 
       utl_match.edit_distance(name1, name2) as ed,
       utl_match.edit_distance_similarity(name1, name2) as ed_similarity,
       utl_match.jaro_winkler(name1, name2) as jw,
       utl_match.jaro_winkler_similarity(name1, name2) as jw_similarity
from data;

returns:

NAME1   | NAME2  | ED | ED_SIMILARITY | JW   | JW_SIMILARITY
--------+--------+----+---------------+------+--------------
john    | jon    |  1 |            75 | 0.93 |            93
sara    | sarah  |  1 |            80 | 0.96 |            96
filip   | filis  |  1 |            80 | 0.92 |            92
phillip | philis |  2 |            72 | 0.91 |            90

Depending on your needs and how you like the results you can do something like:

case when utl_match.edit_distance(name1, name2) < 2 then 1 else e end

Or use the percentage as threshold:

case when utl_match.edit_distance_similarity(name1, name2) > 75 then 1 else e end
  • Thank you! I only needed this line: case when utl_match.edit_distance(name1, name2) < 2 then 1 else e end – Looz Jun 06 '18 at 14:06
  • Can you please answer one more question? I want to understand that. On the link you attached. the first example says: The Edit Distance between strings "shackleford" and "shackelford" = 2. Thats easy to understand, we need to change the 'e' to 'l' and the 'l' to 'e'. Thats two steps. But after that, in a table it says: The Edit distance between Dunningham and Cunnigham is 80. How can it be 80?? Its there btw: https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm#CHDJAIJD – Looz Jun 06 '18 at 14:24
  • That's the result of the `similarity` version which is defined as "*returning a value between 0 (no match) and 100 (perfect match)*" –  Jun 06 '18 at 14:27
1

This is very linear - just loops over the letters and counts the differences.

I've updated this - now Richard and Rchard would be considered the same...

  FUNCTION compare_strings
   (P_string1        IN VARCHAR2
   ,P_string2        IN VARCHAR2)
  RETURN NUMBER
  IS

    l_long_string    VARCHAR2(100) ;
    l_short_string   VARCHAR2(100) ;
    l_diff_count     NUMBER := 0 ;

    l_result         NUMBER ;

    j                NUMBER := 1 ;
    k                NUMBER := 1 ;

  BEGIN

    IF LENGTH(P_string1) >= LENGTH(P_string2) THEN
      l_long_string := P_string1 ;
      l_short_string := P_string2 ;
    ELSE
      l_long_string := P_string2 ;
      l_short_string := P_string1 ;
    END IF ;


    --if one string is more than one char longer than the other then we must
    --have a difference
    IF LENGTH(l_long_string) - LENGTH(l_short_string) > 1 THEN
      l_result := 0 ;
    END IF ;


    FOR i IN 1..LENGTH(l_long_string) LOOP


     IF NVL(SUBSTR(P_string1,j,1),'##') != NVL(SUBSTR(P_string2,k,1),'##') THEN
       l_diff_count := l_diff_count + 1 ;
       --shift along one letter in the long string but stay put in the short string
       j := j + 1 ;
     ELSE
       --shift along on both strings
       j := j + 1 ;
       k := k + 1 ;
     END IF ;
     --EXIT WHEN l_diff_count > 1 ;


    END LOOP ;

    IF l_diff_count > 1 THEN
      l_result := 1;
    ELSE
      l_result := 0 ;
    END IF ;

    RETURN(l_result) ;
    --RETURN(l_diff_count) ;

  END compare_strings ; 
Christian Palmer
  • 1,262
  • 8
  • 10
  • Thinking about it I am not sure this is really a great solution. I am guessing that you intend Richard and Rchard to be considered the same because the second is missing a single i, but my function would say there are 5 differences. – Christian Palmer Jun 06 '18 at 12:28
  • I understand your logic, but i dont have the authority to write functions. I need to do this in a simple 'case when' – Looz Jun 06 '18 at 12:30
  • and Yes, thats fine, in my case Richard and Rchard would be the same – Looz Jun 06 '18 at 12:34
  • I've updated my answer - it's better now. I can see you're looking for a pure SQL solution but I don't know if that is going to be at all easy. CASE can mimic IF in PL/SQL but the LOOP is harder to recreate - you'll need nested CASE statements - at least one level per letter in the shortest string. And you won't know this length until runtime – Christian Palmer Jun 06 '18 at 12:47
  • Thinking about it you might be able to use a recursive query to loop over the characters, using the LEVEL pseudocolumn to give you the position for the SUBSTR. I am really sorry but I don't have enough time right now to think that through right now – Christian Palmer Jun 06 '18 at 12:54
  • also, it might be an idea to remove the PLSQL tag from your question - you are after a pure SQL answer – Christian Palmer Jun 06 '18 at 13:00
  • See also **UTL_MATCH.EDIT_DISTANCE** https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm – kfinity Jun 06 '18 at 13:33
0

Try this and adapt to count lengths in both and compare.

How to find count and names of distinct characters in string in PL/SQL

Not actually my answer, but this gives the basis to compute lenght and compare differences numerically.

thebluephantom
  • 16,458
  • 8
  • 40
  • 83