0

I have two tables, first called list_city contain list of city in Europe, example:

=========================
= ID = CITY   = COUNTRY =
=========================
= 1  = LONDON = UK      =
= 2  = PARIS  = France  =
= 3  = ROME   = Italy   =
=========================

and the second one is data_customer, contain id_cust and city (but in mess typing)

====================
= ID_CUST = CITY   =
====================
= 012AGH  = paris  =
= 2X4BV   = London =
= M3RT45  = romE   =
= 1F546   = Lndon  =
= 345GC   = PArs   =
= 54A78   = roma   =
====================

How I can matching in the city in the second table with the city in the first table in mysql so the output will be:

====================
= ID_CUST = CITY   =
====================
= 012AGH  = PARIS  =
= 2X4BV   = LONDON =
= M3RT45  = ROME   =
= 1F546   = LONDON =
= 345GC   = PARIS  =
= 54A78   = ROME   =
====================

I tried

SELECT c1.ID_CUST, c2.CITY FROM data_customer c1 INNER JOIN list_city c2
ON c1.CITY LIKE concat('%',c2.CITY,'%');

But, It's seem not work as Iexpected. It's possible do it only in mysql? If not can anyone give me another solution? Thank you very much I really appreciate your time to help me.

Arthur
  • 17
  • 6
  • If you want to uppercase, use `UPPER()` – marmeladze Nov 18 '22 at 03:34
  • Not only the uppercase, but some row contain incomplete word or wrong misspelling – Arthur Nov 18 '22 at 03:35
  • If you want the tupels from data_customer which have a corresponding entry with different upper/lower case in list_city this might work: `SELECT c1.ID_CUST, c2.CITY FROM data_customer c1, list_city c2 WHERE UPPER(c1.CITY)=UPPER(c2.city);` For finding entries with typos you might use `SOUNDEX` insted of `UPPER`? – Matthias Radde Nov 18 '22 at 05:06
  • Thank you for your answer, but my objective is to change city in data_customer to have similar format, all typo corrected, based on city in list_city, so I can easily group data customer by city – Arthur Nov 18 '22 at 05:52
  • @MatthiasRadde Explicit UPPER() is obviously excess. You'd use explicit COLLATE instead. And try do not use ancient comma-style joins. – Akina Nov 18 '22 at 06:04

1 Answers1

1

Use Levenshtein distance.

SELECT DISTINCT
       data_customer.id_cust,
       FIRST_VALUE(list_city.city) OVER ( PARTITION BY data_customer.id_cust 
                                         ORDER BY levenshtein(UPPER(list_city.city), UPPER(data_customer.city))) city
FROM list_city
CROSS JOIN data_customer;

Of course you must create user-defined function:

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
RETURNS INT
DETERMINISTIC 
BEGIN 
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
  DECLARE s1_char CHAR; 
  DECLARE cv0, cv1 VARBINARY(256); 
  SET s1_len = CHAR_LENGTH(s1), 
      s2_len = CHAR_LENGTH(s2), 
      cv1 = 0x00, 
      j = 1, 
      i = 1, 
      c = 0; 
  IF s1 = s2 THEN 
    RETURN 0; 
  ELSEIF s1_len = 0 THEN 
    RETURN s2_len; 
  ELSEIF s2_len = 0 THEN 
    RETURN s1_len; 
  ELSE 
    WHILE j <= s2_len DO 
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), 
          j = j + 1; 
    END WHILE;
    WHILE i <= s1_len DO 
      SET s1_char = SUBSTRING(s1, i, 1), 
          c = i, 
          cv0 = UNHEX(HEX(i)), 
          j = 1;
      WHILE j <= s2_len DO 
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN 
          SET cost = 0; 
        ELSE 
          SET cost = 1;
        END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN
          SET c = c_temp;
        END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN
          SET c = c_temp;
        END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), 
            j = j + 1;
      END WHILE;
      SET cv1 = cv0, 
          i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END;

fiddle

Akina
  • 39,301
  • 5
  • 14
  • 25