How to get almost matching string from Oracle table?

Question

I have a table in Oracle with four columns. Table Data in Oracle.

Now user can enter input string as "operation Knee right" (which is valid) to my query and my query should return the ICD Code (IKR123) which matches most of the word in DiagnosisName column.

Following is my current query.(Not giving the proper output)

SELECT diagnosisname
FROM
  (SELECT diagnosisname,
    UTL_MATCH.jaro_winkler_similarity('%operation Knee right%',diagnosisname)
  FROM icd_code
  ORDER BY UTL_MATCH.EDIT_DISTANCE_SIMILARITY('%operation Knee right%',diagnosisname) DESC
  )
WHERE ROWNUM<2;

This query giving me the output as "Left Knee Operation" but my expectation is "Right Knee Operation".

score 8 · Accepted Answer · answered Apr 29 '15 at 06:34

There are few things to note about your usage of UTL_MATCH:

EDIT_DISTANCE_SIMILARITY : Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match.
JARO_WINKLER_SIMILARITY : Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match but tries to take into account possible data entry errors.

ORDER BY UTL_MATCH.EDIT_DISTANCE_SIMILARITY('%operation Knee right%',diagnosisname) DESC

This is not going to give you correct result. Since, you are only considering the possible similarity, however, not considering the data entry errors. So, you must use JARO_WINKLER_SIMILARITY.

operation Knee right

You need to keep in mind the CASE of the input and the column values to compare. They must be in similar case for a correct match. You are passing the input in LOWERCASE, however, your column values are in INITCAP. Better convert both the column values and the input to a similar case.

Let's look at the below demonstration to understand:

SQL> WITH DATA AS(
  2  SELECT 'Heart Operation' diagnosis_name, 'IH123' icd_code FROM dual UNION ALL
  3  SELECT 'Knee Operation' diagnosis_name, 'IK123' icd_code FROM dual UNION ALL
  4  SELECT 'Left Knee Operation' diagnosis_name, 'IKL123' icd_code FROM dual UNION ALL
  5  SELECT 'Right Knee Operation' diagnosis_name, 'IKR123' icd_code FROM dual UNION ALL
  6  SELECT 'Fever' diagnosis_name, 'IF123' icd_code FROM dual
  7  )
  8  SELECT t.*,
  9    utl_match.edit_distance_similarity(upper(diagnosis_name),upper('operation Knee right')) eds,
 10    UTL_MATCH.jaro_winkler_similarity (upper(diagnosis_name),upper('operation Knee right')) jws
 11  FROM DATA t
 12  ORDER BY jws DESC
 13  /

DIAGNOSIS_NAME       ICD_CO        EDS        JWS
-------------------- ------ ---------- ----------
Right Knee Operation IKR123         20         72
Knee Operation       IK123          20         70
Heart Operation      IH123          25         68
Left Knee Operation  IKL123         25         64
Fever                IF123          15         47

SQL>

So, you see how both are different from each other. jaro_winkler_similarity does a better job in identifying the data entry errors and giving the most close match. Based on that, simply choose the first row after sorting it in descending order:

SQL> WITH DATA AS(
  2  SELECT 'Heart Operation' diagnosis_name, 'IH123' icd_code FROM dual UNION ALL
  3  SELECT 'Knee Operation' diagnosis_name, 'IK123' icd_code FROM dual UNION ALL
  4  SELECT 'Left Knee Operation' diagnosis_name, 'IKL123' icd_code FROM dual UNION ALL
  5  SELECT 'Right Knee Operation' diagnosis_name, 'IKR123' icd_code FROM dual UNION ALL
  6  SELECT 'Fever' diagnosis_name, 'IF123' icd_code FROM dual
  7  )
  8  SELECT diagnosis_name
  9  FROM
 10    (SELECT t.*,
 11      utl_match.edit_distance_similarity(upper(diagnosis_name),upper('operation Knee right')) eds,
 12      UTL_MATCH.jaro_winkler_similarity (upper(diagnosis_name),upper('operation Knee right')) jws
 13    FROM DATA t
 14    ORDER BY jws DESC
 15    )
 16  WHERE rownum = 1
 17  /

DIAGNOSIS_NAME
--------------------
Right Knee Operation

SQL>

Hi @Lalit, your query working fine but user may also enter **operation knee rt** instead of **operation knee right** as well. Left=lt, Right=lt etc., — shary.sharath, Apr 29 '15 at 06:52
@shary.sharath You could use `DECODE` to customise few things. The feature gives you the most close match, but there are few things which are beyond the scope of the feature. You need to explicitly write the query to handle such cases where there might be an ambiguity. For example, `operation knee rt` will give two rows as both are close match. But, if you decode `rt` as `right` and `lt` as `left`, you will get the desired output. Please mark it as answered. — Lalit Kumar B, Apr 29 '15 at 06:58
Hello @Lalit, I have used your query `SELECT diagnosisname FROM ( SELECT diagnosisname,utl_match.edit_distance_similarity(UPPER(diagnosisname),UPPER('knee operation Left ')) eds, UTL_MATCH.jaro_winkler_similarity (UPPER(diagnosisname),UPPER('knee operation Left ')) jws FROM icd_code ORDER BY jws DESC ) WHERE ROWNUM = 1` but It's returning **Knee operation** instead of **Left Knee Operation**. what should I do now? — shary.sharath, Apr 29 '15 at 09:03
Hmm, that's because `Knee Operation` was a perfect match for the first two words. Let me give you a workaround, meanwhile you could try this http://lalitkumarb.wordpress.com/2014/12/02/split-comma-delimited-string-into-rows-using-oracle-sql/ You can split the string into individual words and do a comparison using `IN` operator. — Lalit Kumar B, Apr 29 '15 at 09:07
for word to word matching in sentence jaro or edit which will be better? — Balasubramanian Ramamoorthi, Dec 27 '21 at 19:23

score 0 · Answer 2 · answered Apr 29 '15 at 06:06

0

Please try this query. This may help to solve your problem.

SELECT diagnosisname 
   FROM (SELECT diagnosisname, UTL_MATCH.jaro_winkler_similarity('%operation Knee right%',diagnosisname) 
   FROM icd_code 
   WHERE UTL_MATCH.jaro_winkler_similarity('%operation Knee right%',diagnosisname) = 100
   ORDER BY UTL_MATCH.EDIT_DISTANCE_SIMILARITY('%operation Knee right%',diagnosisname) DESC) 
WHERE ROWNUM<2

answered Apr 29 '15 at 06:06

Pankaj K

770
7
20

Hello @Pankaj, you are comparing **UTL_MATCH.jaro_winkler_similarity('%operation Knee right%',diagnosisname) = 100** which means first string and second string both should match completely. But my requirement is like even if it matches partially (but almost) should work.
But your query returning nothing. – shary.sharath Apr 29 '15 at 06:30

How to get almost matching string from Oracle table?

2 Answers2