Oracle fuzzy text search with wildcards

Question

I've got a SAP Oracle database full with customer data. In our custom CRM it is quite common to search the for customers using wildcards. In addtion to the SAP standard search, we would like to do some fuzzy text searching for names which are similar to the entered name. Currently we're using the UTL_MATCH.EDIT_DISTANCE function to search for similar names. The only disadvantage is that it is not possible to use some wildcard patterns.

Is there any possiblity to use wildcards in combination with the UTL_MATCH.EDIT_DISTANCE function or are there different(or even better) approaches to do that?

Let's say, there are the following names in the database:

PATRICK NOR
ORVILLE ALEX
OWEN TRISTAN
OKEN TRIST

The query could look like OKEN*IST* and both OWEN TRISTAN and OKEN TRISTAN should be returned. OKEN would be a 100% match and OWEN less.

My current test-query looks like:

SELECT gp.partner, gp.bu_sort1, UTL_MATCH.edit_distance(gp.bu_sort1, ?) as edit_distance, 
      FROM but000 gp
      WHERE UTL_MATCH.edit_distance(gp.bu_sort1, ?) < 4

This query works fine except if wildcards * are used within the search string (which is quite common).

Can you add examples of how you want to combine fuzzy searching with wildcards? For example, are you looking for something like this: "abc*" would match "abDefghijklmnop" better than "a12"? This would only fuzzy match the first 3 characters - anything after the star should not count against the score. — Jon Heller, Feb 04 '17 at 03:26

score 3 · Accepted Answer · answered Feb 07 '17 at 10:52

Beware of the implications of your approach in terms of performances. Even if it "functionally" worked, with UTL_MATCH you can only filter the results obtained by an internal table scan.
What you likely need is an index on such data.
Head to Oracle Text, the text indexing capabilities of Oracle. Bear in mind that they require some effort to be put at work.

You might juggle with the fuzzy operator, but handle with care. Most oracle text features are language dependent (they take into account the English dictionary, German, etc..).

For instance

-- create and populate the table
create table xxx_names (name varchar2(100));

insert into xxx_names(name) values('PATRICK NOR');
insert into xxx_names(name) values('ORVILLE ALEX');
insert into xxx_names(name) values('OWEN TRISTAN');
insert into xxx_names(name) values('OKEN TRIST');
insert into xxx_names(name) values('OKENOR SAD');
insert into xxx_names(name) values('OKENEAR TRUST');

--create the domain index
create index xxx_names_ctx on xxx_names(name) indextype is ctxsys.context;

This query would return results that you'd probably like (input is the string "TRST")

select
  SCORE(1), name
from
  xxx_names n
where
  CONTAINS(n.name, 'definescore(fuzzy(TRST, 1, 6, weight),relevance)', 1) > 0
; 



  SCORE(1) NAME               
---------- --------------------
         1 OWEN TRISTAN        
        22 OKEN TRIST

But with the input string "IST" it would likely return nothing (in my case this is what it does). Also note that in general, inputs of less than 3 characters are considered non-matching by default.
You'll possibly get a more "predictable" outcome if you take off the "fuzzy" requirement and stick to finding rows that just "contains" the exact sequence you passed in.
In this case try using a ctxcat index, which, by the way supports some wildcards (warning: supports multi columns, but a column cannot exceed 30 chars in size!)

-- create and populate the table
--max length is 30 chars, otherwise the catsearch index can't be created
create table xxx_names (name varchar2(30));

insert into xxx_names(name) values('PATRICK NOR');
insert into xxx_names(name) values('ORVILLE ALEX');
insert into xxx_names(name) values('OWEN TRISTAN');
insert into xxx_names(name) values('OKEN TRIST');
insert into xxx_names(name) values('OKENOR SAD');
insert into xxx_names(name) values('OKENEAR TRUST');

begin

ctx_ddl.create_index_set('xxx_names_set');
ctx_ddl.add_index('xxx_names_set', 'name'); 

end;
/

drop index xxx_names_cat;
CREATE INDEX xxx_names_cat ON xxx_names(name) INDEXTYPE IS CTXSYS.CTXCAT
PARAMETERS ('index set xxx_names_set');

The latter, with this query would work nicely (input is "*TRIST*")

select
  UTL_MATCH.edit_distance(name, 'TRIST') dist,
  name
from
  xxx_names
where
  catsearch(name, '*TRIST*', 'order by name desc') > 0
;

      DIST NAME               
---------- --------------------
         7 OWEN TRISTAN        
         5 OKEN TRIST

But with the input "*O*TRIST*" wouldn't return anything (for some reasons).

Bottom line: text indexes are probably the only way to go (for performance) but you have to fiddle quite a bit to understand all the intricacies.

References:

fuzzy search: Oracle Text CONTAINS Query Operators
catsearch : Oracle Text SQL Statements and Operators

Bohemian · Answer 2 · 2017-02-06T07:22:02.473

2

Assuming "wildcard" means an asterisk, you want a name that matches all specified letters to rank highest, with more specified letters matching better than less, otherwise rank by edit distance similarity.

using the placeholder ? for your search term, try this:

select *
from mytable
order by case
      when name like '%' || replace(?, '*', '%') || '%' then 0 - length(replace(?, '*', ''))
      else 100 - UTL_MATCH.edit_distance_similarity(?, name) end
fetch first 10 rows

FYI all "like" matches have a negative number for their ordering with magnitude the number of letters specified. All like misses have a non-negative ordering number with magnitude of the percentage difference. In all cases, a lower number is a better match.

edited Feb 06 '17 at 07:22

answered Feb 04 '17 at 14:14

Bohemian

412,405
93
575
722

I've added an example at the bottom of my question. – Florian Feb 06 '17 at 06:31
Your answer just selects all names matching the pattern and order them by the edit_distance. I want to parts of the pattern which are not `*` to be used for fuzzy searching. Your pattern matches them exactly. See example above: `OKEN*IST*` should also match `OWEN TRISTAN`. – Florian Feb 06 '17 at 06:39
Why should `OKEN*IST*` match `OWEN TRISTAN`? What logic makes a `K` match a `W`? I could understand `O*EN*IST*` matching, but not when the user has specified the `K`. – Bohemian Feb 06 '17 at 06:44
When searching names in a front office, it is often quite hard to understand the name correctly (especially if it comes to turkish, ... names). Therefor the fuzzy search. – Florian Feb 06 '17 at 06:46
But you need to define exactly what "fuzzy" means. Again I ask, what logic makes a `K` match a `W`. And if `K` matches `W`, why shouldn't `SMITH` match `JONES` then? – Bohemian Feb 06 '17 at 06:48
Because `SMITH` and `JONES` would have a edit_distance of 0%. I would like to select lets say the top 10 results ordered by the edit_distance. If posted my current test-query for better understanding. – Florian Feb 06 '17 at 06:54
Try the new version of the query – Bohemian Feb 06 '17 at 07:16
Looks good so far. Is it possible to include some kind of value which represents the "score" or edit_distance? Is it possible to increase the Performance by adding a index? – Florian Feb 06 '17 at 07:58
The "score" could be a version of the order by term (try `200 - `), but without a clear definition of what "score" means (with examples) it's hard to say. It is unlikely that an index would help, because edit distance must be run over all values, but try putting one on the name field and see what happens... especially try putting a *covering* index on the name column plus those columns returned from your search query. – Bohemian Feb 06 '17 at 08:08
Ok, I'll try that. One (hopefully) last question: What would be the correct way for selecting on 2 or more columns (let's say firstname and lastname)? – Florian Feb 06 '17 at 08:11
I would use `firstname || ' ' || lastname` where I have coded `name`, that way `JOHN S*` will be a better match than `JOHNS*` when trying to match `JOHN SMITH` rather than `JOHNSON`, but try a few test cases with real data to see what feels like the best result for the user's search intention. Ask users what they would expect when searching with specific search terms – Bohemian Feb 06 '17 at 08:18
So you would concatenate the first- and lastname into one string and compare it to the selection Input? The problem is, that users can explicitly search for a firstname, a lastname or both. – Florian Feb 06 '17 at 08:56
@FlorianR. Searching against the concatenation of both name parts will return hits with expected precedence for all search term styles (ie just first/last or all at once). Try it and see for yourself. – Bohemian Feb 06 '17 at 16:19

Oracle fuzzy text search with wildcards

2 Answers2