0

How would one go about doing a fuzzy name search in Oracle?

For example:

Our data system has the preferred mailing as:

Mr. Nicolas Jim Raleigh

But in Facebook, or other search field, the name passed to the algorithm is:

Nick Jim Raleigh

The process would run the search name against all of the preferred names, then return the result that contains the most matching characters:

Mr. [Nic]olas Jim Raleigh

[Nic]k Jim Raleigh

16 out of my searched name's 17 characters appear in the preferred name, and we could return a ranked suggestion.

[EDITED TO ADD]

After initial suggestion, and reading of Oracle's Text Query options I have created an index on the table

create index ADD_EX_INDX3 on address_extract(pref_mail_name) 
  indextype is ctxsys.context 
  parameters ('DATASTORE CTXSYS.DEFAULT_DATASTORE');

and now am able sucessfully retireve

select score(1), ae.pref_mail_name
from address_extract ae
 where contains(pref_mail_name,'fuzzy(raleigh,,,weight)',1) > 0
order by score(1) desck

Which returns

100 Mr. Raleigh H. Jameson
100 Mr. Nicolas Jim Raleigh
100 Ms. Susanne M. Raleigh
66  Mrs. LaReign Smith
66  Ms. Rahil Smith
62  Mr. Smith  Ragalie

I am struggling to to a full name search however. How would I go about doing the full name?

APC
  • 144,005
  • 19
  • 170
  • 281
Lloyd
  • 1,395
  • 6
  • 20
  • 37
  • 1
    This looks like you've made no effort on your own at all. Searching for "fuzzy text search oracle" came up immediately with the [contains](http://docs.oracle.com/cd/B13789_01/text.101/b10730/cqoper.htm) operator and the second indicated a more [general overview](http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm); linked in the docs is a description of a [ctxcat index](http://docs.oracle.com/cd/B28359_01/text.111/b28303/ind.htm#i1007263), which you'll need. Have you attempted any of this? What problems did you have in doing so? – Ben Jan 11 '13 at 17:12
  • Thanks for the prompt response. My understanding of "contains" is that my example will not work. Since 'Nick Balcom Raleigh' is not contained in 'Nicolas Balcom Raleigh'. You are correct, I perhapse did not understand the question I needed to ask. I am making headway with looking into ' Levenshtein distance formula' – Lloyd Jan 11 '13 at 17:24
  • I am hesitant to ask after your first condescending response, but I am really struggling to get the fuzzy search to work with the multiple terms. I am going to clarify [edit] my question and see if I can get some pointers. – Lloyd Jan 11 '13 at 18:03
  • possible duplicate of [Matching Oracle duplicate column values using Soundex, Jaro Winkler and Edit Distance (UTL\_MATCH)](http://stackoverflow.com/questions/8229436/matching-oracle-duplicate-column-values-using-soundex-jaro-winkler-and-edit-dis) – APC Jan 11 '13 at 23:35
  • Hi Lloyd, I'm sorry if I came across as condescending; it wasn't intentional. I was in a hurry to dump as much info into a comment as possible. I did want to point out how important it is to _try_ on your own; it stops users from duplicating your own efforts and encourages people to help. I came back to point you to UTL_MATCH and Jaro Winkler but I see @APC has done that already. The problem with these is indexes... There are two suggestions I have really. One is to create a list of "nicknames" i.e. Nick to Nicholas, and then you can search for both. – Ben Jan 12 '13 at 12:17
  • The second is to look at Lucerne and Solr, though they may be totally pointless from your perspective it depends on exactly what you are doing. – Ben Jan 12 '13 at 12:18

1 Answers1

5

Name matching is hard. Oracle's Text indexing supports fuzzy matching and stemming, which is a start, but consider these names:

  • Nicholas Raleigh
  • Nihcolas Raleigh
  • Nico Raleigh
  • Nik Raleigh
  • Nicky Raleigh
  • Nick Raleigh
  • Nikolaus Raleigh
  • Nicola Raleigh
  • Nikki Raleigh
  • Nikola Raleigh
  • Nikolai Raleigh
  • Nikolaj Raleigh

Attempting to match those through abstractions, be it Levenshtein Distance or Double Metaphone, is going to generate false positives and false negatives. This is the nature of abstraction. The best way to get a focused and accurate result set is with a thesaurus (and even this isn't perfect). Unfortunately, assembling a comprehensive thesaurus of names is a gigantic undertaking; to get a sense of the task check out the stats on the NameX site.


Update: Oracle 11gR2 includes an extension to Oracle Text tailored to name searching. This is highly neat, and definitely the first place to start. Find out more.

APC
  • 144,005
  • 19
  • 170
  • 281