Fuzzy text searching in Oracle

Question

I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.

Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.

What do you suggest me in this case, does Oracle have support for such kind of fuzzy text matching queries?

score 11 · Accepted Answer · answered Aug 15 '14 at 07:00

UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein Distance, might be a good place to start. Since one string is a substring it may help to compare the edit distance relative to the size of the strings.

--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
    --Rank edit ratios.
    select substring, address, edit_ratio
        ,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
    from
    (
        --Calculate edit ratio - edit distance relative to string sizes.
        select
            substring,
            address,
            (length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
        from
        (
            --Fake addreses (from http://names.igopaygo.com/street/north_american_address)
            select '526 Burning Hill Big Beaver District of Columbia 20041'   address from dual union all
            select '5206 Hidden Rise Whitebead Michigan 48426'                address from dual union all
            select '2714 Noble Drive Milk River Michigan 48770'               address from dual union all
            select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
            select '968 Iron Corner Wacker Arkansas 72793'                    address from dual
        ) addresses
        cross join
        (
            --Address substrings.
            select 'Michigan'           substring from dual union all
            select 'Not-So-Hidden Rise' substring from dual union all
            select '123 Fake Street'    substring from dual
        )
        order by substring, edit_ratio desc
    )
)
where edit_ratio_rank = 1
order by substring, address;

These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.

SUBSTRING          ADDRESS                                                  EDIT_RATIO
---------          -------                                                  ----------
123 Fake Street    526 Burning Hill Big Beaver District of Columbia 20041   0.5333
Michigan           2714 Noble Drive Milk River Michigan 48770               1
Michigan           5206 Hidden Rise Whitebead Michigan 48426                1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426                0.5

I found this EDIT_DISTANCE_SIMILARITY and JARO_WINKLER_DISTANCE similarity functions already and it seems they do exactly what I want. Thanks for your answer. — Ufuk Can Bicici, Aug 15 '14 at 12:40

Axel Kemper · Answer 2 · 2014-08-12T09:52:37.557

You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a numeric signature of a text string. This can be used to find strings which sound similar and thus reduce the number of string comparisons.

Edited: If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.

Example: A Turkish SOUNDEX is promoted here.

To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:

Simplified example rules:

Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...

The problem is SOUNDEX seems to be built for English language. My data is not in English. — Ufuk Can Bicici, Aug 12 '14 at 09:26

Fuzzy text searching in Oracle

2 Answers2