16

I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.

Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.

What do you suggest me in this case, does Oracle have support for such kind of fuzzy text matching queries?

Ufuk Can Bicici
  • 3,589
  • 4
  • 28
  • 57

2 Answers2

11

UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein Distance, might be a good place to start. Since one string is a substring it may help to compare the edit distance relative to the size of the strings.

--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
    --Rank edit ratios.
    select substring, address, edit_ratio
        ,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
    from
    (
        --Calculate edit ratio - edit distance relative to string sizes.
        select
            substring,
            address,
            (length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
        from
        (
            --Fake addreses (from http://names.igopaygo.com/street/north_american_address)
            select '526 Burning Hill Big Beaver District of Columbia 20041'   address from dual union all
            select '5206 Hidden Rise Whitebead Michigan 48426'                address from dual union all
            select '2714 Noble Drive Milk River Michigan 48770'               address from dual union all
            select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
            select '968 Iron Corner Wacker Arkansas 72793'                    address from dual
        ) addresses
        cross join
        (
            --Address substrings.
            select 'Michigan'           substring from dual union all
            select 'Not-So-Hidden Rise' substring from dual union all
            select '123 Fake Street'    substring from dual
        )
        order by substring, edit_ratio desc
    )
)
where edit_ratio_rank = 1
order by substring, address;

These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.

SUBSTRING          ADDRESS                                                  EDIT_RATIO
---------          -------                                                  ----------
123 Fake Street    526 Burning Hill Big Beaver District of Columbia 20041   0.5333
Michigan           2714 Noble Drive Milk River Michigan 48770               1
Michigan           5206 Hidden Rise Whitebead Michigan 48426                1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426                0.5
Jon Heller
  • 34,999
  • 6
  • 74
  • 132
  • I found this EDIT_DISTANCE_SIMILARITY and JARO_WINKLER_DISTANCE similarity functions already and it seems they do exactly what I want. Thanks for your answer. – Ufuk Can Bicici Aug 15 '14 at 12:40
5

You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a numeric signature of a text string. This can be used to find strings which sound similar and thus reduce the number of string comparisons.

Edited: If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.

Example: A Turkish SOUNDEX is promoted here.

To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:

Simplified example rules:

  1. Convert all characters to lowercase
  2. Remove "str." at the end of a name
  3. Remove "drv." at the end of a name
  4. Remove "place" at the end of a name
  5. Remove "ave." at the end of a name
  6. Sort names with multiple words alphabetically
  7. Drop auxiliary words like "of", "and", "the", ...
Axel Kemper
  • 10,544
  • 2
  • 31
  • 54