TSQL Fuzzy address matching grouping, 2019 Edition

Question

I have this situation where people asked to group on bad addresses. And I need to work on the tools/env I have, I don't have choice for Google API or 3rd party Data Science tools. I also did my HW, see posts several years old, so still want to check all if any updates available. In my scenario people want to group IDs 1-6 into single, rest I added for neg test.

SELECT * INTO #t FROM ( --test data:   select * from #t          drop table #t
    SELECT 1 Id,  '1 CROLANA HEIGHTS' Adr UNION   -- A vs O
    SELECT 2 Id, '1 CROLONA HEIGHTS' Adr union
    SELECT 3 Id, '1 CROLONA HEIGHT DRIVE' Adr union
    SELECT 4 Id,'1 CROLONA HEIGHTS DR' Adr union
    SELECT 5 Id, '1 CROLONA HGHTS DR' Adr union
    SELECT 6 Id, '1 CROLONA HTS DR' Adr  UNION
    ---------------------------------------- rest should not match
    SELECT 7 Id, '1 CORWING DR' Adr  UNION  
    SELECT 8 Id, '1 SUNNYHILL DRIVE' Adr  UNION
    SELECT 9 Id, '1 CROWN HILL DR' Adr  UNION
    SELECT 10 Id, '1 ADDISON DRv' Adr  ) a 
-------------------  and below is my fuzzy working script which can be improved)



SELECT  id,   adr, LEAD(adr,1) OVER ( ORDER BY adr ) adr_lead,
        SOUNDEX(adr) Sdx,  DIFFERENCE(adr, LEAD(adr,1) OVER ( ORDER BY adr )) diff
    ---     SOUNDEX(adr), COUNT(*) c
FROM #t 
--GROUP BY  SOUNDEX(adr)
WHERE SOUNDEX(adr) = SOUNDEX('1 CROLANA HEIGHTS')

[`Soundex`](https://en.wikipedia.org/wiki/Soundex) is designed for matching names. Used with strings that start with a digit the value returned will always be `'0000'` and the `Difference` will be `4`. [This](https://stackoverflow.com/questions/55787592/difference-function/55790300#55790300) question is rather similar and quite recent. — HABO, Apr 26 '19 at 01:03
One approach might be first to use several `REPLACE` to change all variants of `HEIGHTS` to just one, same for `DRIVE`. It will take some hours to get through this mess, but I'm pretty sure, that this would at least reduce the problem a lot. For a fuzzy search I'd suggest to split this in fragments and compare them one by one. — Shnugo, Apr 26 '19 at 08:34

Mike S · Answer 1 · 2019-04-26T13:48:49.860

0

There is suggestions which I gladly take. I'm using intell replace at the end of string and standalone words to improve data.

    DECLARE @st VARCHAR(100) = 'La_Beg_10 La_midleMacy La'  --replace et the end of string

SELECT 'ryba', @st, '-->' f, CASE WHEN @st LIKE '%' + ' La' 
      THEN SUBSTRING(@st,1,LEN(@st) - LEN('La')) + 'Lane' ELSE @st END N

edited Apr 26 '19 at 13:48

answered Apr 26 '19 at 13:46

Mike S

296
2
14

This post doesn't look like an attempt to answer this question. Every post here is expected to be an explicit attempt to *answer* this question; if you have a critique or need a clarification of the question or another answer, you can [post a comment](//stackoverflow.com/help/privileges/comment) (like this one) directly below it. Please remove this answer and create either a comment or a new question. See: [Ask questions, get answers, no distractions](//stackoverflow.com/tour) – double-beep Apr 26 '19 at 13:47
There is no solution for this question, so it's just closed – Mike S Apr 26 '19 at 19:48

TSQL Fuzzy address matching grouping, 2019 Edition

1 Answers1