-1

I want to preface this by saying I am new to SQL. I have been tasked with trying to find the fuzzy match score of two columns. This will help my team discover which data is still relevant and which data is outdated (as the fuzzy match will not return a high score since the data is not in the second column).

Now for the problem. I have two tables. Let's call them Table_A and Table_B. Table_A has a column called combo_field that looks like this.

combo_field
test1
example2
field3

Table_B has a similar column named combo, which looks like this.

combo
EXAMPLE3
TEST1
FIELD2

I want my output to look something like this.

fuzzy match score combo_field combo
.1 test1 TEST1
.2 example2 EXAMPLE2
.1 field3 FIELD3

Is it possible to both sort the combo_field and combo columns so that the values are lined up correctly, and then also return a fuzzy match score? I'm not sure if this can be done together, if it will take two steps to solve, or if it can be done at all. Any guidance would be much appreciated.

EDIT:

I brought the two combo fields I had in SQL server to Microsoft Excel. I downloaded fuzzy match through Microsoft Excel to match my two columns up and generate a fuzzy match score. It both sorted the columns to line up similar rows and it gave me a fuzzy match score between 1 and 0.

  • You don't tell us enough to answer -- what other fields do these two tables have (is there an id that would allow you to join them?) how is the fuzzy match calculated. There is a lot here that you are assuming we understand but you have not explained. – Hogan Jun 01 '22 at 22:59
  • this is not really an SQL problem. SQL could give you a Cartesian product of all values in the first column joined to all values in the second column, but so could any other programming language really. calculating the match score involves some algorithm or function on each pair – Kurt Jun 01 '22 at 23:04

1 Answers1

0

There's nothing you can do with sorting to help here. You simply have to generate every pair of values from both columns (Cartesian join), run the score function on every pair, and filter out the ones below your desired threshold. Something like:

with
    scores as (
        select
            score(a.combo_field, b.combo) as fuzzy_match_score,
            a.combo_field,
            b.combo
        from
            table_a a
            cross join
            table_b b
    )
select
    *
from
    scores
where
    fuzzy_match_score >= .1

The caveat here is performance if the tables are very large and/or the score function is slow. The exact syntax for a "cross join" might vary by dbms also.

As the other commenter also pointed out, if there are any other columns available to join the tables together, that would be more efficient than the Cartesian.

Kurt
  • 1,708
  • 1
  • 3
  • 11