0

First of all and before detailing the problem I'm dealing with let me tell you that I'm currently an SQL-newbie so whenever it's possible I'll appreciate plain explanations and simple solutions. Here's what I have:

Given this query:

        SELECT 
        table1.id as id, 
        table1.tag1 as tag1, 
        table2.tag2 as tag2,
        table2.tag2 like '%'+table1.tag1'%' as match


        FROM table1
        INNER JOIN table2
        ON table1.id = table2.id

I'm getting this table:

id     tag1         tag2           match
1      ice cream    ice-cream      false
2      sweets       sweets         true
3      bakery       bakery         true
4      sweets       ice-cream      false

The problem I want to solve is that I'd want "match" column to interpret as "true" similar words as those from the first row. Therefore in my desired output I'd like this mentioned cell to be "true" instead of "false".

Thanks in advance.

teogj
  • 289
  • 1
  • 11

3 Answers3

2

"Similar" can be estimated in a fair few ways. Some methods are...elaborate. A good place to start is with "edit distance." This is also named "Levenshtein distance" after its creator. The idea is pretty easy to understand, and the results make sense. (It also sounds pretty much like what you're asking for.) While there are variations, the basic idea is to count up how many characters you need to change to convert one string into another. So "ice cream" to "ice-cream" requires one change. That's close. "ice cream" to "nice dream" takes a lot more changes. You can look up the algorithm and find plenty about it with good examples. Closer to home, spell-checkers have traditionally had this algorithm in their bag of tricks. That's one way that they can suggest similar words with different roots.

Levenshtein is not enabled in Postgres by default, but it is included in a standard extension named "fuzzystrmatch":

https://www.postgresql.org/docs/current/fuzzystrmatch.html

That extension also includes some "phonetic" matching algorithms, which aren't so much what it sounds like you're after. Depending on how you're deploying, there's another extension with a bunch of fuzzy string matching tools but, honestly, I'd start with Levenshtein anyway.

https://github.com/eulerto/pg_similarity

If you end up on RDS, pg_similarity is supported.

Other suggestions you'll likely hear include LIKE, regex, and trigrams (great!, but a bit more involved).

Fuzzy string matching is a big subject, and super interesting. If you pursue this further, it will help to know what kind of record counts you're dealing with, how long your strings are (shorter string are harder to fuzzy compare as there's not as much to work with), your Postges version, etc.

Morris de Oryx
  • 1,857
  • 10
  • 28
0

You need to decide exactly what you want as a match. Let me assume that a space can match any character. Then use:

table2.tag2 like '%' + replace(table1.tag1, ' ', '_') + '%' as match

Alternatively, you might want to remove all spaces and hyphens for the comparison:

replace(replace(table2.tag2, ' ', ''), '-', '') like '%' + replace(replace(table1.tag1, ' ', ''), '-', '') + '%'
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

@Morris'es answer is good for Postgres. In you are on Refshift you can create a Python UDF for fuzzy match that takes 2 strings as input and returns either a binary judgement or some measure of similarity between these strings. Here is a good example of Levenshtein algorithm implementation with Python UDF: Periscope community thread

The function returns the string "distance" between two words (how many characters are different).

You can use the output as levenshtein(table2.tag2,table1.tag1)<=1 as match

AlexYes
  • 4,088
  • 2
  • 15
  • 23