2

I'm trying to perform a join over two tables that contain info about the same companies, but sometimes the companies are stored with slightly different names (e.g. table 1: Company X -> Table 2: Company X and Friends). My idea was to full join each row of table 1 with each row of table 2 and then using ~ to filter (e.g. where name1 ~ name2 or name2 ~ name1), but that's not possible.

Do someone has a solution to this matching problem? Thanks!

3 Answers3

4

You can use the "<->" operator or the '%' operator provided by pg_trgm to do fuzzy string matching.

You have to make decisions like what similarity cutoff you are willing to use for '%' (return no matches if the best match is below that) and if you want to return only the top match (lateral joins are good for this) or possible more than one.

It would likely be best to use this to clean your data, rather than incorporating it into a production app directly.

You might also need to strip out nuisance words. "Company X and friends" will be more similar to "Company Y and friends" than it will be to "Company X limited".

SELECT t1.*, t2.*, 1- (t1.name <-> t2.name) as similarity
FROM table1 t1
INNER JOIN table2 t2
    ON t1.name % t2.name
order by t1.name <-> t2.name;
jjanes
  • 37,812
  • 5
  • 27
  • 34
1

You could join on the condition that either name field be a substring of the other's name field:

SELECT t1.*, t2.*
FROM table1 t1
INNER JOIN table2 t2
    ON t1.name LIKE '%' || t2.name || '%' OR
       t2.name LIKE '%' || t1.name || '%';

This approach does not even require regex. We could use regex here, if we wanted to ensure that one table's name only appears as a substring of the other's name and is also a word. But, maybe you don't even need to do this.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
1

You could

CREATE EXTENSION pg_trgm;

and use the similarity operator it provides in the join condition:

t1 JOIN t2 ON t1.name % t2.name
Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263