This is a problem that comes up often in my work with various data sets, so please excuse me for presenting it in general terms rather than using a specific example.
I often need to get records from a large table (generally 100s of millions of rows) where a text column is similar to a column in a much smaller table (10s to 100s of rows). My current approach is as follows, where `targets` is the smaller table and `matches` is the larger one:
```sql
set pg_trgm.similarity_threshold = 0.9;

select *
from targets as t
inner join matches as m on
    t.name % m.name;
```
`matches.name` will have a GIN index and will generally have a relatively high degree of uniqueness, with perhaps 10-20% of records being duplicates. Both `matches.name` and `targets.name` are almost always less than 50 characters long, and often much shorter.
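For context, the GIN index is created along these lines (the index name is just illustrative):

```sql
-- pg_trgm provides the % similarity operator and the trigram operator classes
create extension if not exists pg_trgm;

-- GIN trigram index on the larger table, used by the % join above
create index matches_name_trgm_gin on matches using gin (name gin_trgm_ops);
```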
As I understand it, this is a slightly unusual use-case: the Postgres documentation and most SO answers seem to focus on optimising for matching against a single value. So I'd be keen to hear thoughts on two questions:
- In very general terms (tens of minutes, hours, etc.), and assuming the database is configured optimally, what's a reasonable performance target for this kind of query, given, say, 300 targets and 300 million potential matches?
- Is the strategy I'm using at the moment the most efficient one given these parameters? Would it be worth trying a GiST index and taking the top n matches for each row using the `<->` operator instead, for example (roughly along the lines of the sketch below)? Are there completely different approaches that could be more efficient?
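To make that second option concrete, here's a minimal sketch of what I have in mind, assuming a GiST trigram index on `matches.name` and an arbitrary top-5 cutoff (the index name and the limit are just placeholders):

```sql
-- GiST trigram index to support nearest-neighbour ordering with <->
create index matches_name_trgm_gist on matches using gist (name gist_trgm_ops);

-- For each target, take the n closest matches by trigram distance
select t.name as target_name,
       m.name as match_name,
       t.name <-> m.name as distance
from targets as t
cross join lateral (
    select name
    from matches
    order by name <-> t.name  -- KNN ordering the GiST index can satisfy
    limit 5                   -- placeholder top-n
) as m;
```

The appeal is that the `order by ... limit` pattern can, in principle, be satisfied by an index-ordered scan per target, but I don't know how that compares in practice to the GIN/`%` approach at this scale.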
Thanks in advance for your help!