2

I have a table with approximately 7 million records. The table has a first_name and last_name column which I want to search on using the levenshtein() distance function.

select levenshtein('JOHN', first_name) as fn_distance,
       levenshtein('DOE', last_name) as ln_distance,
       id,
       first_name as "firstName",
       last_name as "lastName"
  from person
 where first_name is not null
   and last_name is not null
   and levenshtein('JOHN', first_name) <= 2
   and levenshtein('DOE', last_name) <= 2
 order by 1, 2
 limit 50;

The above search is slow (4 - 5 secs), what can I do to improve performance? Should a create indexes on the two columns, or something else?

After I added indexes below:

create index first_name_idx on person using gin (first_name gin_trgm_ops);

create index last_name_idx on person using gin(last_name gin_trgm_ops);

The query now takes ~11 secs. :(

New query:

select similarity('JOHN', first_name) as fnsimilarity,
       similarity('DOW', last_name) as lnsimilarity,
       first_name as "firstName",
       last_name as "lastName",
       npi
  from person
 where first_name is not null
   and last_name is not null
   and similarity('JOHN', first_name) >= 0.2
   and similarity('DOW', last_name) >= 0.2
 order by 1 desc, 2 desc, npi
 limit 50;
user1452701
  • 144
  • 1
  • 10
  • 1
    The `levenstein()` function has two input string parameters, one may correspond to a column of the table `person`, either `first_name` or `last_name`, the other one correspond to the value you want to calculate the distance against. This value is not a constant or a parameter known when the index is calculated at the insert or update events on the table `person`. So I don't see how you can use this function in an index. – Edouard Jan 04 '22 at 21:38
  • You can consider the full text search capability instead, and which can rely on indexes, see the [manual](https://www.postgresql.org/docs/14/textsearch-intro.html) – Edouard Jan 04 '22 at 21:55

1 Answers1

3

There is no built-in index type that supports levenshtein distances. I'm not aware of any 3rd party index implementations to do so either.

Another string similarity measure, trigram similarity, does have an index method to support it. Maybe you can switch to using that measure instead.

You need to write the query using the % operator, not the similarity function. So it would look something like this:

set pg_trgm.similarity_threshold TO 0.2;
select similarity('JOHN', first_name) as fnsimilarity,
       similarity('DOW', last_name) as lnsimilarity,
       first_name as "firstName",
       last_name as "lastName",
       npi
  from person
 where first_name is not null
   and last_name is not null
   and 'JOHN' % first_name
   and 'DOW' % last_name
 order by 1, 2, npi
 limit 50;

But note that 0.2 is very low cutoff, and the lower the cutoff the less efficient the index.

jjanes
  • 37,812
  • 5
  • 27
  • 34