0

I wanted to combine PostgreSQL Levenshtein and trigram similarity functions. The main advantage of the trigram similarity function is that it can utilize GIN or GIST indexes and thus can return fuzzy match results quickly. However, if it is called inside another function, it does not use indexes. For sake of this problem illustration, here is a plpgsql function "trigram_similarity" that calls original trigram's "similarity" function.

CREATE OR REPLACE FUNCTION public.trigram_similarity(
    left_string text,
    right_string text)
  RETURNS real AS
$BODY$
BEGIN
  RETURN similarity(left_string, right_string);
END;$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT
  COST 100;
ALTER FUNCTION public.trigram_similarity(text, text)
  OWNER TO postgres;

Although the function actually just calls the trigram's similarity function, it behaves completely different when it comes to GIN indexes utilization. While the original trigram's similarity function inside WHERE clause of a query does utilize GIN indexes and thus a query returns result quickly and without much RAM consumption, when using trigram_similarity it does not. For large datasets fuzzy match analysis, the RAM is completely used and application freezes... For sake of illustration, here is an example query:

SELECT DISTINCT  
trigram_similarity(l.l_composite_18, r.r_composite_18)
::numeric(5,4) AS trigram_similarity_composite_score 
, (trigram_similarity(l."Name", r."Name")*(0.166666666666667) 
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333) 
+ trigram_similarity(l."County", r."District")*(0.0416666666666667) 
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667) 
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667) 
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667) 
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667) 
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667) 
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667) 
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667) 
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333) 
)
::numeric(5,4) AS trigram_similarity_weighted_score  
, l."ClanID" AS "l_ClanID_1"
, l."Name" AS "l_Name_2"
, l."LastName" AS "l_LastName_3"
, l."County" AS "l_County_4"
, l."Town" AS "l_Town_5"
, l."PostalCode" AS "l_PostalCode_6"
, l."PostOffice" AS "l_PostOffice_7"
, l."Street" AS "l_Street_8"
, l."Number" AS "l_Number_9"
, l."Telephone1" AS "l_Telephone1_10"
, l."Telephone2" AS "l_Telephone2_11"
, l."EMail" AS "l_EMail_12"
, l."BirthDate" AS "l_BirthDate_13"
, l."Gender" AS "l_Gender_14"
, l."Aktivan" AS "l_Aktivan_15"
, l."ProgramCode" AS "l_ProgramCode_16"
, l."Card" AS "l_Card_17"
, l."DateOfCreation" AS "l_DateOfCreation_18"
, l."Assigned" AS "l_Assigned_19"
, l."Reserved" AS "l_Reserved_20"
, l."Sent" AS "l_Sent_21"
, l."MemberOfBothPrograms" AS "l_MemberOfBothPrograms_22"
, r."ClanID" AS "r_ClanID_23"
, r."Name" AS "r_Name_24"
, r."Surname" AS "r_Surname_25"
, r."District" AS "r_District_26"
, r."Location" AS "r_Location_27"
, r."ZipCode" AS "r_ZipCode_28"
, r."PostOffice" AS "r_PostOffice_29"
, r."Road" AS "r_Road_30"
, r."HomeNumber" AS "r_HomeNumber_31"
, r."Phone1" AS "r_Phone1_32"
, r."Phone2" AS "r_Phone2_33"
, r."EMail" AS "r_EMail_34"
, r."DateOfBirth" AS "r_DateOfBirth_35"
, r."Sex" AS "r_Sex_36"
, r."Active" AS "r_Active_37"
, r."ProgramCode" AS "r_ProgramCode_38"
, r."CardNo" AS "r_CardNo_39"
, r."CreationDate" AS "r_CreationDate_40"
, r."Assigned" AS "r_Assigned_41"
, r."Reserved" AS "r_Reserved_42"
, r."Sent" AS "r_Sent_43"
, r."BothPrograms" AS "r_BothPrograms_44"
 FROM  "l_leftdatasetexample3214274191" AS l, "r_rightdatasetexample3214274191" AS r
 WHERE  
((l."Gender"=r."Sex") AND (l."Card"<>r."CardNo") AND (l."Name"=r."Name"))
 AND 
((l."ProgramCode"= '1') AND (r."ProgramCode"= '1'))
 AND 
(((l.l_composite_18 % r.r_composite_18)
)
 OR (((l."Name" % r."Name")
OR (l."LastName" % r."Surname")
OR (l."County" % r."District")
OR (l."Town" % r."Location")
OR (l."PostalCode" % r."ZipCode")
OR (l."PostOffice" % r."PostOffice")
OR (l."Street" % r."Road")
OR (l."Number" % r."HomeNumber")
OR (l."Telephone1" % r."Phone1")
OR (l."Telephone2" % r."Phone2")
OR (l."EMail" % r."EMail")
OR (l."BirthDate" % r."DateOfBirth")
OR (l."Gender" % r."Sex"))
)
) AND ((trigram_similarity(l.l_composite_18, r.r_composite_18)
 >= 0.7) 
 OR ((trigram_similarity(l."Name", r."Name")*(0.166666666666667) 
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333) 
+ trigram_similarity(l."County", r."District")*(0.0416666666666667) 
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667) 
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667) 
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667) 
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667) 
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667) 
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667) 
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667) 
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333) 
)
 >= 0.7)
 ) ORDER  BY trigram_similarity_composite_score DESC;

This query causes RAM clottage and application freezes. When "trigram_similarity" is replaced with "similarity", the query executes fast and without RAM overconsumption. Why "trigram_similarity" and "similarity" behave differently? Is there a way I could force GIN or GIST indexes utilization for this "trigram_similarity" function or any other function calling trigram's similarity function inside?

Explain analyze when "similarity" is used:

"Unique  (cost=170717.94..177633.17 rows=58853 width=383) (actual time=260362.193..260362.279 rows=99 loops=1)"
"  Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text) * '0 (...)"
"  Buffers: shared hit=2513871 read=4158"
"  ->  Sort  (cost=170717.94..170865.07 rows=58853 width=383) (actual time=260362.192..260362.198 rows=99 loops=1)"
"        Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text (...)"
"        Sort Key: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)) DESC, (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname" (...)"
"        Sort Method: quicksort  Memory: 76kB"
"        Buffers: shared hit=2513871 read=4158"
"        ->  Nested Loop  (cost=0.29..155793.36 rows=58853 width=383) (actual time=1851.503..260361.609 rows=99 loops=1)"
"              Output: (similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4), ((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::t (...)"
"              Buffers: shared hit=2513871 read=4158"
"              ->  Seq Scan on public.r_rightdatasetexample3214274191 r  (cost=0.00..11228.86 rows=101669 width=188) (actual time=9.149..67.134 rows=50837 loops=1)"
"                    Output: r."ClanID", r."Name", r."Surname", r."District", r."Location", r."ZipCode", r."PostOffice", r."Road", r."HomeNumber", r."Phone1", r."Phone2", r."EMail", r."DateOfBirth", r."Sex", r."Active", r."ProgramCode", r."CardNo", r."Creat (...)"
"                    Filter: ((r."ProgramCode")::text = '1'::text)"
"                    Buffers: shared hit=5800 read=4158"
"              ->  Index Scan using "idxbNameA8D72F00099E4B70885B2E0BB1DFB684l_leftdatasetexample321" on public.l_leftdatasetexample3214274191 l  (cost=0.29..1.35 rows=1 width=195) (actual time=5.111..5.119 rows=0 loops=50837)"
"                    Output: l."ClanID", l."Name", l."LastName", l."County", l."Town", l."PostalCode", l."PostOffice", l."Street", l."Number", l."Telephone1", l."Telephone2", l."EMail", l."BirthDate", l."Gender", l."Aktivan", l."ProgramCode", l."Card", l."D (...)"
"                    Index Cond: ((l."Name")::text = (r."Name")::text)"
"                    Filter: (((l."ProgramCode")::text = '1'::text) AND ((l."Card")::text <> (r."CardNo")::text) AND ((r."Sex")::text = (l."Gender")::text) AND (((l.l_composite_18)::text % (r.r_composite_18)::text) OR ((l."Name")::text % (r."Name")::text) O (...)"
"                    Rows Removed by Filter: 50"
"                    Buffers: shared hit=2508071"
"Planning time: 13.885 ms"
"Execution time: 260362.730 ms"
zlatko
  • 596
  • 1
  • 6
  • 23
  • 1
    Can you show us the `EXPLAIN ANALYZE` results when you run your query with the plain `similarity()` function? In my (smaller) test case it doesn't trigger the index usage, only the `%`, `<->` and `~~` (like) operators do. – pozs May 04 '16 at 15:42
  • Hi, I added explain analyze text in the main message. – zlatko May 05 '16 at 07:36
  • That's more interesting, `similarity()` is not used in the index (as expected), just the `%` operator (for the `OR (((l."Name" % r."Name") ...` WHERE part), so the statement with your custom function should use it as well. – pozs May 05 '16 at 08:31
  • But still the other query ("similarity" replaced by "trigram_similarity") behaves completely different in real life. Unfortunately, I can't provide explain anlayze text since application freezes :-( – zlatko May 05 '16 at 09:17
  • 1
    I think I have found the reason - when I change "COST 100;" to "COST 1" the other query works fine as well. – zlatko May 05 '16 at 10:34

2 Answers2

1

The indices are created on table columns. You need to modify your plpgsql function to query a GIN- or GIST-indexed, table column rather than comparing two, string literals. If you compare two, string literals the plugin has no index to hit and must decompose both strings into their trigrams before comparing them, which is your problem.

http://www.postgresql.org/docs/9.1/static/pgtrgm.html

Dylon
  • 1,730
  • 15
  • 14
  • Sorry, but I don't understand what you mean? I do query columns, not literals, and for all the columns included in the query, there are GIN indexes set. Can you explain what is wrong with the query above? – zlatko May 12 '16 at 15:32
  • Postgres passes the literal text from the columns to your function, which has no context on their sources. Since your function does not know where its parameters come from, it does not know how to utilize your indices (all it knows is that it needs to return the distance between two, string literals). – Dylon May 12 '16 at 16:18
  • Hi, Dylon, could you give an example? Or suggest how to modify the above specified trigram_similarity(left_string text, right_string text) RETURNS real function? Thanks. – zlatko May 13 '16 at 06:48
  • You could modify `trigram_similarity` to accept instances of table rows, and play around with accessing the rows' column values by interpolating the columns' names, but that would be really messy and I'm not sure how strongly Postgres would utilize the indices without hints. Why can't you use the `similarity` function instead of `trigram_similarity`? – Dylon May 13 '16 at 09:22
  • similarity function has same parameters signature: CREATE OR REPLACE FUNCTION public.similarity( text, text) RETURNS real AS '$libdir/pg_trgm', 'similarity' LANGUAGE c IMMUTABLE STRICT COST 1; ALTER FUNCTION public.similarity(text, text) OWNER TO postgres; – zlatko May 13 '16 at 09:24
  • This feels like an interview question, lol. Why is it necessary to replace `similarity` with `trigram_similarity`, which merely proxies to the former? As it stands, the function has no idea where the parameters came from and thus no clue how to utilize the original columns' indices. If you see the text, "Jerry", can you tell me (without further context) whether it's a street, name, town, or county? If not, then which index should you hit? I can think of some very hacky solutions, but nothing cleaner or more efficient than just using `similarity`. – Dylon May 13 '16 at 09:42
  • Are you saying that because `similarity` has the same signature as `trigram_similarity`, then the latter should be able to utilize indices as well? That solution lies in the underlying implementation of `similarity`: http://doxygen.postgresql.org/dir_fa13a57cd5631a1c7195dac44ee371e6.html – Dylon May 13 '16 at 09:53
  • The author of the `pg_trgm` extension has written `similarity` in C using Posgres' C API, which gives him access to a lot of information about where the parameters originate. – Dylon May 13 '16 at 09:55
0

It is possible to create GIN trgm_ops expression index on the compound (concatenated) expression. This index can faciliate % similarity operator, but not the similarity function.

zlatko
  • 596
  • 1
  • 6
  • 23