Poor performance of postgresql KNN queries

Question

There are two DB machines identical in structure and configurations with the characteristics as below. The databases are also identical in structure and contents. The performance of almost 99% of queryes is better on the new machine, only on few is is drastically degraded. Trying to narrow the investigation, I finally found that only one KNN (K Nearest Neighbour) query is the actual offender. Here is the relevant info about the table "maps" (about 820k rows) :

\d maps
                     Table "maps"
   Column   |       Type        | Collation | Nullable |                       
------------+-------------------+-----------+----------
 id         | integer           |           | not null | 
 strada     | character varying |           |          | 
(...)

 the_geom   | geography         |           |          | 
 concat_all | character varying |           |          | 
(...)

Indexes:
    "pk_maps" PRIMARY KEY, btree (id), tablespace "maps_idx"
    "ggx_maps_the_geom" gist (the_geom) WITH (fillfactor='100', buffering='on'), tablespace "maps_idx"
    "ix_maps_strada" hash (strada), tablespace "maps_idx"

Tablespace: "maps_data"

In the following "EXPLAIN ANALYZE" investigation I used a 'point' for test (in reality nothing else but a long irelevant string, for the sake of lizibility just replaced by 'point' and transformed by st_buffer in 'buffer')

EXPLAIN ANALYZE
  SELECT *
  FROM maps m
  WHERE st_buffer('point'::GEOGRAPHY,100::DOUBLE PRECISION) && m.the_geom
  ORDER BY m.the_geom<->'point'::GEOGRAPHY
  LIMIT 1;

The following are the actual machine responses :

(Machine#1 : 16CPU | 64GB RAM | Linux Ubuntu 16.04.7 LTS | PostgreSQL 9.6.21)

 Limit  (cost=0.29..1.96 rows=1 width=711) (actual time=1.306..1.306 rows=1 loops=1)
   ->  Index Scan using ggx_maps_the_geom on maps m  (cost=0.29..1.96 rows=1 width=711) (actual time=1.305..1.305 rows=1 loops=1)
         Index Cond: ('buffer'::geography && the_geom)
         Order By: (the_geom <-> 'point'::geography)
 Planning time: 2.977 ms
 Execution time: 1.480 ms
(6 rows)
Time: **6.885 ms**

(Machine#2 : 16CPU | 64GB RAM | Linux Ubuntu 20.04.2 LTS | PostgreSQL 13.2)

 Limit  (cost=0.29..1.96 rows=1 width=692) (actual time=66.421..66.422 rows=1 loops=1)
   ->  Index Scan using ggx_maps_the_geom on maps m  (cost=0.29..1.96 rows=1 width=692) (actual time=66.420..66.420 rows=1 loops=1)
         Index Cond: (the_geom && 'buffer'::geography)
         Order By: (the_geom <-> 'point'::geography)
 Planning Time: 20.289 ms
 Execution Time: 66.477 ms
(6 rows)
Time: **88.116 ms** (!)

As can be seen, a drastical performance degradation on the machine#2, despite the fact that the two machines are cvasi-identical in all of the aspects, including configurations as well. Does anyone have any idea how to deal with this ? What to check to get some clues ?

Please turn track_io_timing on do `explain (analyze, buffers)`. — jjanes, Apr 21 '21 at 01:24
The planning time seems very long as well as the execution time. Was this the first command issued on a new connection? — jjanes, Apr 21 '21 at 01:26
lizibility = ease of reading of this post for you : 'point' = '0101000020E610000071EA5E82AE283A401E882CD2C4374640' but I geess is not relevant in the output of "EXPLAIN ANALYZE" since this happens with ANY real 'point', no matter if the command is issued first or several consecutive times; so, the cache is in it's place since ANY OTHER query is performing identical on each machine; the problem is ONLY with THIS kind of query. — Liviu Nastase, Apr 21 '21 at 07:06
After doing SET track_io_timing = on on machine#1 : Limit (cost=0.29..1.96 rows=1 width=12) (actual time=1.164..1.164 rows=1 loops=1) Buffers: shared hit=46 -> Index Scan using ggx_maps_concatdrumlocjud_the_geom on maps_concatdrumlocjud m (cost=0.29..1.96 rows=1 width=12) (actual time=1.164..1.164 rows=1 loops=1) Index Cond: ('buffer'::geography && the_geom) Order By: (the_geom <-> 'point'::geography) Buffers: shared hit=46 Planning time: 2.422 ms Execution time: 1.330 ms (8 rows) Time: 5.681 ms — Liviu Nastase, Apr 21 '21 at 07:23
After doing SET track_io_timing = on on machine#2 : Limit (cost=0.29..1.96 rows=1 width=692) (actual time=65.767..65.768 rows=1 loops=1) Buffers: shared hit=32 -> Index Scan using ggx_maps_concatdrumlocjud_the_geom on maps_concatdrumlocjud m (cost=0.29..1.96 rows=1 width=692) (actual time=65.766..65.767 rows=1 loops=1) Index Cond: (the_geom && 'buffer'::geography) Order By: (the_geom <-> 'point'::geography) Buffers: shared hit=32 Planning: Buffers: shared hit=6 Planning Time: 18.841 ms Execution Time: 65.821 ms (10 rows) Time: 85.294 ms — Liviu Nastase, Apr 21 '21 at 07:24
Actually the planning time is not a real problem, but the huge difference between the estimated execution time and the real execution time. — Liviu Nastase, Apr 21 '21 at 07:26
Sorry, I've made a new check right now with postgis_full_version() and here are the results : Machine#1 : POSTGIS="2.4.4 r16526" PGSQL="96" GEOS="3.7.1-CAPI-1.11.1 27a5e771" PROJ="Rel. 4.9.2, 08 September 2015" GDAL="GDAL 1.11.3, released 2015/09/16" LIBXML="2.9.3" LIBJSON="0.11.99" LIBPROTOBUF="1.2.1" (core procs from "2.4.3 r16312" need upgrade) RASTER (raster procs from "2.4.3 r16312" need upgrade); Machine#2 : POSTGIS="3.1.1 aaf4c79" [EXTENSION] PGSQL="130" GEOS="3.9.0-CAPI-1.16.2" SFCGAL="1.3.7" PROJ="6.3.1" LIBXML="2.9.10" LIBJSON="0.13.1" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)"; — Liviu Nastase, Apr 21 '21 at 18:45
When installing from PGDG apt repository, I get a different geos version `GEOS="3.9.0-CAPI-1.16.2"`. And I can't reproduce the problem. I populated the table with a million st_makepoint(random()*90,random()*90), and searched with point `point(0 0)` and increasing the buffer to 100000. If you can't share your real data with us, could you reproduce my random method data on your machine and see if the problem effects that? — jjanes, Apr 21 '21 at 21:29
Ah, I think you meant 'visibility'. I thought 'lizibility' was some specialist term from abstract algebra or something. — jjanes, Apr 21 '21 at 21:33
Absolutly the same with the generated table. Guess the only rational choice now is to reinstall from scratch and check after each step. Thank you for your time. — Liviu Nastase, Apr 22 '21 at 09:23
... and indeed, restarting all installation from scratch was a good choice. Starting with Ubuntu 16.04 + PostgreSQL 13.2, then populate database. After that, upgrade to Ubuntu 18.04 and finally to Ubuntu 20.04. The query that in the past lasted at least 70 milliseconds, now is answering after ... at most 5 milliseconds. Not just the first time, but succesively. I do not know if it was the only choice, however it worked for me. — Liviu Nastase, Apr 25 '21 at 14:53

Poor performance of postgresql KNN queries

0 Answers0