GiST index on Postgres/PostGIS still slow

Question

I am not an expert in Postgres/GIS, and I have a problem with a large database (over 20 million records) of geometries. My setup looks like this:

mmt=# select version();
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------
version | PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

mmt=# select PostGIS_Version();
-[ RECORD 1 ]---+--------------------------------------
postgis_version | 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1

The table that I am querying contains the following columns:

mmt=# \d titles
                                              Table "public.titles"
        Column        |           Type           | Collation | Nullable |                 Default                 
----------------------+--------------------------+-----------+----------+-----------------------------------------
 ogc_fid              | integer                  |           | not null | nextval('titles_ogc_fid_seq'::regclass)
 wkb_geometry         | bytea                    |           |          | 
 timestamp            | timestamp with time zone |           |          | 
 end                  | timestamp with time zone |           |          | 
 gml_id               | character varying        |           |          | 
 validfrom            | character varying        |           |          | 
 beginlifespanversion | character varying        |           |          | 
 geom_bounding_box    | geometry(Geometry,4326)  |           |          | 
Indexes:
    "titles_pkey" PRIMARY KEY, btree (ogc_fid)
    "geom_idx" gist (geom_bounding_box)

The geom_bounding_box column holds the bounding box of the wkb_geometry. I created that bounding box column because the wkb geometries exceed the default size limit for items in a GiST index. Some of them are quite complex geometries with several dozens of points making up a polygon. Using a bounding box instead meant I was able to put an index on that column as a way of speeding up the search. At least, that's the theory.

My search aims to find geometries that fall within 100 metres of a given point, using the query below. However, it takes well over two minutes to return, and I want to get that under one second:

select ogc_fid, wkb_geometry from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);

Below is a basic explain output. What can I do to speed this thing up?

Thank you!

mmt=# explain select ogc_fid from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..243806855.33 rows=2307 width=4)
   Workers Planned: 2
   ->  Parallel Seq Scan on titles  (cost=0.00..243805624.63 rows=961 width=4)
         Filter: st_dwithin((geom_bounding_box)::geography, '0101000020E61000006878B306EFABC2BF6308008E3DC14940'::geography, '100'::double precision, true)
 JIT:
   Functions: 4
   Options: Inlining true, Optimization true, Expressions true, Deforming true

"several dozens of points" is not large or complex for PostGIS - I would simply index the geometry column, PostGIS will use the bounding box automatically. — Ian Turton, May 11 '21 at 07:35
@IanTurton mmt=# create index geometry_idx on titles using gist(wkb_geometry); ERROR: index row requires 16368 bytes, maximum size is 8191. Hence the need to squash down the size of those geometry records, which is what the bounding box is for. The index alone can't cope with the object sizes, it seems. — planetguru, May 11 '21 at 20:20
Which version of PostGIS is that? I have never seen an issue with much larger geometries than that. — Ian Turton, May 12 '21 at 07:50
@IanTurton - mmt=# select PostGIS_Version(); postgis_version --------------------------------------- 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 — planetguru, May 12 '21 at 08:00

Laurenz Albe · Answer 1 · 2021-05-12T04:50:22.443

5

The problem is that you are mixing geometry and geography, and PostgreSQL casts geom_bounding_box to geography so that they match.

Now you have indexed geom_bounding_box, but not geom_bounding_box::geography, which is something different.

Either use 'SRID=4326;POINT(-0.145872 51.509691)'::geometry as second operand or create the GiST index on ((geom_bounding_box::geography)) (note the double parentheses).
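As a minimal sketch of those two options, assuming the table and column names from the question: the index name titles_bbox_geog_idx is made up for illustration, and 0.001 is only a rough degree equivalent of 100 m, since geometry distances in SRID 4326 are measured in degrees rather than meters.

```sql
-- Option 1: keep both operands as geometry, so the existing GiST index
-- on geom_bounding_box can be used.
-- Assumption: 0.001 degrees as a very rough stand-in for 100 m.
SELECT ogc_fid
FROM titles
WHERE ST_DWithin(
    geom_bounding_box,
    'SRID=4326;POINT(-0.145872 51.509691)'::geometry,
    0.001
);

-- Option 2: index the geography expression (note the double parentheses),
-- then query with the same expression and a distance in meters.
-- Hypothetical index name.
CREATE INDEX titles_bbox_geog_idx
    ON titles USING gist ((geom_bounding_box::geography));

SELECT ogc_fid
FROM titles
WHERE ST_DWithin(
    geom_bounding_box::geography,
    'SRID=4326;POINT(-0.145872 51.509691)'::geography,
    100
);
```

Option 2 keeps the distance in meters at the cost of a second index; Option 1 reuses the existing index but needs the distance expressed in degrees.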

edited May 12 '21 at 04:50

answered May 10 '21 at 17:56

Laurenz Albe


Thank you Laurenz for the response. I have rebuilt the bounding_box and index and the query now has matching types (casting point to a geometry), but I still see a full scan: mmt=# explain select count(*) from titles where ST_DWithin(bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geometry, 100.0); QUERY PLAN ------------------- Finalize Aggregate (cost=294582.80..294582.81 rows=1 width=8) -> Gather (cost=294582.58..294582.79 rows=2 width=8) – planetguru May 11 '21 at 19:27

Then PostgreSQL thinks that using the index is more expensive. You could disable parallel query (`max_parallel_workers_per_gather = 0`) or lower `random_page_cost`. – Laurenz Albe May 12 '21 at 04:53

As amanin points out below, if you run st_dwithin on geometries, the distance is measured in degrees for SRID 4326, which I doubt is what you want. – mlinth May 12 '21 at 13:41
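The planner settings mentioned in the comments above can be tried for a single session before changing anything globally. This is a sketch only: the value 1.1 for random_page_cost is an assumed example (often suggested for SSD storage), not a recommendation from this thread, and the bounding_box column name is taken from the comment above.

```sql
-- Session-local experiments; nothing here persists after disconnect.
SET max_parallel_workers_per_gather = 0;  -- disable parallel query
SET random_page_cost = 1.1;               -- assumed example value

-- Compare the actual plan and timing with the new settings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM titles
WHERE ST_DWithin(
    bounding_box,
    'SRID=4326;POINT(-0.145872 51.509691)'::geometry,
    100.0
);

-- Return to the server defaults afterwards.
RESET max_parallel_workers_per_gather;
RESET random_page_cost;
```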

amanin · Answer 2 · 2021-05-12T16:06:32.170

EDIT: As pointed out by mlinth, my answer below is not really valid. It does highlight a pitfall, though: be careful with the arguments passed to ST_DWithin, because the unit of the distance argument depends on whether you pass geographies (meters) or geometries (SRID units).

According to the ST_DWithin documentation, the distance is specified in the units of the SRID. In your case, the spatial reference system is a geographic one, so your value of 100 means a radius of 100 degrees, not 100 meters. That covers approximately the entire world. In that case, using the index efficiently is impossible.

If you want to find geometries within a 100 meter radius, you must convert 100 meters to degrees, but that conversion depends on latitude (if you want to be accurate).

To start, I'd recommend a (very) approximate shortcut: 100 meters at the equator is (very) approximately equal to 0.001 degrees. Replace your distance value with that, and if it speeds things up (and I'm fairly convinced it will), you can then refine your query to be more accurate.
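To illustrate the latitude dependence, here is a rough back-of-the-envelope conversion in plain SQL. It assumes a spherical approximation of about 111,320 m per degree at the equator, which is my own figure and not from the PostGIS documentation; the latitude is the one from the question's point.

```sql
-- Rough conversion of 100 m into degrees (spherical approximation).
-- Assumption: one degree of latitude is about 111,320 m everywhere, and one
-- degree of longitude shrinks by cos(latitude).
SELECT
    100.0 / 111320.0                             AS deg_north_south,
    100.0 / (111320.0 * cos(radians(51.509691))) AS deg_east_west;
```

At the latitude in the question, 100 m east-west corresponds to noticeably more degrees than 100 m north-south, which is why a single degree value is only an approximation of a circular search radius.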

If you look at the query plan, Postgres is actually casting the geom to geography, so in this case the argument is indeed 100m (the geography version of st_dwithin uses distance in m). The cast is what is killing the performance. But, your basic point is right, I think; either use geometries with an appropriate projection, or geographies and make sure the indexes are right. — mlinth, May 12 '21 at 13:35

score 3 · Accepted Answer · answered May 13 '21 at 21:39

I did resolve this. It took a combination of all of the above; no single change was enough on its own. As a quick summary:

Laurenz Albe was right in spotting the mix of geography and geometry types, which was easy to fix by removing the cast.

Ian Turton was also right in spotting that dozens of points shouldn't be an issue for a gist index, so I abandoned the bounding box approximation approach and went back to exploring the index issues. What I found was that the geometry column was defined with a data type of 'byte array' (bytea), which prevents creation of an spgist index with the error 'no default operator class for access method "spgist"'. This was resolved by changing the column type as follows:

mmt=# ALTER TABLE titles
ALTER COLUMN wkb_geometry
TYPE geometry
USING wkb_geometry::geometry;

The index then creates successfully (either gist or spgist) and I have been able to benchmark the two side by side, finding gist to be slightly more efficient in my use-case.
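For reference, a sketch of what the index creation and a follow-up check might look like once the column is a real geometry. The index names are illustrative, 0.001 degrees is again only a rough stand-in for 100 m, and the query assumes the stored geometries carry SRID 4326 (ST_DWithin raises an error on mixed SRIDs).

```sql
-- GiST index directly on the geometry column (index name is illustrative).
CREATE INDEX geometry_idx ON titles USING gist (wkb_geometry);

-- SP-GiST alternative for side-by-side benchmarking (hypothetical name).
-- CREATE INDEX geometry_spgist_idx ON titles USING spgist (wkb_geometry);

-- Refresh planner statistics after the type change and index build.
ANALYZE titles;

-- Check that the plan now uses an index scan rather than a sequential scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ogc_fid
FROM titles
WHERE ST_DWithin(
    wkb_geometry,
    'SRID=4326;POINT(-0.145872 51.509691)'::geometry,
    0.001  -- degrees, not meters, because both operands are geometry
);
```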

amanin was also right to point out the difference between meters and degrees depending on the spatial reference system. In some of my tests I was erroneously using degrees, but with very large radius values. Since I'm indexing and searching with geometry types, the radius needs to be a very small value in degrees, because even a small fraction of a degree covers quite a large area. Fixed!

Putting it all together, searches across 26 million records now consistently complete in 200ms to 500ms, with occasional spikes up to 1.1s. This is pretty good.

Thanks to all who contributed input, ideas and discussion.
