
Firstly, I admit that my experience with spatial functions is very minimal. I have a MySQL table with 20 fields and 23,549,187 records that contain geographical data. One of the fields, `point`, is of the POINT data type and has a spatial index on it. I have a query that selects all points within a polygon, which looks like this:

SELECT *
FROM `table_name`
WHERE ST_Contains(
    ST_GeomFromText('POLYGON((151.186 -23.497,151.207 -23.505,151.178 -23.496,151.174 -23.49800000000001,151.176 -23.496,151.179 -23.49500000000002,151.186 -23.497))'),
    `point`
);

This works well when the polygon is small. However, when the polygon gets massive, execution gets really slow; the slowest query so far ran for 15 minutes. Adding the index really helped bring it down to 15 minutes; without it, the query would have taken close to an hour. Is there anything I can do here for further improvement? This query will be run by a PHP script that runs as a daemon, and I am worried that these slow queries will bring the MySQL server down.

All suggestions to make it better are welcome. Thanks.

EDIT:

SHOW CREATE TABLE `table_name`;

CREATE TABLE `table_name` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `lat` float(12,6) DEFAULT NULL,
  `long` float(12,6) DEFAULT NULL,
  `point` point NOT NULL,
  PRIMARY KEY (`id`),
  KEY `lat` (`lat`,`long`),
  SPATIAL KEY `sp_index` (`point`)
) ENGINE=MyISAM AUTO_INCREMENT=47222773 DEFAULT CHARSET=utf8mb4

There are a few more fields that I am not supposed to disclose here; however, the filter won…

`EXPLAIN` output for the slow query:

+----+-------------+------------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+------------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | table_name | ALL  | NULL          | NULL | NULL    | NULL | 23549187 | Using where |
+----+-------------+------------+------+---------------+------+---------+------+----------+-------------+

`EXPLAIN` output for a query with a smaller polygon:

+----+-------------+------------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table      | type  | possible_keys | key      | key_len | ref  | rows | Extra       |
+----+-------------+------------+-------+---------------+----------+---------+------+------+-------------+
|  1 | SIMPLE      | table_name | range | sp_index      | sp_index | 34      | NULL |    1 | Using where |
+----+-------------+------------+-------+---------------+----------+---------+------+------+-------------+

Looks like the biggest polygon does not use the index.

user1960364
Rohith Mohan
    You might have better luck moving this to https://dba.stackexchange.com or possibly https://gis.stackexchange.com/ – miken32 Nov 07 '18 at 02:14
  • `SHOW CREATE TABLE` and `EXPLAIN` please. – bishop Nov 07 '18 at 02:40
  • @miken32 Thanks for the suggestion. Might post it there as well. – Rohith Mohan Nov 07 '18 at 03:34
  • @bishop Updated the question with `CREATE TABLE` and `EXPLAIN` – Rohith Mohan Nov 07 '18 at 07:19
  • With massive, do you mean big size (big bounding box, like covering the whole earth), and/or a complicated shape of thousands of points, and/or containing a lot of/all of the points? MySQL seems to have decided on the latter one, can you verify that this is the case (e.g. that you get a large portion of your 23 million points in your result set)? – Solarflare Nov 07 '18 at 07:55
  • @Solarflare The polygon has a complicated shape with 6065 points. – Rohith Mohan Nov 08 '18 at 22:36
  • The slow one or both? The more important question is how many points are in it, and specifically how many points are in the bounding box of the polygon (if you make a square around the polygon, how many points would be in it - that is the part the index provides). There might not be a simple way to speed up your query, though, but it would help to have a general idea about the polygon shape (see bishop's answer): if your polygon e.g. describes a circle, you can get rid of 60% of your comparisons by using an inner square (easy for the index), and then union the remaining 4 circle parts to it. – Solarflare Nov 09 '18 at 12:50
  • @Solarflare Apologies for not phrasing it properly. What I meant was, the polygon that makes the query slow is formed by 6065 points. The query should return around 350,000 records, meaning there are around 350,000 points within the polygon. However, I am not sure how many points would fall in the bounding box. The shape of the polygon would differ every time, with no guarantee of it being a square or circle all the time. I am really a beginner in this, so please do correct me if I haven't understood it right. – Rohith Mohan Nov 10 '18 at 23:53
  • These are some large numbers. Checking probably 500k points (those in the vicinity of your polygon) against a 6k-point polygon will take significant time. I am not sure if you can optimize your query from a general viewpoint. You may need to incorporate other factors about your situation. E.g.: if your polygon is a 6k polygon of Florida, and you want to know which of your customers is inside it, you could store/precalculate that someone lives in Miami, and then you only need to check if Miami is in Florida to get 30% of your results in one comparison. – Solarflare Nov 11 '18 at 12:04
  • Or maybe you can simplify your polygon, as you do not need 4k points to describe the coastline of Key West in detail - there is no reasonable expectation that someone actually lives in the Gulf of Mexico, so 10 points instead of 4k might suffice. There likely will be such a clustering or reasoning for your problem; the hard task might be to find one. You would need to add details about your problem, and/or you may want to describe your problem on https://gis.stackexchange.com/, maybe they have an idea. – Solarflare Nov 11 '18 at 12:04
  • @Solarflare The application allows users to select an area from a map, and it is supposed to return the customers in the selected area. Users can select any sized polygon, and that is what causes the entire problem. Thanks for your suggestion. I might open this issue on gis.stackexchange.com and see if they have any idea. – Rohith Mohan Nov 11 '18 at 23:20
  • Cross-posted as https://gis.stackexchange.com/q/302220/115 – PolyGeo Nov 12 '18 at 01:01

1 Answer


MySQL uses R-Trees for indexing spatial data. Like B-Tree indexes, these are optimal for queries targeting a small subset of the total rows. As your bounding polygon gets larger, the number of possible matches increases and, at some point, the optimizer decides it is more efficient to switch to a full table scan. That appears to be the scenario here, and I see three options:

First, try adding a LIMIT to your query. Normally, MySQL ignores an index if the optimizer concludes a full table scan would need fewer I/O seeks. But, with B-Tree indexes at least, MySQL short-circuits that logic and always performs the B-Tree dive when LIMIT is present. I hypothesize R-Trees have similar short-circuiting.

Second, and similar in spirit to the first, try forcing MySQL to use the index. This instructs MySQL that the table scan is more expensive than the optimizer decides. Understand that the optimizer only has heuristics and doesn't really know how "expensive" things are beyond what its internal statistics conclude. We humans have intuition, which sometimes - sometimes - knows better.

SELECT *
FROM `table_name` FORCE INDEX (`sp_index`)
WHERE ST_Contains(
    ST_GeomFromText('POLYGON((151.186 -23.497,151.207 -23.505,151.178 -23.496,151.174 -23.49800000000001,151.176 -23.496,151.179 -23.49500000000002,151.186 -23.497))'),
    `point`
);

Finally, if those don't work, then what you need to do is break up your bounding polygon into smaller polygons. For example, if your bounding polygon is a square 500km per side, break it up into 4 squares 250km on each side, or 16 squares 125km per side, etc. Then UNION all of these together. The index will be used on each one, and the cumulative result may be faster. (Note it's important to UNION them together: MySQL cannot apply multiple range scans on a spatial query.)
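To illustrate that last option with the 500km-square example: the sketch below (Python used only to assemble the SQL string; the `table_name` and `point` identifiers follow the question, everything else is illustrative) splits an axis-aligned box into four quadrants and stitches the per-quadrant queries together with UNION.

```python
def quadrants(min_x, min_y, max_x, max_y):
    """Split an axis-aligned box into its four quadrant boxes."""
    mx, my = (min_x + max_x) / 2.0, (min_y + max_y) / 2.0
    return [
        (min_x, min_y, mx, my),
        (mx, min_y, max_x, my),
        (min_x, my, mx, max_y),
        (mx, my, max_x, max_y),
    ]

def box_wkt(x1, y1, x2, y2):
    """WKT POLYGON for an axis-aligned box (the ring must close on itself)."""
    return f"POLYGON(({x1} {y1},{x2} {y1},{x2} {y2},{x1} {y2},{x1} {y1}))"

def union_query(table, column, bbox):
    """One SELECT per quadrant, joined with UNION (not UNION ALL) so points
    lying exactly on a shared box edge are not returned twice."""
    selects = [
        f"SELECT * FROM `{table}` WHERE ST_Contains("
        f"ST_GeomFromText('{box_wkt(*q)}'), `{column}`)"
        for q in quadrants(*bbox)
    ]
    return "\nUNION\n".join(selects)

print(union_query("table_name", "point", (151.0, -24.0, 152.0, -23.0)))
```

Each sub-query's polygon has a much smaller bounding box, which is the part the R-Tree index can actually use.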

bishop
  • That was really informative. I did try forcing the index, but it did not make any difference and in fact delayed the execution. I will try to find a way to break up the polygon and do the union. Since it is not a square all the time, it will not be that easy. – Rohith Mohan Nov 07 '18 at 23:51
  • @RohithMohan [There are algorithms for partitioning polygons.](https://geidav.wordpress.com/2015/03/21/splitting-an-arbitrary-polygon-by-a-line/) The basic idea is to divide into halves of near equal area and then to repeat that process on each half, however many times as are necessary. – bishop Nov 08 '18 at 00:42
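For reference, the "halves of near equal area" step in that partitioning approach can be sketched with plain geometry: clip the polygon against a vertical half-plane (Sutherland-Hodgman) and binary-search the line position until the left piece holds half the area. A rough sketch, assuming a simple (non-self-intersecting) polygon given as (x, y) vertex pairs:

```python
def shoelace_area(poly):
    """Unsigned area of a simple polygon given as [(x, y), ...]."""
    s = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def clip_left(poly, c):
    """Sutherland-Hodgman clip of the polygon against the half-plane x <= c."""
    out = []
    n = len(poly)
    for i in range(n):
        p, q = poly[i], poly[(i + 1) % n]
        p_in, q_in = p[0] <= c, q[0] <= c
        if p_in:
            out.append(p)
        if p_in != q_in:
            # edge pq crosses the vertical line x = c; keep the intersection
            t = (c - p[0]) / (q[0] - p[0])
            out.append((c, p[1] + t * (q[1] - p[1])))
    return out

def bisect_by_area(poly, iters=60):
    """Binary-search a vertical line x = c so the left piece has half the area."""
    xs = [p[0] for p in poly]
    lo, hi = min(xs), max(xs)
    half = shoelace_area(poly) / 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if shoelace_area(clip_left(poly, mid)) < half:
            lo = mid   # left piece too small: move the line right
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Applying the same split recursively to each half (alternating vertical and horizontal cut lines) yields the progressively smaller pieces the answer suggests UNIONing together.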