9

I'm trying a little experiment at pushing a data set which is not geo-spatial but fits it quite well and am finding the results somewhat unsettling. The data set is genomic data e.g. the Human Genome where we have a region of DNA where elements like genes occupy specific start and stop coordinates (our X axis). We have multiple regions of DNA (chromosomes) which occupy the Y axis. The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate e.g. LineString(START 1, END 2).

The theory seemed sound so I pushed it into an existing MySQL based genome project and came up with a table structure like:

CREATE TABLE `spatial_feature` (
  `spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` int(10) unsigned NOT NULL,
  `external_type` int(3) unsigned NOT NULL,
  `location` geometry NOT NULL,
  PRIMARY KEY (`spatial_feature_id`),
  SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;

external_id represents the identifier of the entity we have encoded into this table & external_type encodes the source of this. Everything looked good and I pushed in some preliminary data (30,000 rows) which seemed to work well. When this increased past the 3 million row mark MySQL refused to use the spatial index and was slower when it was forced to use it (40 seconds vs. 5 seconds using a full table scan). When more data was added the index started to be used but the performance penalty persisted. Forcing the index off brought the query down to 8 seconds. The query I'm using looks like:

select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

The data going into this is be very dense along the Y dimensions (think of it like you've recorded the position of every building, telephone box, post box and pigeon on a very long road). I've done tests of how R-Indexes behave with this data in Java as well as others in the field have applied them to flat-file formats with success. However no one has applied them to databases AFAIK which is the goal of this test.

Has anyone out there seen a similar behaviour when adding large quantities of data to a spatial model which is not very disparate along a particular axis? The problem persists if I reverse the coordinate usage. I'm running the following setup if that's a cause

  • MacOS 10.6.6
  • MySQL 5.1.46

Help!

Also bringing in explain plan in

+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table           | type | possible_keys   | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | spatial_feature | ALL  | sf_location_idx | NULL | NULL    | NULL | 3636060 |    33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

The re-written SQL looks like the following

select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)

Still not highlighting why this query's performance is so poor

After reading the article posted by @Fraser from rickonrails it seems like the problem is all to do with the index not being in memory. If I apply similar techniques to those mentioned in the article (making key buffer very big indeed) and I then force the query to use the index query times plumet. We still see a lag between querying a region & then searching for a subset of the region but it's all pointing to getting the load of the indexes correct.

What's the moral of the story? R-Indexes in MySQL have quite poor performance until they are in memory and then they have excellent performance. Not really a good solution for what I wanted to do wit them but still it provides an interesting angle on MySQL.

Thanks for all the help people.

andeyatz
  • 428
  • 3
  • 9
  • you might get an answer at http://gis.stackexchange.com – dassouki Jan 21 '11 at 11:40
  • Cheers for the info will do a post on there as well – andeyatz Jan 21 '11 at 12:46
  • Can you post results of this query: EXPLAIN EXTENDED select count(*) from spatial_feature where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location); This would show how MySQL is executing it. That might highlight the bottleneck. – Shamit Verma Mar 14 '11 at 09:02
  • Extending the question to give the results of the explain plan. The plan still does not highlight any real reason why the performance is so poor. – andeyatz Mar 14 '11 at 20:17
  • Read this fella - http://rickonrails.wordpress.com/2009/03/30/big-ole-mysql-spatial-table-optimization-tricks/ – Fraser Mar 16 '11 at 16:15
  • So it looks like the issue is with the way the index is structured on disk & that MySQL has been unable to load it all into memory. I'll try a full load into memory & will post back the results. – andeyatz Mar 21 '11 at 10:17
  • Can also take a look at http://dev.mysql.com/doc/refman/5.0/en/using-spatial-indexes.html – Amit Jul 02 '14 at 11:08

4 Answers4

1

From the EXPLAIN planwe see that although the spatial might be used for the query ('possible_keys' column) , it is not used (NULL in 'key' column). I am not sure why it is not selected automatically, but you may explicitly instruct MySql to use the index by specifying it in the query using a 'force index' clause:

select count(*)
from spatial_feature 
force index (sf_location_idx) --  <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);
Amnon
  • 2,212
  • 1
  • 19
  • 35
  • BTW, my mysql version is 5.5 – Amnon May 13 '14 at 12:50
  • This is the answer I was thinking too. I know in older versions of MySQL that index hints didn't always do what I expected - not sure if that fault was in the implementation or my expectations, but it would be interesting to see if this works for the OP. – ratsbane Mar 04 '15 at 00:09
0

I have a degree in Genetics and I am a programmer, you don't need be using an X and a Y as your nomenclature it'll get far too faffy... you need a start and a stop position (not an "axis") and a chromosome number. You index by the chromosome number first then the position and then you index the position then the chromosome. (Question: Are you dealing in eukaryotes or chromosomes that can have two reading frames?)

EG: (where "x" = position and "y" = chromosome)

CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);

Incidentally Chromosomes are very long strings (just like data) you can (to speed things up dump them as such as blobs (i.e. coding genes and junk DNA)

conners
  • 1,420
  • 4
  • 18
  • 28
  • I was attempting to simplify the issue a bit more to try to reach a much large audience in the hope that it is something we bioinformaticians have gotten wrong. It seems the issue isn't in creating a spatial index which allows for this kind of searching (I've applied the same technique in programming memory to great success). More is it possible to use MySQL's spatial extensions. It seems even though they can outperform any kind of indexing for this type of data (binning, linear index or max feature size) the index *must* be in memory. That's a deal breaker if you've got 2300 DBs on 1 server – andeyatz Jun 27 '12 at 10:44
  • 1
    MySQL is compiled with certain "defaults" if you are brave and feel capable of going down this rabbit hole there are the forge.mysql guys.. you can compile mysql do obey certain mechanics.. it's hard core but it might be worth it http://forge.mysql.com/wiki/Top10SQLPerformanceTips – conners Jun 27 '12 at 15:13
  • there is a thing called ZFS there is a paper of getting ZFS to run in MySQL http://forge.mysql.com/w/images/9/9e/MySQL_and_ZFS.pdf it's basically the dogs b******** – conners Jun 27 '12 at 15:19
  • Thanks for the links. The ZFS+MySQL looks really good (not surprising considering it's Sun originally). We're also investigating lots of other things just to see if the overall performance of MySQL can be improved but if a new indexing strategy can be applied that's faster than the current strategy then that's a game changer in my book :) – andeyatz Jun 28 '12 at 09:26
0

Are you sure a relational database is the way to go? If I were you I'd look at pushing your datasets to Solr or Elastic Search (probably storing the master datasets elsewhere). These engines are built for indexing, you will notice the difference in response times.

NickJHoran
  • 597
  • 4
  • 13
0

The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate

Have you considered using an index wit multiple fields? Like:

CREATE INDEX spacial_search ON spatial_feature(y, x)

If you are working with a limited set of y values this is the way to.

vbence
  • 20,084
  • 9
  • 69
  • 118