0

I am trying to cluster on spatial locality (Not just create a spatial index), but SQL Server does not allow this. To create a spatial index it first wants me to create a clustered primary key, which nothing makes sense to cluster on. I want to create a spatial index and then cluster on spatial location in some way.

I have an idea to create bins that bin each geometry into a certain bin which then gets some integer. Then set that as the required clustered primary key, that way at least some of my data is clustered close together spatially.

I am kind of baffled SQL server doesnt do this already, so either I am missing out on how to do this or most likely someone has thought of this and someone can proposed a good enough solution.

I want to cluster on spatial location because I am dealing with big data and the first filter I do is by spatial location (creating tiles of maps), without clustering on spatial location my pages are now scattered based on some meaningless auto increment integer.

If a simple implementation of binning by spatial location hasn't been proposed, I figured I could just cut the bounds of my geometry into equal squares and then for each center point run a distance formula that includes all geometries that intersect that bin.

This is not specific to SQL server per say, I am looking for general approaches to solving this index/clustering on spatial location. I assume non-mssql databases may come with this functionality built in.

ParoX
  • 5,685
  • 23
  • 81
  • 152

1 Answers1

0

I don't see how this would be possible, regardless of implementation. Specifically, the idea of a clustering key is so that you (the db engine) can tell the order in which rows should be stored. This is possible with every other datatype (and combination thereof) because ultimately you can say whether a given tuple is bigger, smaller, or equal to another. What metric would you use for generalized spatial data to say that one instance is bigger or smaller than another? Size? Proximity to the origin? Some other measure? There isn't a well-defined sense of that in the general case, and so you can't do it.

But all is not lost. Just assign an arbitrary identifier to your rows (i.e. an identity column or a column populated by a sequence) and cluster on that. Then you can put a spatial index on that and go to town. Looking at your problem, if your bins are pre-defined, you can put those in another table and do a join using STIntersects. But that may be putting the cart before the horse.

Ben Thul
  • 31,080
  • 4
  • 45
  • 68
  • Yeah I cluster on a arbitrary key now and added my spatial index. The spatial index is ignored by SQL server, because it would rather first use a different cluster PK from a table being JOINed and then just do a scan of the PK on my geo table (It does this because right now if it did use my spatial index then data would be scattered). If I could find a way to cluster in some of meaningful spatial way I think SQL server would use my spatial index, because the leaf nodes wont be on such unclustered pages. – ParoX Jun 11 '15 at 01:32
  • Spatial index usage by the optimizer isn't great. Have you tried using a query hint to influence the optimizer to use it? – Ben Thul Jun 11 '15 at 03:35
  • In this case it is making a smart move because the geometry table is clustered randomly (because I clustered on a auto increment key). There are 100 million geometries and it finds 2,000 results. These 2,000 results probably belong to close to 2,000 separate pages because the crap clustering. My assumption is that SQL server realized while the spatial index will do a better job of filtering, its too unclustered and opts for using a filter that not as good, but more clustered. – ParoX Jun 11 '15 at 03:42
  • If by somehow I can cluster on something say like, zipcodes, I would best the number of page reads it has to do is far less, more like 1-100. I dont have zipcodes to these geometries...once option for a solution would be some type of zipcode to long/lat lookup, or something of that nature. – ParoX Jun 11 '15 at 03:47
  • You're focusing a lot on clustering as a solution to all of your performance problems. Expand your mind. As I said before, the optimizer is bad (in my experience) at choosing the spatial index correctly. Also, red flag for using the words "geometry" and "zipcode"; if your data represents points on the globe, you should be using the geography type. – Ben Thul Jun 11 '15 at 16:47
  • The geometry is stored with SRID 3857 so it is pre-projected for use in mapping. Technically this is geography but there is more overhead to storing as geography then geometry and for our use-case storing as geometry and converting to geography when we need it is good for us. Just wanted to point that out, but I will considering other solutions but forcing the optimizer to use the spatial index performs as I expected, the geometry table has 98409 logical reads. – ParoX Jun 11 '15 at 17:06