2

The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.

Suppose we have a national database of music teachers with 250,000 rows:

  teacherName, address, city, state, zipcode, geolocation, primaryInstrument

where the geolocation column is a geography::point datatype with optimally tesselated index.

User wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.

But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.

Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.

So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?

I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.

P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.

EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.

select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number() 
over(partition by 
        STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress 
      order by DistanceInMiles)
from
(
select
SINGER.name, 
SINGER.streetaddress, 
SINGER.city, 
SINGER.state, 
SINGER.zipcode, 
TEACHERS.name as TEACHERname, 
TEACHERS.streetaddress as TEACHERaddress, 
TEACHERS.city as TEACHERcity, 
TEACHERS.state as TEACHERstate, 
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
            / (1.6 * 1000) as DistanceInMiles
from 
SINGER left join TEACHERS
on 
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))   
     < (SINGER.radius * (1.6 * 1000 ))
 and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6    -- closest 5 is an abitrary requirement given to us
Tim
  • 8,669
  • 31
  • 105
  • 183
  • Could you post your query (and may be some test data) so we can try to run it? I've tried ~100000 rows on my desktop and it perform relatively well (on random data). I also think that index on primaryInstrument column could speed up the query. – Roman Pekar Jul 01 '13 at 20:05
  • The problem is that random data will probably perform better than actual data, since in metro areas there's much higher population density. Retrieving every piano teacher within 50 miles of NYC is a lot different than retrieving every piano teacher within 50 miles of Fargo, North Dakota (~Siberia of USA). – Tim Jul 01 '13 at 22:07
  • Also the main issue is that Top 5 is an absolute requirement, but is impossible to know which radius will encompass 5. – Tim Jul 01 '13 at 22:09

1 Answers1

1

I think may be if you need just to get closest 5 teachers regardless of radius, you could write something like this. The Student will duplicate 5 time in this query, I don't know what do you want to get.

select
    S.name, 
    S.streetaddress, 
    S.city, 
    S.state, 
    S.zipcode, 
    T.name as TEACHERname, 
    T.streetaddress as TEACHERaddress, 
    T.city as TEACHERcity, 
    T.state as TEACHERstate, 
    T.zipcode as TEACHERzip,
    T.teacherid,
    T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
    / (1.6 * 1000) as DistanceInMiles
from SINGER as S
    outer apply (
        select top 5 TT.*
        from TEACHERS as TT
        where TT.primaryInstrument='voice'
        order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
    ) as T
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
  • Your query is viable and it performs well much of the time. But the more popular the instrument, the slower it gets. When the primaryInstrument is very popular e.g. "piano", the subset returned when filtering on primaryInstrument has many rows. Perhaps 30% of music teachers teach piano, so 75,000 rows must be sorted 250 times by geographic distance, a calculated value. The uneven distribution of primaryInstrument is why I made that column an ancillary filter on the join, not the primary filter. "5 closest" is a difficult totally arbitrary requirement handed to me. – Tim Jul 02 '13 at 10:45
  • Well it's interesting problem, but it's hard to get good results when I don't have your data to test what's work :( – Roman Pekar Jul 02 '13 at 10:58
  • Accepting the answer because it does solve the main problem of the unknowable minimum radius by sidestepping it, although with a performance penalty. – Tim Jul 03 '13 at 16:55
  • @Tim recursive retrievals are generally slow – Grijesh Chauhan Jul 08 '13 at 13:32