3

For my master thesis I am analyzing several algorithms that could be useful for a mobile service provider (test data sets are based on a mobile music school) to find the optimal teacher for a new student taking the locations of a teacher's existing students into account.

The attached code provides correct results for a simple KNN (k-nearest neighbor) search avoiding duplicates.

As "DISTINCT ON" requires st.teacher_id to be included in the ORDER BY clause, the R-Tree-Index I have on my geometry column "address_transform" is not used. This leads to a very bad performance once table size gets larger (100k rows for the student table), the geometry gets more complex, etc.

Any ideas how to rewrite the function so that the index gets used?

CREATE OR REPLACE FUNCTION thesis_knn_distinct (q_neighbors_no integer, q_latitude numeric, q_longitude numeric, q_instrument text, q_student_table character varying)
RETURNS TABLE (
student_id TEXT,
teacher_id TEXT,
distance DOUBLE PRECISION,
instrument TEXT[]
)
AS $$
DECLARE
    location_txt varchar(50) := 'SRID=4326;POINT('||q_longitude||' '||q_latitude||')';
    teacher_table varchar(25);
BEGIN
    IF q_student_table LIKE 'student_hesse%' THEN
        teacher_table = 'teacher_synth_large';
    ELSIF [...]
    END IF;
RETURN QUERY EXECUTE 'WITH teacher_filter AS (
    SELECT DISTINCT ON (st.teacher_id) st.id, st.teacher_id, ST_DistanceSphere(address_box, $2::geometry) AS distance, te.instrument 
        FROM '|| q_student_table::regclass ||' st INNER JOIN '|| teacher_table::regclass ||' te 
        ON st.teacher_id = te.id 
        WHERE te.instrument @> ARRAY[$1]::text[] 
        ORDER BY st.teacher_id, st.address_transform <-> ST_Transform($2::geometry,3857)
    )   
    SELECT * FROM teacher_filter 
        ORDER BY distance
        LIMIT $3;'
    USING q_instrument, location_txt, q_neighbors_no;       
END; $$

LANGUAGE 'plpgsql';

Annotations:

  • I'm using a dynamic query as I'm testing with several tables of real/synthetic data (indexed, non-indexed, clustered, etc.)

  • I am aware of the possibility to set configuration parameters like enable_seqscan but that's not really a permanent solution to my problem

  • As an alternative I have already implemented a (pretty fast) variation where I pre-select a multiple of the required neighbors via simple KNN and then remove duplicates in a second step. This works ok for a purely distance-related approach but the pre-selection does not necessarily contain the best matches if other parameters apart from distance are taken into account at a later step as well.

  • I am using postgres 10.4, postgis 2.4.4

lem_s
  • 31
  • 2

0 Answers0