I'm writing an algorithm for an application that has over 400,000 users.
Tables
User <- holds users and their latitude, longitude, and the last time they were active.
Swipes <- when a user has already seen someone, a record is inserted in here.
My matches algorithm should fetch Users that are within a certain distance from the requesting user, that they have not seen before, and should also fetch a combination of users that have been active and users that have not in a while.
I have tried my best to document the implementation as much as possible so that it is easily understandable. This is my current implementation:
SELECT id,
distance
FROM
(
-- This is done so that the users that returned can be numbered 1 through x partitioned by the buckets and ordered by the distance away.
SELECT id,
distance,
row_number() OVER (PARTITION BY buckets
ORDER BY distance) AS bucket_interval
FROM
(
-- This inner query will fetch all of the users and create a column called "buckets" that will separate users depending on how active they are
SELECT id,
-- This is the "buckets" column
CASE
WHEN now() - last_active_at < interval '1' DAY THEN 1
WHEN now() - last_active_at < interval '2' DAY THEN 2
WHEN now() - last_active_at < interval '5' DAY THEN 3
WHEN now() - last_active_at < interval '10' DAY THEN 4
ELSE 5
END AS buckets,
-- This column gets the distance of the current user to the other users
3958.755864232 * 2 * ASIN(SQRT(POWER(SIN((51.6900092 - users.latitude) * PI() / 180 / 2), 2) + COS(51.6900092 * PI() / 180) * COS(users.latitude * PI() / 180) * POWER(SIN((-8.14594 - users.longitude) * PI() / 180 / 2), 2))) AS distance,
FROM "users"
-- This first condition will make sure people are within the desired distance (0 to 100 miles away in this case)
WHERE (users.latitude BETWEEN -31.15177391077796 AND 134.31179231344798
AND users.longitude BETWEEN -252.66535853758625 AND 163.78387854758624
AND (3958.755864232 * 2 * ASIN(SQRT(POWER(SIN((51.6900092 - users.latitude) * PI() / 180 / 2), 2) + COS(51.6900092 * PI() / 180) * COS(users.latitude * PI() / 180) * POWER(SIN((-8.14594 - users.longitude) * PI() / 180 / 2), 2)))) BETWEEN 0.0 AND 100)
-- This second condition will make sure the current user hasn't swiped through this person already
AND users.id NOT IN
(SELECT "swipes"."connection_id"
FROM "swipes"
WHERE user_id = currentUserId)
) x
) xx
-- This is done because I only want to fetch 50 matches at a time. Since there are "5" buckets... each of them will have 10 people ordered by distance
WHERE bucket_interval <= 10
ORDER BY distance ASC
LIMIT 50
This is implementation in ruby on rails and I'm using a gem called geocoder
to get the actual distance and all of that.
However, the ruby implementation is not as important as how should this query be written to be as optimal as possible.
It's not running fast.
Thanks
Edit: Here's the query plan
[
{
"Plan": {
"Startup Cost": 80238.35,
"Plans": [
{
"Startup Cost": 80238.35,
"Plans": [
{
"Startup Cost": 80238.24,
"Plans": [
{
"Startup Cost": 80238.24,
"Plans": [
{
"Startup Cost": 80238.24,
"Plans": [
{
"Filter": "(((NOT archived) OR (archived IS NULL)) AND (NOT is_suspended) AND (image_urls IS NOT NULL) AND (latitude >= (-90.151773910848)::double precision) AND (latitude <= 199.311792310848::double precision) AND (longitude >= (-255.665358277586)::double precision) AND (longitude <= 243.783878277586::double precision) AND (NOT (hashed SubPlan 1)) AND (NOT (hashed SubPlan 2)) AND ((lower((gender)::text) = 'f'::text) OR (lower((gender)::text) = 'female'::text)) AND ((7917.511728464::double precision * asin(sqrt((power(sin(((((54.5800092::double precision - latitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision) + ((0.579565539469435::double precision * cos(((latitude * 3.14159265358979::double precision) / 180::double precision))) * power(sin((((((-44.9774)::double precision - longitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision)))))) >= 0::double precision) AND ((7917.511728464::double precision * asin(sqrt((power(sin(((((54.5800092::double precision - latitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision) + ((0.579565539469435::double precision * cos(((latitude * 3.14159265358979::double precision) / 180::double precision))) * power(sin((((((-44.94074)::double precision - longitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision)))))) <= 10000::double precision))",
"Startup Cost": 1722.06,
"Plans": [
{
"Startup Cost": 0.11,
"Scan Direction": "Forward",
"Plan Width": 4,
"Node Type": "Index Scan",
"Index Cond": "(user_id = 231415)",
"Plan Rows": 1955,
"Relation Name": "meets",
"Alias": "meets",
"Parent Relationship": "SubPlan",
"Total Cost": 1713.44,
"Subplan Name": "SubPlan 2",
"Index Name": "index_meets_on_user_id"
}
],
"Node Type": "Seq Scan",
"Plan Rows": 4,
"Relation Name": "users",
"Alias": "users",
"Parent Relationship": "Outer",
"Plan Width": 28,
"Total Cost": 80238.23
}
],
"Sort Key": [
"(CASE WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '1 day'::interval day) THEN 1 WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '2 days'::interval day) THEN 2 WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '5 days'::interval day) THEN 3 WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '10 days'::interval day) THEN 4 ELSE 5 END)",
"((7917.511728464::double precision * asin(sqrt((power(sin(((((54.5800092::double precision - users.latitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision) + ((0.579565539469435::double precision * cos(((users.latitude * 3.14159265358979::double precision) / 180::double precision))) * power(sin((((((-5.94074)::double precision - users.longitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision)))))))"
],
"Plan Rows": 4,
"Node Type": "Sort",
"Parent Relationship": "Outer",
"Plan Width": 28,
"Total Cost": 80238.24
}
],
"Plan Rows": 4,
"Node Type": "WindowAgg",
"Parent Relationship": "Subquery",
"Plan Width": 28,
"Total Cost": 80238.33
}
],
"Node Type": "Subquery Scan",
"Plan Rows": 1,
"Filter": "(xx.bucket_interval <= 10)",
"Alias": "xx",
"Parent Relationship": "Outer",
"Plan Width": 12,
"Total Cost": 80238.35
}
],
"Sort Key": [
"xx.distance"
],
"Plan Rows": 1,
"Node Type": "Sort",
"Parent Relationship": "Outer",
"Plan Width": 12,
"Total Cost": 80238.35
}
],
"Plan Rows": 1,
"Node Type": "Limit",
"Plan Width": 12,
"Total Cost": 80238.35
}
}
]