
I'm writing an algorithm for an application that has over 400,000 users.

Tables

User <- holds users and their latitude, longitude, and the last time they were active.
Swipes <- when a user has already seen someone, a record is inserted here.
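
For reference, here is a minimal sketch of the two tables with only the columns the query below touches (the real tables have more columns, as the query plan further down shows); the column types are assumptions:

CREATE TABLE users (
    id             serial PRIMARY KEY,
    latitude       double precision,
    longitude      double precision,
    last_active_at timestamp
);

CREATE TABLE swipes (
    user_id       integer REFERENCES users (id), -- the user who swiped
    connection_id integer REFERENCES users (id)  -- the user they saw
);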

My matches algorithm should fetch users that are within a certain distance of the requesting user and that the requesting user has not seen before, and it should also return a mix of users that have been active recently and users that have not been active in a while.

I have tried my best to document the implementation as much as possible so that it is easily understandable. This is my current implementation:

SELECT id,
       distance
FROM
  ( 

  -- This is done so that the users returned can be numbered 1 through x, partitioned by the buckets and ordered by distance.
  SELECT id,
         distance,
         row_number() OVER (PARTITION BY buckets
                              ORDER BY distance) AS bucket_interval

   FROM
     (

     -- This inner query will fetch all of the users and create a column called "buckets" that will separate users depending on how active they are
      SELECT id,


            -- This is the "buckets" column
             CASE
                 WHEN now() - last_active_at < interval '1' DAY THEN 1
                 WHEN now() - last_active_at < interval '2' DAY THEN 2
                 WHEN now() - last_active_at < interval '5' DAY THEN 3
                 WHEN now() - last_active_at < interval '10' DAY THEN 4
                 ELSE 5
             END AS buckets,


             -- This column gets the distance of the current user to the other users
             3958.755864232 * 2 * ASIN(SQRT(POWER(SIN((51.6900092 - users.latitude) * PI() / 180 / 2), 2) + COS(51.6900092 * PI() / 180) * COS(users.latitude * PI() / 180) * POWER(SIN((-8.14594 - users.longitude) * PI() / 180 / 2), 2))) AS distance

      FROM "users"


      -- This first condition will make sure people are within the desired distance (0 to 100 miles away in this case)

      WHERE (users.latitude BETWEEN -31.15177391077796 AND 134.31179231344798
             AND users.longitude BETWEEN -252.66535853758625 AND 163.78387854758624
             AND (3958.755864232 * 2 * ASIN(SQRT(POWER(SIN((51.6900092 - users.latitude) * PI() / 180 / 2), 2) + COS(51.6900092 * PI() / 180) * COS(users.latitude * PI() / 180) * POWER(SIN((-8.14594 - users.longitude) * PI() / 180 / 2), 2)))) BETWEEN 0.0 AND 100)


        -- This second condition will make sure the current user hasn't swiped through this person already

        AND users.id NOT IN
               (SELECT "swipes"."connection_id"
                FROM "swipes"
                WHERE user_id = currentUserId)
    ) x 
  ) xx

-- This is done because I only want to fetch 50 matches at a time. Since there are "5" buckets... each of them will have 10 people ordered by distance
WHERE bucket_interval <= 10
ORDER BY distance ASC
LIMIT 50

This is implemented in Ruby on Rails, and I'm using a gem called geocoder to get the actual distance and all of that.

However, the Ruby implementation is not as important as how this query should be written to be as optimal as possible.

It's not running fast.

Thanks

Edit: Here's the query plan

[
  {
    "Plan": {
      "Startup Cost": 80238.35,
      "Plans": [
        {
          "Startup Cost": 80238.35,
          "Plans": [
            {
              "Startup Cost": 80238.24,
              "Plans": [
                {
                  "Startup Cost": 80238.24,
                  "Plans": [
                    {
                      "Startup Cost": 80238.24,
                      "Plans": [
                        {
                          "Filter": "(((NOT archived) OR (archived IS NULL)) AND (NOT is_suspended) AND (image_urls IS NOT NULL) AND (latitude >= (-90.151773910848)::double precision) AND (latitude <= 199.311792310848::double precision) AND (longitude >= (-255.665358277586)::double precision) AND (longitude <= 243.783878277586::double precision) AND (NOT (hashed SubPlan 1)) AND (NOT (hashed SubPlan 2)) AND ((lower((gender)::text) = 'f'::text) OR (lower((gender)::text) = 'female'::text)) AND ((7917.511728464::double precision * asin(sqrt((power(sin(((((54.5800092::double precision - latitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision) + ((0.579565539469435::double precision * cos(((latitude * 3.14159265358979::double precision) / 180::double precision))) * power(sin((((((-44.9774)::double precision - longitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision)))))) >= 0::double precision) AND ((7917.511728464::double precision * asin(sqrt((power(sin(((((54.5800092::double precision - latitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision) + ((0.579565539469435::double precision * cos(((latitude * 3.14159265358979::double precision) / 180::double precision))) * power(sin((((((-44.94074)::double precision - longitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision)))))) <= 10000::double precision))",
                          "Startup Cost": 1722.06,
                          "Plans": [
                            {
                              "Startup Cost": 0.11,
                              "Scan Direction": "Forward",
                              "Plan Width": 4,
                              "Node Type": "Index Scan",
                              "Index Cond": "(user_id = 231415)",
                              "Plan Rows": 1955,
                              "Relation Name": "meets",
                              "Alias": "meets",
                              "Parent Relationship": "SubPlan",
                              "Total Cost": 1713.44,
                              "Subplan Name": "SubPlan 2",
                              "Index Name": "index_meets_on_user_id"
                            }
                          ],
                          "Node Type": "Seq Scan",
                          "Plan Rows": 4,
                          "Relation Name": "users",
                          "Alias": "users",
                          "Parent Relationship": "Outer",
                          "Plan Width": 28,
                          "Total Cost": 80238.23
                        }
                      ],
                      "Sort Key": [
                        "(CASE WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '1 day'::interval day) THEN 1 WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '2 days'::interval day) THEN 2 WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '5 days'::interval day) THEN 3 WHEN ((now() - (users.last_active_at)::timestamp with time zone) < '10 days'::interval day) THEN 4 ELSE 5 END)",
                        "((7917.511728464::double precision * asin(sqrt((power(sin(((((54.5800092::double precision - users.latitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision) + ((0.579565539469435::double precision * cos(((users.latitude * 3.14159265358979::double precision) / 180::double precision))) * power(sin((((((-5.94074)::double precision - users.longitude) * 3.14159265358979::double precision) / 180::double precision) / 2::double precision)), 2::double precision)))))))"
                      ],
                      "Plan Rows": 4,
                      "Node Type": "Sort",
                      "Parent Relationship": "Outer",
                      "Plan Width": 28,
                      "Total Cost": 80238.24
                    }
                  ],
                  "Plan Rows": 4,
                  "Node Type": "WindowAgg",
                  "Parent Relationship": "Subquery",
                  "Plan Width": 28,
                  "Total Cost": 80238.33
                }
              ],
              "Node Type": "Subquery Scan",
              "Plan Rows": 1,
              "Filter": "(xx.bucket_interval <= 10)",
              "Alias": "xx",
              "Parent Relationship": "Outer",
              "Plan Width": 12,
              "Total Cost": 80238.35
            }
          ],
          "Sort Key": [
            "xx.distance"
          ],
          "Plan Rows": 1,
          "Node Type": "Sort",
          "Parent Relationship": "Outer",
          "Plan Width": 12,
          "Total Cost": 80238.35
        }
      ],
      "Plan Rows": 1,
      "Node Type": "Limit",
      "Plan Width": 12,
      "Total Cost": 80238.35
    }
  }
]
Walker
  • Are the distances short enough to approximate the distances using Pythagoras' theorem? The performance will be a lot better. – Greg Viers Jan 31 '18 at 20:23
  • Instead of NOT IN, in many SQL implementations, implementing a left join and using IS NULL is a much faster idea. – theMayer Jan 31 '18 at 20:24
  • @theMayer can you tell me more about this? Do you have experience dealing with very large datasets? The swipes table has over 125 million records. I read an article earlier saying the complete opposite which is why I went with NOT IN. Would love to hear your ideas on this – Walker Jan 31 '18 at 20:29
  • Let's put it this way - 125 M is approaching what I would consider "large" :) – theMayer Jan 31 '18 at 20:30
  • @theMayer ooooooooo hahaha – Walker Jan 31 '18 at 20:31
  • You will also need to configure indices as appropriate. Personally, I'd dump the relational db altogether and set up an implementation in something more appropriate, say [couchbase](https://www.couchbase.com/). – theMayer Jan 31 '18 at 20:31
  • It's almost always better when dealing with big data to split up your query jobs. You pull a rough set from the underlying big data store, then do fine-grained processing using, for example, linq in c# prior to feeding to the client. – theMayer Jan 31 '18 at 20:32
  • Can you display the query plan? I have no experience with postgres, but in Oracle and SQL Server you can do this fairly easily. – theMayer Jan 31 '18 at 20:36
  • If cardinality of swipes is >> that of users, ensure that your plan works out users first and then goes into swipes. Also `and not exists (select 1 from swipes where ...)` may be better if you have many swipes per `(user_id,connection_id)`. Also, I assume you have all the indexes? – giorgiga Jan 31 '18 at 20:49
  • @theMayer added to the main post. Thank you – Walker Jan 31 '18 at 20:50
  • @Walker pls use `explain` for the query plan: https://www.postgresql.org/docs/current/static/sql-explain.html pictures are useless :) – giorgiga Jan 31 '18 at 20:53
  • @giorgiga going to try the not exists. Updated the plan. my bad – Walker Jan 31 '18 at 20:59
  • @Walker is "swipes" a view over one "meets" table? your plan seems to scan "meets" as the first thing, which is not what you want if that table is way larger than "users" – giorgiga Jan 31 '18 at 21:13
  • @giorgiga swipes and meets are interchangeable. Same table. That table has over 125 million records. By removing stuff around and testing with the query plans. It seems like what's costing the most is the `case when` with the `last_active_at`, also the distance filtering in the where clause and then the distance ordering at the end. Going to Swipes/Meets first could be clouding the results I got? As in fixing not going to meets first could show different results – Walker Jan 31 '18 at 21:16
  • The best way to troubleshoot is to remove pieces and see what improves when you do. You may be surprised by the results. – theMayer Jan 31 '18 at 21:24
  • @Walker IDK... how big is users? the `case` shouldn't be that intensive at all and the orthodromic distance... well... there is a bunch of trig there. How is performance if you comment out the filter on swipes/meets and only look at users? – giorgiga Jan 31 '18 at 21:26
  • And your plan shows that that "NOT IN" is performing a table scan - on a 1.5M row table, that's gonna take awhile. I guarantee you that is the main problem. The complex math elsewhere should be relatively quick as it can be done in parallel. – theMayer Jan 31 '18 at 21:29
  • @giorgiga Users is 403k and if I only leave the inner users query.. still finding the distance but this time not filtering based on distance or sorting by distance or doing the `case`... the cost is `1767` and it's pretty damn fast. If I then remove the `NOT IN` the cost is `32` which is close to nothing and it's way faster – Walker Jan 31 '18 at 21:32
  • @Walker the cost in the query plan is an estimate not to be taken as too literal (ie : it's often quite wrong) - you can use `explain analyze` (which actually executes your query) and look at the time to get a better idea of where the db spends the most time – giorgiga Jan 31 '18 at 21:36
  • @ to force a specific execution plan (eg: handle users first, then go to meets), you can try using `with`: the optimiser isn't really able to optimise across `with` subqueries (at least, it wasn't last time I used this trick) and so you can force it to follow _your_ plan instead. – giorgiga Jan 31 '18 at 21:39
  • @giorgiga It's weird because the most amount of time will be in the `case`. Then when I remove the case... the time to execute doesn't change but then it's the distance filtering that takes long. Then if I remove that it just keeps getting switched around. I'm starting to think instead of optimizing this query which could be flawed. I should think of a better algorithm or way to get it to work. Was reading this to get some ideas https://stackoverflow.com/questions/31668037/user-matching-algorithm – Walker Jan 31 '18 at 21:48
  • This is the reason why I ultimately stopped using RDBMS's. It turns out that I'm better at optimizing my data storage/retrieval than they are, and when you factor this type of stuff in, it takes far less time for me to do it. – theMayer Feb 01 '18 at 15:32
  • @theMayer testing the queries through PGAdmin seemed faster but after pushing it to production (Ruby on Rails App)... it was WAY slower with the left joins. Weird – Walker Feb 01 '18 at 19:52
  • I wonder if you're missing an index (or the query operator is missing an index). It looks like your view is also doing some significant filtering - that will also wreak havoc with the query optimizer. – theMayer Feb 01 '18 at 20:52
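
To make the distance-approximation suggestion from the comments concrete: within a radius of about 100 miles, an equirectangular (Pythagorean) approximation of the great-circle distance is close enough for ranking and avoids most of the per-row trigonometry. A sketch, reusing the hard-coded origin (51.6900092, -8.14594) from the query above; 69.0 is roughly the number of miles per degree of latitude:

SELECT id,
       -- flat-map approximation of the great-circle distance, in miles
       69.0 * SQRT(
           POWER(users.latitude - 51.6900092, 2) +
           POWER((users.longitude - (-8.14594)) * COS(51.6900092 * PI() / 180), 2)
       ) AS approx_distance
FROM "users"
-- bounding box, so the distance only has to be computed for candidate rows
WHERE users.latitude BETWEEN 51.6900092 - (100.0 / 69.0)
                         AND 51.6900092 + (100.0 / 69.0)
  AND users.longitude BETWEEN -8.14594 - (100.0 / (69.0 * COS(51.6900092 * PI() / 180)))
                          AND -8.14594 + (100.0 / (69.0 * COS(51.6900092 * PI() / 180)))

The exact haversine can still be applied afterwards to just the rows that survive the bounding box, if the precise value matters.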

1 Answer


One thing to try is to replace your "NOT IN" statement, as these are known to be extremely slow.

Instead of

FROM "users"
...
AND users.id NOT IN
               (SELECT "swipes"."connection_id"
                FROM "swipes"
                WHERE user_id = currentUserId)

use

FROM "users" u
LEFT JOIN "swipes" s
        ON s.user_id = currentUserId

... 

WHERE   s.user_id IS NULL
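
For comparison, the NOT EXISTS form suggested in the comments expresses the same anti-join; Postgres can plan it efficiently, and unlike NOT IN it is not complicated by NULL semantics. A sketch using the question's table and column names, assuming an index on swipes (user_id, connection_id):

FROM "users" u
...
WHERE NOT EXISTS
        (SELECT 1
         FROM "swipes" s
         WHERE s.user_id = currentUserId
           AND s.connection_id = u.id)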
theMayer
  • https://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/ this article shows that LEFT JOIN / NULL takes 3x longer than a NOT IN – Walker Jan 31 '18 at 20:30
  • I'm not trying to challenge your question. I would just love to understand the reasoning between the two rather than this is better than this. I appreciate your time – Walker Jan 31 '18 at 20:30
  • Give it a try and see what happens. Each server engine is different. The thing you have to understand about an RDBMS is that fundamentally, it's a bunch of code that tries to automatically figure out the fastest way to get the data described by your SELECT. More often than not, a monkey could outperform the RDBMS engine in this regard. – theMayer Jan 31 '18 at 20:34
  • I will say, that article you linked was using MSSQL. My first lesson in query optimization was to replace a NOT IN with a LEFT JOIN/IS NULL for a very large table (timestamped data). If I recall correctly, MS tweaked their query optimizer around SQL 2008 to use a more effecient implementation of NOT IN. That still won't take advantage of indices automatically though. And notice how your plan does not have an anti semi-join in it as the article does. – theMayer Jan 31 '18 at 20:58
  • Looking at the updated plans with the total cost in them. A bit hard for me to understand. Do you have any idea where the optimization efforts should be? By looking at it I do realize that the potential benefits of the left join might be irrelevant because not a lot of cost is there? Correct me if I'm wrong. Thanks – Walker Jan 31 '18 at 21:04
  • I have never found those numbers on a plan to mean anything at all. The proof is in the pudding. When can you test? Also keep in mind - those numbers are what the query optimizer has come up with - if the query optimizer is wrong (almost certainly the case) then the numbers are wrong. – theMayer Jan 31 '18 at 21:19
  • I'm testing as we speak. The `LEFT JOIN` seems to bring down the cost of the operation down with the `EXPLAIN ANALYZE` so I imagine that means something. Also checkout my comment on the other thread talking how much it lowered cost even more – Walker Jan 31 '18 at 21:33
  • I was wondering about the distance. There is probably a more efficient way to do that. See [https://learn.microsoft.com/en-us/sql/relational-databases/spatial/spatial-data-sql-server] for the sql version. not sure about postgres. – theMayer Jan 31 '18 at 21:35
  • https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut01 – theMayer Jan 31 '18 at 21:36
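
For the Postgres side of that suggestion, the built-in earthdistance extension (or PostGIS) can make the radius filter index-backed instead of evaluating the haversine for every row. A rough sketch, not from the original post; the index name is made up and 160934 m is roughly 100 miles:

CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS earthdistance;

-- GiST index so the radius search can start from an index scan
CREATE INDEX users_earth_idx ON users USING gist (ll_to_earth(latitude, longitude));

SELECT id,
       earth_distance(ll_to_earth(51.6900092, -8.14594),
                      ll_to_earth(latitude, longitude)) / 1609.344 AS distance_miles
FROM users
WHERE earth_box(ll_to_earth(51.6900092, -8.14594), 160934) @> ll_to_earth(latitude, longitude)
  AND earth_distance(ll_to_earth(51.6900092, -8.14594),
                     ll_to_earth(latitude, longitude)) <= 160934;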