1

context

I'm playing around with PostgreSQL's lateral join, specifically to perform a join on a group by/limit.

The query works really well when I'm looking up individual records, but the performance very quickly degrades as we query multiple records. This makes sense given that we have multiple subqueries running individual gather, filter aggregate, sorts.the question is, what Postgres strategy should we look at or how do we refactor the below query in order to make it performant at scale?

Query

We have three main tables with a junction table between two of them:

|Managers| >- |Stores| >- |Store_Products| -< Products

We have all of the historical managers for a given store record, and we have a whole catalogue of products for stores (products may be carried by multiple stores).

Goal:given a Store ID, query the most recent Manager and the most recent Product sold.

It's an inner join from Store to Manager and to Product. Manager & Product must be sorted by date desc and limited to 1 (at least I believe that is the way to get the most recent).

SELECT 
    store.id as store_id,
    manager.id as manager_id,
    *
FROM 
    Stores as store,
    LATERAL (
        SELECT 
            * 
        FROM 
            Products as product 
        INNER JOIN Stores_Products store_product on store_product.product_id = product.id
        WHERE 
            store_product.store_id = store.id
        ORDER BY 
            store.date desc
        LIMIT 1
    ) p,
    LATERAL (
        SELECT 
            * 
        FROM 
            Managers as manager
        WHERE 
            manager.store_id = store.id 
        ORDER BY
            manager.date desc
        LIMIT 1
        ) m
WHERE store.name = 'ABC retail'

This works perfectly when you query a single store. However, if you try to batch query (e.g. WHERE store.name in [...]), the query gets very slow and memory consuming very quickly.

Question

Is there a better way to query the data that would scale well?

Thank you!

Note: the example given with stores/products is simply a device to communicate the problem. The actual schema is different - so I'd ask not to not put too much thought into whether this is the best way to normalize the schema! Thank you !

ZAR
  • 2,550
  • 4
  • 36
  • 66
  • For the record, psql is a command line client for PostgreSQL. The question, however, is not about psql, it is about PostgreSQL itself. – Dmitry Dec 06 '19 at 00:53
  • Huh... I always assumed psql (in addition to being the CLI tool) was just shorthand for postgresql... – ZAR Dec 06 '19 at 00:54
  • Well, despite that it is used as a shorthand, it clearly makes confusion, because psql is a separate utility with its own (quite extensive!) list of commands. – Dmitry Dec 06 '19 at 01:04
  • The name is either Postgres, PostgreSQL or if you want a short name "pg". I removed the confusing references to psql –  Dec 06 '19 at 06:58
  • Why the lateral join to begin with? If you want to get the result for multiple stores, I would assume a plain inner join with a WHERE condition on the stores would be more efficient –  Dec 06 '19 at 07:01
  • @a_horse_with_no_name the reason is because I want to have more control over which result from Product and Manager are returned. Perhaps I am wrong, but I couldn't figure out how a simple inner join+where would allow me to order->limit 1? I guess the WHERE could be a sub query, i.e. on manager.store_id WHERE manager.id in (SELECT...), but I'm not sure if this would help performance? Have I understood your suggestion? – ZAR Dec 06 '19 at 12:24
  • @ZAR what you actually want is the argmin of the most recent date for each store :) It is possible to achieve without lateral joins, window functions and all that fancy stuff with simple inner joins and group by. Admittedly, fancy stuff may make it more readable. – Dmitry Dec 06 '19 at 13:03
  • @ZAR as a side question, what if you have two most recent managers or two most recent products sold for some store? I mean, with exactly the same date. Do you want to return just any of them or do you want to return all? – Dmitry Dec 06 '19 at 13:05
  • @Dmitry great question - let's assume we want to (have the ability to) sort on a second column as well in these cases. – ZAR Dec 06 '19 at 13:07
  • Shouldn't the first query order by store_product.date? Ordering by store.date doesn't have much sense to me. – jjanes Dec 06 '19 at 21:22

1 Answers1

4

Perhaps window functions will work faster. In the code below product ordering attribute is left as ... because in your snippet they seem to be ordered by store.date which looks wrong (it is the property of the store, not of the product and not of the product sold in the store).

SELECT * FROM 
-- Let's rank managers within each store, giving rank=1 to the most recent
(
  SELECT id, 
         store_id, 
         RANK() OVER (PARTITION BY store_id ORDER BY date DESC) AS mgr_rank
  FROM Manager
) AS MgrRank 

JOIN

-- Let's rank products within each store, giving rank=1 to the most recent
(
  SELECT store_id,
         Products.*
         RANK() OVER (PARTITION BY store_id ORDER BY .... DESC) AS product_rank
  FROM Stores_Products JOIN Products ON product_id = Products.id
) AS ProductRank
USING(store_id) 

-- Now let's join stores themselves
JOIN Stores ON store_id = Stores.id

-- Select most recent manager and product
WHERE mgr_rank=1 AND product_rank=1 AND Stores.name='ABC retail'

Keep in mind that this particular query will not output stores without managers or products. You need to use outer joins to include them as well.

Dmitry
  • 3,740
  • 15
  • 17
  • Thank you! Let me try this out at scale and see what the performance is like. Will comment/mark then. Thank you @Dmitry – ZAR Dec 06 '19 at 00:56
  • btw, the question which should've been asked first. Do you have indexes on store_id columns? If there are no indexes, try adding them first. – Dmitry Dec 06 '19 at 01:14
  • 1
    your query was much, much faster at large scale. You have my check and upvote. Thank you! – ZAR Dec 06 '19 at 19:06