0

For example, I have 2 tables like this

![enter image description here

For every row in table 1, I want to get the row with

  • same customer id and

  • nearest date (in my case, table2.date < table1.date)

The result should look like this

enter image description here

How can I do this in SQL? I tried to search, but didn't find many things related. and what if I want to change the condition to <=, >= or >?

Thanks!

Note:

  • please using standard SQL, as I'm not using PostgreSQL
  • If I can use Python, I might simply get away with a for-loop.
ZK Zhao
  • 19,885
  • 47
  • 132
  • 206

2 Answers2

3

You can use a lateral join for this:

select t1.*, t2.*  -- choose the columns you want
from table1 t1 left join lateral
     (select t2.
      from table2 t2
      where t2.customer_id = t1.customer_id and
            t2.date < t1.date  -- do you really mean <= ?
      order by t2.date desc
      limit 1
     ) t2;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Sadly, I'm using `presto`, it seems it does not support lateral join, is there any other way? – ZK Zhao Oct 21 '19 at 14:39
  • 1
    @cqcn1991 . . . When I look at the documentation it does: https://prestosql.io/docs/current/sql/select.html. – Gordon Linoff Oct 21 '19 at 16:57
  • @cqcn1991 Presto supports lateral join for long time, but recently we did some improvements in this area, so I recommend trying out Presto 322 when more complex lateral joins are in play. – Piotr Findeisen Oct 21 '19 at 17:47
  • @PiotrFindeisen . . . Is that a performance statement? Or would this query not work in recent (but older) versions of Presto? – Gordon Linoff Oct 21 '19 at 19:24
  • @GordonLinoff it's about whether a query can be decorrelated (Presto executes `LATERAL JOIN` by decorrelating into equivalent `JOIN`). TopN decorrelation was implemented only this year. Incidentally, just as the `FETCH FIRST` syntax you used (Presto used to support non-ANSI `LIMIT` only). If you want to stay current, join the https://prestosql.io/slack.html and watch for the release announcements. See you there! – Piotr Findeisen Oct 21 '19 at 20:14
  • @GordonLinoff but it's not prestodb. They're different. – ZK Zhao Oct 22 '19 at 00:56
3

I agree with Gordon - lateral join is best solution.

If lateral join is not supported, window function-based solution may come in handy.

with table1 (order_id, customer_id, date) as (
  select 1, 1, date '2019-10-10' union
  select 2, 1, date '2019-10-11' union
  select 3, 2, date '2019-10-11' union
  select 4, 2, date '2019-10-12' union
  select 5, 3, date '2019-10-12'
), table2 (order_id, customer_id, date) as (
  select  8, 1, date '2019-10-08' union
  select  9, 1, date '2019-10-09' union
  select 10, 1, date '2019-10-10' union
  select 11, 2, date '2019-10-10' union
  select 11, 2, date '2019-10-10' union
  select 11, 2, date '2019-10-10'
), all_rows as (
  select t1.*, t2.*, row_number() over (partition by t1.order_id order by t2.date desc) rn
  from table1 t1
  left join table2 t2 on t1.customer_id = t2.customer_id and t2.date < t1.date
)
select * from all_rows where rn = 1
Tomáš Záluský
  • 10,735
  • 2
  • 36
  • 64
  • An additional question, is it possible to add rn=1 to the previous query? feels a bit redundant when using it seperately – ZK Zhao Oct 29 '19 at 11:39
  • To which query? `rn` is computed based on your requirements in `all_rows` CTE. It is redundant in the result since it's always 1, so if it minds, you have to explicitly list columns from `all_rows` instead of `*` (some dbs also support `select * except` construct) – Tomáš Záluský Oct 29 '19 at 16:57