
Sample data

CREATE TEMP TABLE a AS
SELECT id, adate::date, name
FROM ( VALUES 
  (1,'1/1/1900','test'),
  (1,'3/1/1900','testing'),
  (1,'4/1/1900','testinganother'),
  (1,'6/1/1900','superbtest'),
  (2,'1/1/1900','thebesttest'),
  (2,'3/1/1900','suchtest'),
  (2,'4/1/1900','test2'),
  (2,'6/1/1900','test3'),
  (2,'7/1/1900','test4')
) AS t(id,adate,name);

CREATE TEMP TABLE b AS
SELECT id, bdate::date, score
FROM ( VALUES
  (1,'12/31/1899', 7 ),
  (1,'4/1/1900'  , 45), 
  (2,'12/31/1899', 19), 
  (2,'5/1/1900'  , 29), 
  (2,'8/1/1900'  , 14)
) AS t(id,bdate,score);

What I want

What I need to do is aggregate the text column from table a where the id matches table b and the date from table a falls between the two closest dates in table b. For example, for id 1 the rows dated 1/1/1900 and 3/1/1900 fall between b's 12/31/1899 and 4/1/1900 entries, so they aggregate to test, testing. Desired output:

id  date    score   textagg
1   12/31/1899  7   test, testing
1   4/1/1900    45  testinganother, superbtest
2   12/31/1899  19  thebesttest, suchtest, test2
2   5/1/1900    29  test3, test4
2   8/1/1900    14  

My thoughts are to do something like this:

create table date_join as
select a.id, string_agg(a.text, ','), b.*
from tablea a
left join tableb b
on a.id = b.id
having a.date between b.date and b.date;

but I am really struggling with the last line, figuring out how to aggregate only where the date in table a is between the closest two dates in table b. Any guidance is much appreciated.

Evan Carroll
James Steele
  • The schema with repeating values in `id` doesn't make much sense. Having said that, check out the `tsrange` datatype and see if you can leverage it for this purpose – mike.k Jan 13 '17 at 04:30
  • Your example is broken, `superbtest` is not *between* matching dates in table `b`. Also table definitions and Postgres version are missing. – Erwin Brandstetter Jan 13 '17 at 04:53
  • Note that the result you want is not in first normal form. Isn't it better for you to have something like (id, date, score, text) with 9 rows of output? – Fabian Pijcke Jan 13 '17 at 06:43
  • @ErwinBrandstetter you are correct, one thing that I forgot to specify is that for the last date, I want all the text before it. – James Steele Jan 13 '17 at 13:42
  • @FabianPijcke normally yes, but in this case I am going to export the table for a different purpose. – James Steele Jan 13 '17 at 13:43

1 Answer


I can't promise it's the best way to do it, but this is a way to do it.

with b_values as (
  select
    id, bdate as from_date, score,
    -- each score's window runs from its bdate up to the day before the next
    -- bdate for the same id; the last window gets a far-future sentinel date
    lead(bdate, 1, '3000-01-01')
      over (partition by id order by bdate) - 1 as thru_date
  from b
)
select
  bv.id, bv.from_date, bv.score,
  string_agg(a.name, ',')
from
  b_values as bv
  left join a on
    a.id = bv.id and
    a.adate between bv.from_date and bv.thru_date
group by
  bv.id, bv.from_date, bv.score
order by
  bv.id, bv.from_date;

I'm presupposing you will never have a date in your table greater than 12/31/2999, so if you're still running this query after that date, please accept my apologies.

Here is the output I got when I ran this:

id  from_date   score   string_agg
1   0           7       test,testing
1   92          45      testinganother,superbtest
2   0           19      thebesttest,suchtest,test2
2   122         29      test3,test4
2   214         14  

I might also note that `between` in a join is a performance killer. If you have large data volumes, there might be better ideas on how to approach this, but that depends largely on what your actual data looks like.
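If the `between` join does become a problem, one direction the comments point to is precalculating each score's window as a range. Here is a minimal sketch of that idea against the sample tables above (so the column names bdate, adate and name are assumed), using daterange, the date analogue of the `tsrange` type mentioned in the comments, with the `@>` containment operator in place of `between`. Per the comments it is not necessarily any faster, but it makes each window explicit, and 'infinity' avoids the arbitrary 3000-01-01 sentinel.

with b_ranges as (
  select
    id, bdate as from_date, score,
    -- half-open window [bdate, next bdate); 'infinity' covers the last window
    daterange(
      bdate,
      lead(bdate, 1, 'infinity'::date) over (partition by id order by bdate)
    ) as valid_range
  from b
)
select
  br.id, br.from_date, br.score,
  string_agg(a.name, ',') as textagg
from
  b_ranges as br
  left join a on
    a.id = br.id and
    br.valid_range @> a.adate   -- range containment instead of between
group by
  br.id, br.from_date, br.score
order by
  br.id, br.from_date;

The half-open [from_date, next bdate) range plays the same role as thru_date = next date - 1 above, just without the off-by-one arithmetic.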

Hambone
  • You can reduce the chances of having to apologize by using lag() with a date like 1800-01-01. Of course you're not immune to the invention of time travel, but then your query should not exist yet, unless it is used to run the time travel machine (then you're in trouble, man!) – Fabian Pijcke Jan 13 '17 at 06:56
  • I've added sample data to the question that may or may not work with your data. It would be great in the future though if you would update the question itself with DDL you used to get the environment up and working. – Evan Carroll Jan 13 '17 at 09:11
  • This is generally pretty good, you can use `tsrange` if you want to precalculate the range and find when dates sit inside. That's a slightly different syntax. However, not likely faster. You can also put 'infinity'::date in the top part, just to get rid of the bizarre 3000-01-01 lol. – Evan Carroll Jan 13 '17 at 09:55
  • I completely forgot about infinity... that's a great option. – Hambone Jan 13 '17 at 13:03
  • @Hambone I will try this when I have access to my machine again later this evening. For reference, table a has ~318k rows, and table b has 66k rows. This is a one time export so depending on the performance I may not have to seek an alternative to the `between`. – James Steele Jan 13 '17 at 13:55