0

Here is what I am trying to perform with Presto/AWS Athena. I'd be glad to get your remarks/explanations on the following.

I have a table called activity_table

| member_id | arrays_of_distinct_days_of_activity |

where

  • member_id is VARCHAR and
  • arrays_of_distinct_days_of_activity is an ARRAY of DATE(s)

This table includes some 150 million distinct member_id, and each member_id has it's own unique row.

I am trying to perform:

SELECT member_id,arrays_of_distinct_days_of_activity,dt
FROM activity_table
CROSS JOIN UNNEST(sequence(date '2019-07-01',date '2019-07-15',interval '1' day)) AS T(dt)

but it results in a very long time.

Two questions:

  1. Is it true that the very long calculation time results from Presto/Athena's checking for all combinations of the couple (member_id,arrays_of_distinct_days_of_activity) in order to CROSS JOIN with the UNNEST() array?
  2. If yes, and since each member_id has a row of its own, is there a way to specify to Presto/Athena that the CROSS JOIN should not be performed according to the combinations of (member_id,arrays_of_distinct_days_of_activity), but only by with the column member_id?

Thanks in advance for your answers!

Ronicho
  • 1
  • 1
  • 1
    I don't understand your question. Your CROSS JOIN returns 150,000,000 * 15 rows, i.e. each row once for each day in your sequence. – dnoeth Jul 29 '19 at 20:05
  • yes. Thus I wonder if there is a way to specify to Athena that it does not need to double check this by going through all combinations of of the two columns? which I believe is what takes much time. A means to bypass this would be to cross join with the unnest() and the member_id ONLY and then left join the output table with the activity_table to bring the arrays_of_distinct_days_of_activity column...but it's clearly not elegant and uses two JOINs instead of one. – Ronicho Jul 29 '19 at 21:19
  • But then you join 150,000,000 * 15 rows and 2,250,000,000 * 1 rows for the 2nd join, why should this be faster? – dnoeth Jul 29 '19 at 21:55
  • My assumption is that the time consuming operation is: check the unique combinations between the two columns before CROSS JOIN. Is my hypothesis Wrong? If not, I want to know if I can specify directly (with a command) according to which column the CROSS JOIN should be operated. – Ronicho Jul 30 '19 at 07:41
  • I don't even understand your hypothesis. A Cross Join doesn't care about uniqueness or the number of columns, it simply returns all possible combinations. – dnoeth Jul 30 '19 at 07:47
  • Your remark partially answers me. SO there no such step as: calculating all possible combinations between the two columns before making the cross product. Thanks! – Ronicho Jul 30 '19 at 11:16
  • UNNEST performance was improved in Presto 316 (https://github.com/prestosql/presto/pull/901). While Athena is based on Presto .172, you can easily run Presto on AWS yourself or use Starburst Presto service for AWS: https://www.starburstdata.com/presto-aws-cloud/. (I'm from Starburst.) – Piotr Findeisen Jul 30 '19 at 11:31

0 Answers0