1

I have a dataset for which I have to conditionally count rows from table B that are between two dates in table A. I have to do this without the use of a correlated subquery in the SELECT clause, as this is not supported in Netezza - docs: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.0.3/com.ibm.nz.dbu.doc/c_dbuser_correlated_subqueries_ntz_sql.html.

Background on tables: Users can log in to a site (logins). When they log in, they can take actions, which are in (actions_taken). The desired output is a count of rows that are between the actions_taken action_date and lag_action_date.

Data and attempt found here: http://rextester.com/NLDH13254

Table: actions_taken (with added calculations - see RexTester.)

| user_id | action_type   | action_date | lag_action_date | elapsed_days |
|---------|---------------|-------------|-----------------|--------------|
| 12345   | action_type_1 | 6/27/2017   | 3/3/2017        | 116          |
| 12345   | action_type_1 | 3/3/2017    | 2/28/2017       | 3            |
| 12345   | action_type_1 | 2/28/2017   | NULL            | NULL         |
| 12345   | action_type_2 | 3/6/2017    | 3/3/2017        | 3            |
| 12345   | action_type_2 | 3/3/2017    | 3/25/2016       | 343          |
| 12345   | action_type_2 | 3/25/2016   | NULL            | NULL         |
| 12345   | action_type_4 | 3/6/2017    | 3/3/2017        | 3            |
| 12345   | action_type_4 | 3/3/2017    | NULL            | NULL         |
| 99887   | action_type_1 | 4/1/2017    | 2/11/2017       | 49           |
| 99887   | action_type_1 | 2/11/2017   | 1/28/2017       | 14           |
| 99887   | action_type_1 | 1/28/2017   | NULL            | NULL         |

Table: logins

| user_id | login_date |
|---------|------------|
| 12345   | 6/27/2017  |
| 12345   | 6/26/2017  |
| 12345   | 3/7/2017   |
| 12345   | 3/6/2017   |
| 12345   | 3/3/2017   |
| 12345   | 3/2/2017   |
| 12345   | 3/1/2017   |
| 12345   | 2/28/2017  |
| 12345   | 2/27/2017  |
| 12345   | 2/25/2017  |
| 12345   | 3/25/2016  |
| 12345   | 3/23/2016  |
| 12345   | 3/20/2016  |
| 99887   | 6/27/2017  |
| 99887   | 6/26/2017  |
| 99887   | 6/24/2017  |
| 99887   | 4/2/2017   |
| 99887   | 4/1/2017   |
| 99887   | 3/30/2017  |
| 99887   | 3/8/2017   |
| 99887   | 3/6/2017   |
| 99887   | 3/3/2017   |
| 99887   | 3/2/2017   |
| 99887   | 2/28/2017  |
| 99887   | 2/11/2017  |
| 99887   | 1/28/2017  |
| 99887   | 1/26/2017  |
| 99887   | 5/28/2016  |

DESIRED OUTPUT: cnt_logins_between_action_dates field

| user_id | action_type   | action_date | lag_action_date | elapsed_days | cnt_logins_between_action_dates |
|---------|---------------|-------------|-----------------|--------------|---------------------------------|
| 12345   | action_type_1 | 6/27/2017   | 3/3/2017        | 116          | 5                               |
| 12345   | action_type_1 | 3/3/2017    | 2/28/2017       | 3            | 4                               |
| 12345   | action_type_1 | 2/28/2017   | NULL            | NULL         | 1                               |
| 12345   | action_type_2 | 3/6/2017    | 3/3/2017        | 3            | 2                               |
| 12345   | action_type_2 | 3/3/2017    | 3/25/2016       | 343          | 7                               |
| 12345   | action_type_2 | 3/25/2016   | NULL            | NULL         | 1                               |
| 12345   | action_type_4 | 3/6/2017    | 3/3/2017        | 3            | 2                               |
| 12345   | action_type_4 | 3/3/2017    | NULL            | NULL         | 1                               |
| 99887   | action_type_1 | 4/1/2017    | 2/11/2017       | 49           | 8                               |
| 99887   | action_type_1 | 2/11/2017   | 1/28/2017       | 14           | 2                               |
| 99887   | action_type_1 | 1/28/2017   | NULL            | NULL         | 1                               |
psrpsrpsr
  • 457
  • 1
  • 4
  • 12
  • Why did you add the `postgresql` tag if you are using Netezza? –  Jun 29 '17 at 16:26
  • I understood PostgreSQL to be the supported language that Netezza interprets. I thought Netezza was the data warehouse hardware appliance through which the supported SQL syntax is interpreted. No? Clarity is much appreciated here, as I a simple data analyst. Shall I remove the tag? – psrpsrpsr Jun 29 '17 at 16:30
  • No, not at all. Those are two very different database systems. Both use SQL as their query language - maybe that's what you were thinking about –  Jun 29 '17 at 16:32
  • @a_horse_with_no_name . . . Netezza is built off of Postgres's code base, as are many databases (https://wiki.postgresql.org/wiki/PostgreSQL_derived_databases). The Netezza code base is very old, so it is not so related to the more recent versions of Postgres. – Gordon Linoff Jun 29 '17 at 16:45
  • Thank you for the input folks. Any thoughts on how to achieve the desired output? Thank you – psrpsrpsr Jun 29 '17 at 18:48

1 Answers1

2

You don't need a correlated sub-query. Get the previous date using lag and join the logins table to count the actions between dates.

with prev_dates as (select at.*
                    ,coalesce(lag(action_date) over(partition by user_id,action_type order by action_date)
                              ,action_date) as lag_action_date 
                    from actions_taken at
                   )
select at.user_id,at.action_type,at.action_date,at.lag_action_date
,at.action_date-at.lag_action_date as elapsed_days
,count(*) as cnt
from prev_dates at
join login l on l.user_id=at.user_id and l.login_date<=at.action_date and l.login_date>=at.lag_action_date
group by at.user_id,at.action_type,at.action_date,at.lag_action_date
order by 1,2,3
Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58
  • Brilliant! So the process is: 1.) create a CTE that coalesces null lag values with the action date 2.) inner join tbl A to tbl B, so that desired rows are now in a single table 3.) count(*) and group. Much appreciated! – psrpsrpsr Jun 29 '17 at 20:00