0

I am running a query that returns the count distinct of active users per different products grouped by day and country. I also added two columns 'previous_day_active_users' and 'returning_users' which return the number of users active in the previous day and the number of users active both in the previous and the current day. I want to do this in order to calculate the retention rate (returning_users / previous_day_active_users). The problem is the numbers don't match. In addition, 'previous_day_active_users' and 'returning_users' always match, which doesn't make sense. I don't understand what I am doing wrong.

Query:

WITH active_users AS (
    SELECT
        date_trunc('day'::text, customer_kpis_hourly.time_eet) as time_eet,
        customer_kpis_hourly.market_code,
        customer_kpis_hourly.user_id,
        customer_kpis_hourly.active,
        LAG(customer_kpis_hourly.active) OVER (PARTITION BY user_id ORDER BY time_eet) AS prev_active,
        customer_kpis_hourly.active_casino,
        customer_kpis_hourly.active_virtual_sports,
        customer_kpis_hourly.active_poker,
        customer_kpis_hourly.active_sportsbook
    FROM delivery.customer_kpis_hourly
    WHERE customer_kpis_hourly.time_eet >= '2021-01-01'
    AND customer_kpis_hourly.time_eet < '2021-01-05'
    AND customer_kpis_hourly.active
),
previous_day_active_users AS (
    -- Calculate previous day's active users
    SELECT
        user_id,
        time_eet - INTERVAL '1 day' AS previous_day
    FROM active_users
    WHERE prev_active
)
SELECT
    'D'::text AS time_gran,
    date_trunc('day'::text, active_users.time_eet) AS time_eet,
    active_users.market_code,
    count(DISTINCT active_users.user_id) FILTER (WHERE active_users.active) AS active_user,
    count(DISTINCT active_users.user_id) FILTER (WHERE active_users.active_casino) AS active_casino,
    count(DISTINCT active_users.user_id) FILTER (WHERE active_users.active_virtual_sports) AS active_vs,
    count(DISTINCT active_users.user_id) FILTER (WHERE active_users.active_poker) AS active_poker,
    count(DISTINCT active_users.user_id) FILTER (WHERE active_users.active_sportsbook) AS active_sb,
    count(DISTINCT previous_day_active_users.user_id) as previous_day_active_user,
    count(DISTINCT CASE WHEN previous_day_active_users.user_id IS NOT NULL THEN active_users.user_id END) AS returning_users
FROM
    active_users
LEFT JOIN
    previous_day_active_users
    ON active_users.user_id = previous_day_active_users.user_id
--     AND date_trunc('day'::text, active_users.time_eet) = previous_day_active_users.previous_day
GROUP BY
    'D'::text, date_trunc('day'::text, active_users.time_eet), active_users.market_code
Steven
  • 19
  • 3
  • Don't forget to add a tag for the dbms used. – jarlh Jul 21 '23 at 10:37
  • Please don’t link to images, add all relevant information to your question as editable text – NickW Jul 21 '23 at 11:21
  • Please update your question to provide table definitions (ddl), sample data, and expected results from that data. All as formatted text -**no images**. Additional benefit would be provide a [fiddle](https://dbfiddle.uk/). You need only include relevant columns. (perhaps just 1 of the FILTERed columns). – Belayer Jul 22 '23 at 19:31

1 Answers1

0

When you compute previous_day_active_users, you basically compute time_eet - INTERVAL '1 day' AS previous_day for the user_id given where

LAG(customer_kpis_hourly.active) OVER (PARTITION BY user_id ORDER BY time_eet) AS prev_active

evaluates as true. But this does not seem to actually check whether the user was really working on the previous day. instead, you will need to apply something like this:

SELECT <your fields>
FROM delivery.customer_kpis_hourly curr
LEFT JOIN delivery.customer_kpis_hourly prev
ON curr.user_id = prev.user_id AND
   prev.time_eet >= '2021-01-03' AND
   prev.time_eet < '2021-01-04'
WHERE
   curr.time_eet >= '2021-01-04' AND
   curr.time_eet < '2021-01-05'

That would create you pairs or curr and prev and you can easily convert this into an aggregate query that counts the total number of curr.user_id as the number of active users and the total of not null prev.user_id as the users active on the previous day. Then you can compute their difference or ratio.

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175