2

I have a table "days_users" with dates and users (one row per date and user), from where I could extract the information of which particular days a user was seen, or which users were seen on a particular day:

2023-01-01,user1
2023-01-01,user2
2023-01-01,user3
2023-01-02,user2
2023-01-02,user4
2023-01-03,user1
2023-01-03,user4

I need to compute how many new users appear/disappear each day, and I don't know how to do it. A user "appear" on one day means that the user is seen on that day, but was never seen before, and a user "disappear" on one day means that the user was seen on that day but was never seen after that date.

I thought that a way to start is to create a view of users, first_date_seen, last_date_seen as follows:

user_first_last AS (
SELECT user, min(date) AS first_date_seen, max(date) AS last_date_seen FROM days_users
GROUP BY 1
)

And then count all the users that appear from a particular date until the end

SELECT date, COUNT(DISTINCT user) as num_appearing_users 
FROM user_first_last WHERE first_date_seen = {date} AND last_date_seen = '2023-03-01' 
GROUP BY 1
ORDER BY 1

And similarly for the disappearing users

SELECT date, COUNT(DISTINCT user) as num_disappearing_users 
FROM user_first_last WHERE first_date_seen = '2023-01-01' AND last_date_seen = {date} 
GROUP BY 1
ORDER BY 1

But please note the {date} between curly braces: I would like this date to be the same as the date in the query, i. e. both dates in bold should be the same:

SELECT 
    **date**, COUNT(DISTINCT user) AS num_disappearing_users 
FROM user_first_last 
WHERE first_date_seen = '2023-01-01' 
  AND last_date_seen = **date** 

How can I achieve this?

GMB
  • 216,147
  • 25
  • 84
  • 135
ElTitoFranki
  • 375
  • 1
  • 7
  • What DBMS is being used? And do you need rows for dates that are not in you table (for example if no date for 2023-01-02 is present in you table, should in be in result dataset?)? – markalex Apr 25 '23 at 16:14
  • I do not know which DBMS is being used, is a query to execute in AWS Athena. If there are dates with no info, it's not necessary to appear in the result – ElTitoFranki Apr 25 '23 at 16:17
  • All dates in the table "days_users" – ElTitoFranki Apr 25 '23 at 16:43

2 Answers2

1

You can do this:

with dates as(
select distinct date from days_users),
usg as (
  select min(date) first_date,
    max(date) last_date,
    user
    from days_users
    GROUP BY user
)
select date,
  (select count(user)
   from usg
   where usg.last_date=dates.date
  ) never_seen_after,
  (select count(user)
   from usg
   where usg.first_date=dates.date
  ) never_seen_before
from dates 

Here we use CTEs:

  • dates - to get list of dates present in data,
  • usg - list of users with their first and last day.
markalex
  • 8,623
  • 2
  • 7
  • 32
1

If I follow you correctly, we can do this with window functions and aggregation:

select date, 
    sum(case when date = min_date then 1 else 0 end) as num_appearing_users,
    sum(case when date = max_date then 1 else 0 end) as num_disappearing_users
from (
    select u.*, 
        min(date) over(partition by user) min_date, 
        max(date) over(partition by user) max_date 
    from day_users u 
    group by user
) u
group by date

On each row, the subquery brings the earliest and latest appearance of the user; then, the outer query aggregates by date, and counts how many users appeared or disappeared on each date.

More than often, a solution that uses window functions is more efficient than alternatives using CTEs, subqueries or self-joins - that is because it typically scans the table only once, as opposed to multiple scans with other solutions. Window functions should almost always be preferred whenever they can solve the problem.

GMB
  • 216,147
  • 25
  • 84
  • 135