2

I’m trying to write the logic for a Query, that will allow me to classify users activities:

• The problem is a table that contain all users activities in slots of ~5 min (not all are exactly 5 min, some 3 min, others 4 min) and record the amount of time that each user spent in a certain status.

• The user usually jump between status across the day.

The issue: If a user spent more than 3 hours (180 min) continuously without a change in their status it must be reported as: “Unclassified”

Current view of the table that I’m working:

user_id record_date user_status
user1   9/3/2017 14:25  status_1
user1   9/3/2017 14:30  status_3
user1   9/3/2017 14:35  status_3
user1   9/3/2017 14:40  status_2
user1   9/3/2017 14:45  status_2
user1   9/3/2017 14:50  status_2
user1   9/3/2017 14:55  status_2
user1   9/3/2017 15:00  status_2
user1   9/3/2017 15:05  status_2
user1   9/3/2017 15:10  status_2
user1   9/3/2017 15:15  status_2
user1   9/3/2017 15:20  status_2
user1   9/3/2017 15:25  status_2
user1   9/3/2017 15:30  status_2
user1   9/3/2017 15:30  status_2
user1   9/3/2017 15:35  status_2
user1   9/3/2017 15:40  status_2
user1   9/3/2017 15:43  status_3
user1   9/3/2017 15:45  status_3
user1   9/3/2017 15:50  status_2
user1   9/3/2017 15:50  status_2
user1   9/3/2017 15:55  status_2
user1   9/3/2017 16:00  status_2
user1   9/3/2017 16:00  status_2
user1   9/3/2017 16:04  status_2

I started testing the following logic, but once I discovered that each slot is not exactly 5 min I was not able to continue.

SELECT user_id ,record_date
            ,CASE 
                  WHEN SUM(status_1) OVER (
                              PARTITION BY user_id ORDER BY record_date ASC ROWS BETWEEN 35 PRECEDING
                                          AND current row
                              ) >= 180
                        THEN 1
                  ELSE 0
                  END AS unclassified_flag

--2
  ,CASE 
                  WHEN SUM(status_2) OVER (
                              PARTITION BY user_id ORDER BY record_date ASC ROWS BETWEEN 35 PRECEDING
                                          AND current row
                                  ) >= 180
                            THEN 1
                      ELSE 0
                      END AS unclassified_flag
FROM table

Any ideas of alternatives logic are truly appreciated

Michelle
  • 202
  • 2
  • 14
  • Just updated the corrected tags. – Michelle Oct 25 '17 at 17:27
  • Postgres is very different to Redshift. Are you really using both? –  Oct 25 '17 at 18:11
  • Is not that different: Amazon Redshift is based on PostgreSQL 8.0.2 – Michelle Oct 25 '17 at 18:29
  • They are very different: http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-features.html "*Do not assume that the semantics of elements that Amazon Redshift and PostgreSQL have in common are identical*" –  Oct 25 '17 at 18:32
  • What is the time period that you are reviewing over? a day? a week? forever? – Jon Scott Oct 25 '17 at 22:02
  • Forever... from 2015 up to date – Michelle Oct 25 '17 at 22:56
  • is the join too slow? afaik with window functions you can't set the threshold for a window definition (unless you add a ton of LAG functions to check every row of the previous ton rows), the threshold is essentially the join condition – AlexYes Oct 28 '17 at 12:43
  • Sample data and output will be helpful as the description is not very clear. – Fact Mar 03 '19 at 23:21

1 Answers1

0

Check out the window function LAG() https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LAG.html

You can add the previous timestamp for each row, allowing you then to simply get the interval time and group by status. The only downside is it'll include times when the user was inactive, so you might want to throw out large things.

eg

with status_intervals as (
SELECT
  user_id, status, record_date
  ,lag(record_date) OVER (PARTITION BY user_id ORDER BY record_date) as last_date
FROM
  table
)
SELECT
  user_id, status
  ,sum(datediff(second, last_date, record_date)) as total_time_in_status
FROM
   status_intervals
WHERE
   datediff(second, last_date, record_date) < 900 --arbitrarily deciding 15min is likely inactive
GROUP BY
   user_id, status
ScottieB
  • 3,958
  • 6
  • 42
  • 60