
I have a dataset containing 'UI' (a unique id), time, and frequency (the frequency of the given value in the UI column).

I would like to add a new column named 'daily_frequency' that sequentially counts each occurrence of a unique UI value within a given day.

For example, if UI=114737 appears twice in one day, the daily_frequency column should contain 1 and 2 for those two rows.

I could do this with Python and the Pandas package, using the groupby and cumcount methods as follows:

df['daily_frequency'] = df.groupby(['UI','day']).cumcount()+1
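For reference, a minimal runnable sketch of that Pandas one-liner on toy data (the UI values and timestamps here are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout
df = pd.DataFrame({
    "UI":   [114737, 114737, 114737, 500291, 500291],
    "time": pd.to_datetime([
        "2019-07-01 08:00", "2019-07-01 12:30", "2019-07-02 09:15",
        "2019-07-01 10:00", "2019-07-01 11:00",
    ]),
})
df["day"] = df["time"].dt.date

# cumcount() numbers rows within each (UI, day) group starting at 0,
# so +1 gives the 1-based daily counter
df["daily_frequency"] = df.groupby(["UI", "day"]).cumcount() + 1
print(df)
```

Here UI 114737 gets 1 and 2 on 2019-07-01, then restarts at 1 on 2019-07-02.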

However, for some reason, I must do this via SQL queries (Amazon Redshift).

Amin Kiany

2 Answers


I think you want a running count, which could be calculated as:

COUNT(*) OVER (PARTITION BY ui, TRUNC(time) ORDER BY time
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS daily_frequency
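The same windowed running count can be tried outside Redshift. A sketch using Python's built-in sqlite3 (SQLite 3.25+ supports window functions; `date(time)` stands in for Redshift's `TRUNC(time)`, and the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (ui INTEGER, time TEXT);
INSERT INTO events VALUES
  (114737, '2019-07-01 08:00:00'),
  (114737, '2019-07-01 12:30:00'),
  (114737, '2019-07-02 09:15:00'),
  (500291, '2019-07-01 10:00:00');
""")

# Running count per (ui, day), ordered by time within the day
rows = conn.execute("""
SELECT ui, time,
       COUNT(*) OVER (PARTITION BY ui, date(time) ORDER BY time
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS daily_frequency
FROM events
ORDER BY ui, time
""").fetchall()

for r in rows:
    print(r)
```

The counter resets both when the day changes and when the ui changes.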
Salman A
  • Thanks Salman for quick reply. I have tried what you suggested, however, I got the following error message `[Amazon](500310) Invalid operation: Aggregate window functions with an ORDER BY clause require a frame clause;` ... – Amin Kiany Jul 05 '19 at 12:49
  • @AminKiany . . . That is something of a peculiarity of Redshift. – Gordon Linoff Jul 05 '19 at 12:50
  • @SalmanA: Thanks Salman for the answer. I also edited my post. – Amin Kiany Jul 05 '19 at 13:01

Although Salman's answer seems to be correct, I think ROW_NUMBER() is simpler (and, as a ranking function, it does not need the frame clause that Redshift requires for aggregate window functions):

ROW_NUMBER() OVER (PARTITION BY ui, time::date
                   ORDER BY time
                  ) AS daily_frequency
Gordon Linoff