
I have a dataframe that looks like this:

Datetime   | Category | ID
--------------------------
2020-01-30 | A        | 1
2020-02-01 | B        | 1
2020-02-02 | A        | 1
2020-02-20 | A        | 1
2020-01-28 | B        | 2
2020-01-29 | C        | 2
2020-01-30 | C        | 2
2020-01-31 | D        | 2
2020-02-01 | D        | 2
2020-02-02 | D        | 2
2020-02-03 | C        | 2

I would like to obtain the top 2 most frequent categories for each ID within a 1-week window of the row (excluding the current row). Is that possible in pandas? I tried to do .rolling and .value_counts but it doesn't seem to work. Thanks!

Below is the dataframe which I would like to get:

Datetime   | Category | ID
--------------------------
2020-01-30 | NaN      | 1
2020-02-01 | [A, ""]  | 1
2020-02-02 | [A, B]   | 1
2020-02-20 | NaN      | 1
2020-01-28 | NaN      | 2
2020-01-29 | [B, ""]  | 2
2020-01-30 | [B, C]   | 2
2020-01-31 | [B, C]   | 2
2020-02-01 | [C, D]   | 2
2020-02-02 | [C, D]   | 2
2020-02-03 | [C, D]   | 2

Thank you!

Edit: The pd.get_dummies answer is great, but as my dataset is huge, it is not efficient. Would really appreciate it if anyone has an efficient solution for this! Thank you!

yxc

1 Answer


You can use resample() rather than rolling(), because your time index frequency is daily and you want weekly stats, so try something like this:

df.groupby('ID').resample('1w').apply(lambda s: s.value_counts().head(2))

Note that this only works in versions of Pandas where apply() passes the data in as a pd.Series, not an np.array. Also, if you have more columns in your dataframe it may be necessary to specify the column name in the lambda function, i.e.:

df.groupby('ID').resample('1w').apply(lambda s: s['Category'].value_counts().head(2))

And if you need to exclude the first row in the window, use iloc[] slicing:

df.groupby('ID').resample('1w').apply(lambda s: s['Category'].iloc[1:].value_counts().head(2))
mac13k
  • thank you so much! But as the frequency of some of my data is not consistent (e.g. ID 1 could have rows of data more than 2 weeks apart), I would not be able to use shift in this case, right? – yxc May 27 '20 at 13:48
  • Shift will work regardless of the inconsistencies in the index - it will just move all rows by 1. However, if you have such big gaps it may cause some issues in the results. – mac13k May 27 '20 at 14:17
  • OK, maybe `shift` was a bad idea. I made a change. – mac13k May 27 '20 at 14:20
  • thanks for the clarifications! I tried to implement this but I am getting the error AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'value_counts' when I try to do the .apply – yxc May 28 '20 at 05:24
  • Which version of Pandas do you have installed? Mine is 1.0.3. You have to upgrade to the version that supports series inside apply(). I wrote that in the answer. – mac13k May 28 '20 at 06:02
  • Yupp, I checked the version. Mine is 1.0.3 as well. – yxc May 28 '20 at 06:22
  • Do you specify the column? Ie. `s['Category']...` – mac13k May 28 '20 at 06:24
  • yupp did the exact same one as yours hmm – yxc May 28 '20 at 06:44
  • If you are getting such error it means the object you are calling the `value_counts()` method is not `pd.Series`. You can verify it with `type` by: `df.groupby('ID').resample('1w').apply(type)` or `df.groupby('ID').resample('1w').apply(lambda s: type(s['Category']))` – mac13k May 28 '20 at 06:55