
I have data with 3 columns: date, id, sales. My first task was filtering for sales above 100, which I did. The second task is grouping ids by consecutive days.

index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205

the result should be:

index  id  count
0      03  3
1      07  1
2      07  1

I need to do this task without using pandas/DataFrame, but right now I can't see from which side to attack this problem. Just for effort, I tried the suggested solution here: count consecutive days python dataframe, but the ids are not grouped. Here is my code:

data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)

It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of those day counts in the "count" column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc. But that is not part of my question. Thank you a lot.
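For the "without pandas" requirement, here is a rough sketch using only the standard library. The row data is hard-coded from the sample above, and all variable names are just for illustration:

```python
from datetime import datetime

# Hypothetical input rows: (date, id, sales), dates as dd/mm/yyyy strings
rows = [
    ("01/01/2018", "03", 101),
    ("01/01/2018", "07", 178),
    ("02/01/2018", "03", 120),
    ("03/01/2018", "03", 150),
    ("05/01/2018", "07", 205),
]

# Task 1: keep sales >= 100, parsing the dates along the way
kept = [(datetime.strptime(d, "%d/%m/%Y").date(), i)
        for d, i, s in rows if s >= 100]

# Task 2: sort by id then date, and count runs of consecutive days per id
kept.sort(key=lambda t: (t[1], t[0]))
result = []  # list of (id, count) pairs
prev_id, prev_date, count = None, None, 0
for date, id_ in kept:
    if id_ == prev_id and (date - prev_date).days == 1:
        count += 1          # same id, exactly one day later: extend the run
    else:
        if prev_id is not None:
            result.append((prev_id, count))
        count = 1           # new id or gap in dates: start a new run
    prev_id, prev_date = id_, date
if prev_id is not None:
    result.append((prev_id, count))

print(result)  # [('03', 3), ('07', 1), ('07', 1)]
```

The sort makes each id's dates adjacent, so a single pass with a running counter is enough.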

paul

1 Answer


Your code is close, but needs some fine-tuning, as follows:

data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')

Result:

print(new_frame)

   id  count
0   3      3
1   7      1
2   7      1

Summary of changes:

  1. As your dates are in dd/mm/yyyy format instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be parsed as 2018-02-01 instead of 2018-01-02 as expected, and the day diff with adjacent entries will be around 30 as opposed to 1.

  2. We added a sort step on the columns id and date to simplify the later grouping when creating the series s.

  3. In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.

  4. In the last groupby(), we add an extra .reset_index(name='count') to turn the pandas Series back into a dataframe and name the new column count.
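Putting the changes together, here is a self-contained version of the pipeline. The sample data is rebuilt as a hypothetical df, with id kept as strings to preserve the leading zeros:

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "date": ["01/01/2018", "01/01/2018", "02/01/2018", "03/01/2018", "05/01/2018"],
    "id": ["03", "07", "03", "03", "07"],
    "sales": [101, 178, 120, 150, 205],
})

# Filter, then parse dd/mm/yyyy dates (.copy() avoids SettingWithCopyWarning)
data = df[df["sales"] >= 100].copy()
data["date"] = pd.to_datetime(data["date"], dayfirst=True)

# Sort so each id's dates are adjacent, then number runs of consecutive days
df2 = data.sort_values(["id", "date"])
s = df2.groupby("id").date.diff().dt.days.ne(1).cumsum()
new_frame = (df2.groupby(["id", s]).size()
                .reset_index(level=1, drop=True)
                .reset_index(name="count"))

print(new_frame)
```

Running this yields one row per run of consecutive days: id 03 with count 3, and id 07 with two runs of count 1 each.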

SeaBean
  • SeaBean you are a genius, thank you a lot. Could you explain the need for the ne(1) function? What is its purpose in that line? Thank you – paul Sep 11 '21 at 10:42
  • @paul [`.ne()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ne.html) is the function to test `not equal`, that is, `!=`. `ne(1)` tests not-equal-to-1 for the date difference between 2 consecutive entries. Since you want a new group when the date difference is more than 1, we use this test so that whenever the condition is True (i.e. date difference > 1), a new group number is generated by the `.cumsum()` function. – SeaBean Sep 11 '21 at 11:34
  • @paul Note that the solution you based this on checks whether the consecutive-day difference is `not equal` to 1 to generate a new group (hence a new count). Beware of the possibility that your data contains a date difference of `0`. In that case the diff is 0, which is also `ne(1)`. So check whether that is what you want. You can also consider changing it to `gt(1)` for the strictly-greater-than-1 case. – SeaBean Sep 11 '21 at 11:37
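The caveat in the last comment can be seen with a tiny example, using hypothetical dates where one day appears twice:

```python
import pandas as pd

# Two entries on the same day: their diff is 0, which is ne(1) but not gt(1)
dates = pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02"])
diff = pd.Series(dates).diff().dt.days

print(diff.ne(1).cumsum().tolist())  # [1, 2, 2] -> the duplicate starts a new group
print(diff.gt(1).cumsum().tolist())  # [0, 0, 0] -> the duplicate stays in the group
```

With ne(1), a 0-day gap breaks the run; with gt(1), same-day duplicates are counted into the existing run.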