
I want to get a count & sum of a column's values over a +/- 7 day window, after grouping the dataframe by another column.

Example data (edited to reflect my real dataset):

group  |        date          |   amount
-------------------------------------------
A      |  2017-12-26 04:20:20 |    50000.0
A      |  2018-01-17 00:54:15 |    60000.0
A      |  2018-01-27 06:10:12 |   150000.0
A      |  2018-02-01 01:15:06 |   100000.0
A      |  2018-02-11 05:05:34 |   150000.0
A      |  2018-03-01 11:20:04 |   150000.0
A      |  2018-03-16 12:14:01 |   150000.0
A      |  2018-03-23 05:15:07 |   150000.0
A      |  2018-04-02 10:40:35 |   150000.0

That is: group by `group`, then for each row sum amounts where `date - 7 days < date < date + 7 days`.

Results that I want:

group  |        date          |   amount    |  grouped_sum
-----------------------------------------------------------
A      |  2017-12-26 04:00:00 |    50000.0  |    50000.0
A      |  2018-01-17 00:00:00 |    60000.0  |    60000.0
A      |  2018-01-27 06:00:00 |   150000.0  |   250000.0
A      |  2018-02-01 01:00:00 |   100000.0  |   250000.0
A      |  2018-02-11 05:05:00 |   150000.0  |   150000.0
A      |  2018-03-01 11:00:04 |   150000.0  |   150000.0
A      |  2018-03-16 12:00:01 |   150000.0  |   150000.0
A      |  2018-03-23 05:00:07 |   100000.0  |   100000.0
A      |  2018-04-02 10:00:00 |   100000.0  |   100000.0

Quick snippet to achieve the dataset:

import pandas as pd

group = 9 * ['A']
date = pd.to_datetime(['2017-12-26 04:20:20', '2018-01-17 00:54:15', 
                       '2018-01-27 06:10:12', '2018-02-01 01:15:06', 
                       '2018-02-11 05:05:34', '2018-03-01 11:20:04', 
                       '2018-03-16 12:14:01', '2018-03-23 05:15:07', 
                       '2018-04-02 10:40:35'])
amount = [50000.0, 60000.0, 150000.0, 100000.0, 150000.0, 
          150000.0, 150000.0, 150000.0, 150000.0]
df = pd.DataFrame({'group':group, 'date':date, 'amount':amount})

Bit of explanation (these bullets refer to the simpler example data used in the answers below, not the edited table above):

  • 2nd row is 40 because it sums data for A in the period 2018-01-14 to 2018-01-15
  • 4th row is 30 because it sums data for B in the period 2018-02-03 + next 7 days
  • 6th row is 30 because it sums data for B in the period 2018-02-05 + prev 7 days.

I don't have any idea how to sum over a date-range window. I might be able to do it this way:

1. Create another column that shows `date - 7` and `date + 7` for each row:

group  |    date     |  amount  |    date-7    |    date+7 
-------------------------------------------------------------
A      |  2017-12-26 |  50000.0 |  2017-12-19  |  2018-01-02
A      |  2018-01-17 |  60000.0 |  2018-01-10  |  2018-01-24

2. Calculate the amount within the date range: `df[(df.group == 'A') & (df.date > df['date-7']) & (df.date < df['date+7'])].amount.sum()` (each comparison needs parentheses, since `&` binds tighter than the comparisons).

3. But this method is quite tedious.

EDIT (2018-09-01): Based on @jezrael's answer I found the method below, which works for me but only for a single group:

t = pd.Timedelta(7, unit='d')

def g(row):
    # sum amounts in the open window (row.date - 7d, row.date + 7d)
    return df[(df.date > row.date - t) & (df.date < row.date + t)].amount.sum()

df['new'] = df.apply(g, axis=1)
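For reference, a minimal sketch (my own extension, using the simpler example frame from the answers below) that makes this work per-group by also filtering on the row's group:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-14', '2018-01-15',
                            '2018-02-03', '2018-02-04', '2018-02-05']),
    'amount': [10, 20, 20, 10, 10, 10],
})

t = pd.Timedelta(7, unit='d')

def g(row):
    # restrict to the same group before applying the open +/- 7 day window
    mask = ((df.group == row.group)
            & (df.date > row.date - t)
            & (df.date < row.date + t))
    return df.loc[mask, 'amount'].sum()

df['new'] = df.apply(g, axis=1)
# new: [10, 40, 40, 30, 30, 30]
```

This scans the whole frame once per row, so it is O(n^2), but it gives the per-group windowed sums without a groupby.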
2 Answers


The problem here is that you need a loop over each row within each group:

t = pd.Timedelta(7, unit='d')

def f(x):
    # for each row, sum 'amount' over rows of the same group whose date falls
    # strictly inside +/- 7 days (on pandas >= 1.3 use inclusive='neither')
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f)
print (df)
  group       date  amount   new
0     A 2018-01-01      10  10.0
1     A 2018-01-14      20  40.0
2     A 2018-01-15      20  40.0
3     B 2018-02-03      10  30.0
4     B 2018-02-04      10  30.0
5     B 2018-02-05      10  30.0

Thanks to @jpp for this improvement:

def f(x, t):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t, 
                                                     y['date'] + t,
                                                     inclusive=False),'amount'].sum(),axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f, pd.Timedelta(7, unit='d'))
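As a side note, the row-wise `apply` can be slow for larger groups. A vectorized per-group sketch using numpy broadcasting (my own variation, not part of the answer, shown on the simple example frame) looks like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-14', '2018-01-15',
                            '2018-02-03', '2018-02-04', '2018-02-05']),
    'amount': [10, 20, 20, 10, 10, 10],
})

def window_sum(g, days=7):
    # pairwise mask: mask[i, j] is True when row j's date falls strictly
    # inside row i's open +/- `days` window
    d = g['date'].values
    t = np.timedelta64(days, 'D')
    mask = (d[None, :] > d[:, None] - t) & (d[None, :] < d[:, None] + t)
    # multiply the mask by the amounts and sum each row of the matrix
    return pd.Series((mask * g['amount'].values).sum(axis=1), index=g.index)

df['new'] = df.groupby('group', group_keys=False).apply(window_sum)
# new: [10, 40, 40, 30, 30, 30]
```

This builds an n x n comparison matrix per group, so it trades memory for speed and is only sensible when each group is moderately sized.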

Verify the solution - filter a single group and print the rows each window selects:

t = pd.Timedelta(7, unit='d')

df = df[df['group'] == 'A']

def test(y):
    a = df.loc[df['date'].between(y['date'] - t, y['date'] + t, inclusive=False)]
    print (a)
    print (a['amount'])
    return a['amount'].sum()

df['new'] = df.apply(test, axis=1)

This prints, per row:

  group       date  amount
0     A 2018-01-01      10
0    10
Name: amount, dtype: int64
  group       date  amount
1     A 2018-01-14      20
2     A 2018-01-15      20
1    20
2    20
Name: amount, dtype: int64
  group       date  amount
1     A 2018-01-14      20
2     A 2018-01-15      20
1    20
2    20
Name: amount, dtype: int64

and the result is:

print (df)
  group       date  amount  new
0     A 2018-01-01      10   10
1     A 2018-01-14      20   40
2     A 2018-01-15      20   40
  • Not sure why it works on this data set but it does not work on my original dataset; not all the results are accurate. – addicted Aug 31 '18 at 12:55
  • @addicted - Can you explain more? Is it possible to add a data sample where it failed? – jezrael Aug 31 '18 at 12:56
  • I am formulating the additional data sample. – addicted Aug 31 '18 at 13:09
  • @addicted - I think I found the problem: `between` is inclusive by default - `<=` and `<=` – jezrael Aug 31 '18 at 13:10
  • It's not `inclusive`; it's more like it is counting the same data row twice. BTW, still, cool code. Why is there a lambda y? – addicted Aug 31 '18 at 13:15
  • 1
    Nice solution. In my opinion, it's a good idea to pass `t` as a parameter to `f()`, so it's more clear what's happening. `GroupBy.apply` accepts `args` and `kwargs`. – jpp Aug 31 '18 at 13:16
  • @addicted - Yes, I'll try to explain how my solution works: for each group, a lambda function is called which iterates over each row of the group and sums all `amount` values matching the condition, but only within that group. Is that what you need? – jezrael Aug 31 '18 at 13:17
  • @jezrael that seems like what I need. But the result shows that it is counting the same amount from the same row twice whenever the row meets the condition. – addicted Aug 31 '18 at 13:22
  • @jpp, pandas' `apply` only allows 1 argument :( – addicted Aug 31 '18 at 13:24
  • @addicted - The `sum` was in the wrong place; it needs to come after the `'amount'` selection, i.e. `,'amount'].sum()` – jezrael Aug 31 '18 at 13:29
  • 1
    @addicted - added testing solution - filter only one group and print filtered values. – jezrael Aug 31 '18 at 13:41
  • @jezrael, I tried to change the numbers in my problem set and I am getting correct answer on that problem set. However my real problem set still gives me wrong answer. I am actually using `Timestamp` data type for `date` column in my real dataset, just FYI. I will need to verify again first. – addicted Sep 01 '18 at 04:01
  • @addicted - so there are times too like `2018-01-01 12:02:20` ? Then need `df['date'] = df['date'].dt.floor('d')` for `2018-01-01`. – jezrael Sep 01 '18 at 05:09
  • @jezrael I have edited the dataset to reflect closer to the real dataset (but it is still anonymized and pseudonymized). I found a method that works based on your answer. I added this method to my question above. Basically your method gives me 300000.0 and 200000.0 for 3rd and 4th row. – addicted Sep 01 '18 at 07:50
  • sorry @jezrael, my method only works for single group. – addicted Sep 01 '18 at 08:02
  • 1
    @addicted I have only one idea, what is problem - try sorting groups in real data by `df =df. sort_values('group')` . Is possible anonymize real dataset, e. g. by [this](https://stackoverflow.com/a/49321470/2901002) and send me dataset with wrong output to my email from profile if still problem and verifying solution not working? – jezrael Sep 01 '18 at 09:50

Add a column with the first day of each week:

df['week_start'] = df['date'].dt.to_period('W').apply(lambda x: x.start_time)

Result:

  group       date  amount week_start
0     A 2018-01-01      10 2017-12-26
1     A 2018-01-14      20 2018-01-09
2     A 2018-01-15      20 2018-01-09
3     B 2018-02-03      10 2018-01-30
4     B 2018-02-04      10 2018-01-30
5     B 2018-02-05      10 2018-01-30

Group by the new column and find the weekly total amount (note: to be safe when different groups share the same week, group by `['group', 'week_start']` instead):

grouped_sum = df.groupby('week_start')['amount'].sum().reset_index()

Result:

  week_start  amount
0 2017-12-26      10
1 2018-01-09      40
2 2018-01-30      30

Merge dataframes on week_start:

pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start').drop('week_start', axis=1)

Result:

  group       date  amount
0     A 2018-01-01      10
1     A 2018-01-14      40
2     A 2018-01-15      40
3     B 2018-02-03      30
4     B 2018-02-04      30
5     B 2018-02-05      30
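For completeness, the merge can be avoided with `groupby(...).transform`, which writes the weekly sums back in place (a sketch, my own variation; note that the week anchor matters - the `week_start` values shown above correspond to weeks ending Monday, i.e. `'W-MON'`):

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-14', '2018-01-15',
                            '2018-02-03', '2018-02-04', '2018-02-05']),
    'amount': [10, 20, 20, 10, 10, 10],
})

# bucket each date into its calendar week (ending Monday) and sum per bucket
week = df['date'].dt.to_period('W-MON')
df['weekly_sum'] = df.groupby(week)['amount'].transform('sum')
# weekly_sum: [10, 40, 40, 30, 30, 30]
```

Keep in mind that bucketing by calendar week is not the same as the +/- 7 day window the question asks for: two dates 3 days apart can land in different weeks.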