Pandas cumulative conditional sum by dates

Question

Imagine a pandas DataFrame like this

      date  id  initial_value  part_value
2016-01-21   1            100          10
2016-05-18   1            100          20
2016-03-15   2            150          75
2016-07-28   2            150          50
2016-08-30   2            150          25
2015-07-21   3             75          75

Generated with following

df = pd.DataFrame({
    'id': (1, 1, 2, 2, 2, 3),
    'date': tuple(pd.to_datetime(date) for date in
                  ('2016-01-21', '2016-05-18', '2016-03-15', '2016-07-28', '2016-08-30', '2015-07-21')),
    'initial_value': (100, 100, 150, 150, 150, 75),
    'part_value': (10, 20, 75, 50, 25, 75)}).sort_values(['id', 'date'])

I wish to add a column with the remaining value defined by the initial_value minus the cumulative sum of part_value conditioned on id and dates before. Hence I wish my goal is

      date  id  initial_value  part_value  goal
2016-01-21   1            100          10   100
2016-05-18   1            100          20    90
2016-03-15   2            150          75   150
2016-07-28   2            150          50    75
2016-08-30   2            150          25    25
2015-07-21   3             75          75    75

I'm thinking that a solution can be made by combining the solution from here and here, but I can't exactly figure it out.

Can you explain more `dates before` ? If dates are not sorted, output is different? — jezrael, Mar 22 '17 at 14:13
Ok, so is necessary sort values first? Or values are sorted? — jezrael, Mar 22 '17 at 14:21
I've sorted them when creating the example so they can be assumed sorted. Easy to sort nevertheless — mr.bjerre, Mar 22 '17 at 14:36

score 2 · Accepted Answer · answered Mar 22 '17 at 14:12

If dont use dates values need add, sub and groupby with cumsum:

df['goal'] = df.initial_value.add(df.part_value).sub(df.groupby('id').part_value.cumsum())
print (df)
        date  id  initial_value  part_value  goal
0 2016-01-21   1            100          10   100
1 2016-05-18   1            100          20    90
2 2016-03-15   2            150          75   150
3 2016-07-28   2            150          50    75
4 2016-08-30   2            150          25    25
5 2015-07-21   3             75          75    75

What is same as:

df['goal'] = df.initial_value + df.part_value - df.groupby('id').part_value.cumsum()
print (df)
        date  id  initial_value  part_value  goal
0 2016-01-21   1            100          10   100
1 2016-05-18   1            100          20    90
2 2016-03-15   2            150          75   150
3 2016-07-28   2            150          50    75
4 2016-08-30   2            150          25    25
5 2015-07-21   3             75          75    75

Or even: `(df.initial_value - df.groupby('id').part_value.shift(1).fillna(0)).astype(int)` ? — Jon Clements, Mar 22 '17 at 14:17
Yes, there is problem with values in 4 index - `100` instead `25` — jezrael, Mar 22 '17 at 14:19

score 0 · Answer 2 · answered Mar 22 '17 at 14:15

I actually came up with a solution myself as well. I guess it is kind of the same that is happening.

df['goal'] = df.initial_value - ((df.part_value).groupby(df.id).cumsum() - df.part_value)
df
        date  id  initial_value  part_value  goal
0 2016-01-21   1            100          10   100
1 2016-05-18   1            100          20    90
2 2016-03-15   2            150          75   150
3 2016-07-28   2            150          50    75
4 2016-08-30   2            150          25    25
5 2015-07-21   3             75          75    75

Pandas cumulative conditional sum by dates

2 Answers2