6

I have a pandas dataframe with two columns A,B as below.

I want a vectorized solution for creating a new column C where C[i] = C[i-1] - A[i] + B[i].

df = pd.DataFrame(data={'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})

>>> df 
     A  B
 0  10  0
 1   2  1
 2   3  2
 3   4  3
 4   5  4
 5   6  5

Here is the solution using for-loops:

df['C'] = df['A']

for i in range(1, len(df)):
    df['C'][i] = df['C'][i-1] - df['A'][i] + df['B'][i] 

>>> df
    A  B   C
0  10  0  10
1   2  1   9
2   3  2   8
3   4  3   7
4   5  4   6
5   6  5   5

... which does the job.

But since loops are slow in comparison to vectorized calculations, I want a vectorized solution for this in pandas:

I tried to use the shift() method like this:

df['C'] = df['C'].shift(1).fillna(df['A']) - df['A'] + df['B']

but it didn't help since the shifted C column isn't updated with the calculation. It keeps its original values:

>>> df['C'].shift(1).fillna(df['A'])
0    10
1    10
2     2
3     3
4     4
5     5

and that produces a wrong result.

smci
  • 32,567
  • 20
  • 113
  • 146
dimyG
  • 687
  • 9
  • 19
  • Good question, but your example data is not great since all the differences `df['B'] - df['A']` are -1 except the first row, so you wouldn't notice off-by-one bugs in solutions, indexing errors etc. – smci Apr 16 '19 at 23:18

1 Answers1

8

This can be vectorized since:

  • delta[i] = C[i] - C[i-1] = -A[i] +B[i]. You can get delta from A and B first, then...
  • calculate cumulative sum of delta (plus C[0]) to get full C

Code as follows:

delta = df['B'] - df['A']
delta[0] = 0
df['C'] = df.loc[0, 'A'] + delta.cumsum() 
​
print df
    A  B   C
0  10  0  10
1   2  1   9
2   3  2   8
3   4  3   7
4   5  4   6
5   6  5   5
smci
  • 32,567
  • 20
  • 113
  • 146
Happy001
  • 6,103
  • 2
  • 23
  • 16
  • Thanks @Happy001 it works perfectly. Its also a useful lesson for the future: If you stuck, do some algebra and rethink your problem. – dimyG Dec 28 '15 at 10:01
  • This is really neat. If you can come up with a vectorized way of doing `delta[0] = 0`, you could make it a one-liner and eliminate `delta` entirely, which will be more performant on large dataframes. – smci Apr 16 '19 at 23:34