3

I want to find the difference between 2 columns of type int in a pandas DataFrame. I am using python 2.7. The columns are as below -

>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED
0                 15               NaN
1                 20               NaN
2                  7               NaN
3                  7               NaN
4                  7               NaN

Now, I want to subtract QUANTITY_SHIPPED from INVOICED_QUANTITY & I do the below-

>>> df['Diff'] = df['QUANTITY_INVOICED'] - df['SHIPPED_QUANTITY']
>>> df
   QUANTITY_INVOICED  SHIPPED_QUANTITY  Diff
0                 15               NaN   NaN
1                 20               NaN   NaN
2                  7               NaN   NaN
3                  7               NaN   NaN
4                  7               NaN   NaN

How do I take care of the NaN's? I would like to get the below as result as I want NaN's to be treated as 0 (zero)-

>>> df
       QUANTITY_INVOICED  SHIPPED_QUANTITY  Diff
    0                 15               NaN   15
    1                 20               NaN   20
    2                  7               NaN   7
    3                  7               NaN   7
    4                  7               NaN   7

I do not want to do a df.fillna(0). For sum I would try something like the following & it works but not for difference -

>>> df['Sum'] = df[['QUANTITY_INVOICED', 'SHIPPED_QUANTITY']].sum(axis=1)
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff  Sum
0                 15               NaN   NaN   15
1                 20               NaN   NaN   20
2                  7               NaN   NaN    7
3                  7               NaN   NaN    7
4                  7               NaN   NaN    7
Alex Riley
  • 169,130
  • 45
  • 262
  • 238
0nir
  • 1,345
  • 4
  • 20
  • 41

3 Answers3

7

You can use the sub method to perform the subtraction - this method allows NaN values to be treated as a specified value:

df['Diff'] = df['INVOICED_QUANTITY'].sub(df['QUANTITY_SHIPPED'], fill_value=0)

Which produces:

   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7

The other neat way to do this is as @JianxunLi suggests: fill in the missing values in the column (creating a copy of the column) and subtract as normal.

The two approaches are almost the same, although sub is a little more efficient because it doesn't need to produce a copy of the column in advance; it just fills the missing values "on the fly":

In [46]: %timeit df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED'].fillna(0)
10000 loops, best of 3: 144 µs per loop

In [47]: %timeit df['INVOICED_QUANTITY'].sub(df['QUANTITY_SHIPPED'], fill_value=0)
10000 loops, best of 3: 81.7 µs per loop
Community
  • 1
  • 1
Alex Riley
  • 169,130
  • 45
  • 262
  • 238
5

I think a simple fill NaN by 0 would help you out.

df['Diff'] = df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED'].fillna(0)

Out[153]: 
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7
Jianxun Li
  • 24,004
  • 10
  • 58
  • 76
  • #Jianxun Li - I do not want to do a fillna(0). Is there any other option? Have edited my question, please have a look. – 0nir Jun 25 '15 at 14:59
  • `fillna()` just returns a copy instead of modifying underlying frame. I've modified the code to tailor your need. – Jianxun Li Jun 25 '15 at 15:16
0

Solutions suggested by @Alex Riley and @Jianxun Li do not work as intended when both columns are NaN. You can slightly revise this solution of @Jianxun Li (in the expense of some computational time) to fix that.

df['Diff'] = df['INVOICED_QUANTITY'].fillna(0) - df['QUANTITY_SHIPPED'].fillna(0)

I am posting a few options to compare:

data = {'C1':  [1,2,np.nan,np.nan],
        'C2': [6,np.nan,4,np.nan],
       }
df = pd.DataFrame(data)

df['Dif']=df.C1-df.C2
df['Dif2']=df['C1'].sub(df['C2'], fill_value=0)
df['Dif3']=df['C1']-df['C2'].fillna(0)
df['Dif4']=df['C1'].fillna(0)-df['C2']
df['Dif5']=df['C1'].fillna(0)-df['C2'].fillna(0)

print (df)

which produces

     C1    C2    Dif   Dif2   Dif3   Dif4   Dif5
0 1.000 6.000 -5.000 -5.000 -5.000 -5.000 -5.000
1 2.000   NaN    NaN  2.000  2.000    NaN  2.000
2   NaN 4.000    NaN -4.000    NaN -4.000 -4.000
3   NaN   NaN    NaN    NaN    NaN    NaN  0.000
volkan g
  • 186
  • 10