
Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the method suggested there. It works when I use the built-in sum(), but not when I use np.nansum:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))

sum() case:

print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
                   a         b
2011-01-01  1.328933  1.678469
2011-01-02  1.878389  1.343327
2011-01-03  0.964278  1.302857

np.nansum() case:

print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))

a    [1.32893299939, 1.87838886222, 0.964278430632,...
b    [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object

Any idea why?


1 Answer


The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.

Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
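To see the conversion concretely, here is a minimal check using the random df from the question (only the types matter here, not the values):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'a', 'b', 'b'],
                  index=pd.date_range('2011-1-1', periods=100))

group_a = df.loc[:, ['a']]                  # one of the groups handed to apply

# np.asanyarray (what _replace_nan does) strips the index and columns...
print(type(np.asanyarray(group_a)))         # <class 'numpy.ndarray'>

# ...so np.nansum hands back a plain ndarray with no labels for groupby to realign
print(type(np.nansum(group_a, axis=1)))     # <class 'numpy.ndarray'>

# a pandas reduction, by contrast, keeps the index
print(type(group_a.sum(axis=1)))            # <class 'pandas.core.series.Series'>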

All hope isn't lost though. You can easily replicate np.nansum with Pandas functions. Specifically use sum followed by fillna:

df.groupby(df.columns, axis=1).sum().fillna(0)

The sum should ignore NaNs and just add up the non-null values. The only case where you'll get back a NaN is when all the values being summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
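Spelled out, that alternative would presumably look like this (same groupby pattern as above, with the fillna moved in front):

df.fillna(0).groupby(df.columns, axis=1).sum()  # fill NaNs first, then sum each set of duplicate columns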

If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:

df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))

Example Computations

For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):

df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))

     a    a    a    b    b
0  1.0  2.0  2.0  NaN  4.0
1  NaN  NaN  NaN  3.0  3.0
2  NaN  NaN -1.0  2.0  NaN

Using sum without fillna:

df.groupby(df.columns, axis=1).sum()

     a    b
0  5.0  4.0
1  NaN  6.0
2 -1.0  2.0

Using sum and fillna:

df.groupby(df.columns, axis=1).sum().fillna(0)

     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0

Comparing to the fixed np.nansum method:

df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0
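As a quick sanity check, both approaches give the same frame on this example (the via_pandas / via_nansum names are just for the comparison; both results come back as float64, so .equals is a fair test):

via_pandas = df.groupby(df.columns, axis=1).sum().fillna(0)
via_nansum = df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))

print(via_pandas.equals(via_nansum))  # True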