
I'm pivoting a table that contains np.nan values, which I would like to have preserved after the transformation; at the moment they are not.

Given the below table:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 7],
                   "E": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})

And now applying the pivot_table:

table = pd.pivot_table(df, values='D', index=['A', 'B'],
                       columns=['C'], aggfunc=np.sum,
                       fill_value='HAHA').reset_index()

The output I get:

C   A    B     large     small
0   bar  one       0       0.0
1   bar  two       7       0.0
2   foo  one       0       0.0
3   foo  two    HAHA       0.0

The output I was expecting:

C   A    B     large     small
0   bar  one    HAHA      HAHA     
1   bar  two       7      HAHA     
2   foo  one    HAHA      HAHA     
3   foo  two    HAHA      HAHA     

Question: Why is the method pivot_table populating only the group "foo - two - large" and the other groups are not having their np.NaN preserved, instead having it replaced by zero?

asa
  • It's working as designed; the sum of `np.NaN` elements is 0 (`df['E'].sum() == 0`); it's only filling elements in your pivot that don't exist at all, which are the only ones that _would_ be NaN in your output – G. Anderson Mar 23 '20 at 23:19
  • Does this answer your question? [Sum across all NaNs in pandas returns zero?](https://stackoverflow.com/questions/33448003/sum-across-all-nans-in-pandas-returns-zero) – G. Anderson Mar 23 '20 at 23:21
  • True, I hadn't realised that the group "foo - two - large" was missing in this case. My initial thought was: since `np.sum([np.nan, np.nan])` gives `nan` as a result, I thought the same would happen when it's used in this method. Do you know why it does not? – asa Mar 23 '20 at 23:33
  • As in the documentation quoted in the linked question, it looks like it's just a design decision to make that the default in pandas. I don't know enough to say exactly why the output defaults to `0` instead of `nan` – G. Anderson Mar 23 '20 at 23:37
  • @AfonsoSchulzAlbrecht that's a super subtle implementation decision. Basically, when you pass `np.sum` as the aggfunc, pandas ignores it and instead uses `getattr(Series, 'sum')`. It's well hidden in the source code, but at the end of my solution I put in the table that is used to do this "dispatching": `np.sum`, `np.nansum` and `sum` all get dispatched to the `sum` attribute of the DataFrame or Series. Slightly relevant: https://stackoverflow.com/questions/60647377/why-np-std-and-pivot-tableaggfunc-np-std-return-the-differnet-result/60647703#60647703 – ALollz Mar 23 '20 at 23:42
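The difference the comments describe can be checked in two lines (a quick sketch, not part of the original thread): NumPy's `np.sum` propagates NaN, while pandas' `Series.sum` skips NaN by default.

```python
import numpy as np
import pandas as pd

# NumPy's sum propagates NaN over a plain list/array...
print(np.sum([np.nan, np.nan]))           # nan

# ...but pandas' Series.sum uses skipna=True by default,
# so an all-NaN series sums to 0.0
print(pd.Series([np.nan, np.nan]).sum())  # 0.0
```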

1 Answer


This is a consequence of how np.sum is treated by groupby. At its core, pivot_table performs a groupby followed by a reshape, so we can see where the unwanted behavior arises.

index=['A', 'B']
columns=['C']
keys = index+columns
aggfunc=np.sum

agged = df.groupby(keys).agg(aggfunc)
#                 D    E
#A   B   C              
#bar one large  0.0  0.0
#        small  0.0  0.0
#    two large  7.0  0.0
#        small  0.0  0.0
#foo one large  0.0  0.0
#        small  0.0  0.0
#    two small  0.0  0.0

For groupby, the default is that an all-NaN group is summed to 0 (NaN values are skipped), and that 0 will not later be considered NaN. However, ('foo', 'two', 'large') is absent from the aggregation entirely, as it never had any observations in your original DataFrame. It is the later reshaping (unstack) step that gives that group, and only that group, NaN values.
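That the group is genuinely absent can be verified directly (a small sketch re-creating the question's `df` so it runs standalone; column `E` is omitted as it plays no role here):

```python
import numpy as np
import pandas as pd

# Re-create the DataFrame from the question
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [np.nan] * 8 + [7]})

groups = df.groupby(['A', 'B', 'C']).groups
print(('foo', 'two', 'large') in groups)  # False: no row has this combination
print(('bar', 'two', 'large') in groups)  # True: this is the row with D == 7
```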

table = agged
if table.index.nlevels > 1:
    # Related GH #17123
    # If index_names are integers, determine whether the integers refer
    # to the level position or name.
    index_names = agged.index.names[: len(index)]
    to_unstack = []
    for i in range(len(index), len(keys)):
        name = agged.index.names[i]
        if name is None or name in index_names:
            to_unstack.append(i)
        else:
            to_unstack.append(name)
    table = agged.unstack(to_unstack)

print(table)
#            D           E      
#C       large small large small
#A   B                          
#bar one   0.0   0.0   0.0   0.0
#    two   7.0   0.0   0.0   0.0
#foo one   0.0   0.0   0.0   0.0
#    two   NaN   0.0   NaN   0.0

So how do you get your desired behavior? You need to pass dropna=False. Pandas also tries to "dispatch" most basic aggregation functions to their optimized equivalents, so we need a lambda to avoid that dispatching.

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], 
               aggfunc=lambda x: sum(x),
               dropna=False,
               fill_value='HAHA')

        large small
A   B              
bar one  HAHA  HAHA
    two     7  HAHA
foo one  HAHA  HAHA
    two  HAHA  HAHA

We can see that sum, np.nansum and np.sum all get aliased to the Series.sum attribute. The lambda is the simplest way to avoid that.

[func for func, attr in pd.DataFrame()._cython_table.items() if attr == 'sum']
#[<function sum>, <function numpy.sum>, <function numpy.nansum>]
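A related alternative (my suggestion, not from the original answer): `Series.sum` accepts a `min_count` parameter, so `aggfunc=lambda x: x.sum(min_count=1)` would also make an all-NaN group aggregate to NaN (which `fill_value` then replaces) instead of 0.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 — default skipna=True sums an all-NaN series to 0
print(s.sum(min_count=1))  # nan — at least one non-NA value is now required
```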
ALollz