57

I figured out these two methods. Is there a better one?

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
>>> print(df.sum().sum())
42
>>> print(df.values.sum())
42

Just want to make sure I'm not missing something more obvious.

piRSquared
Bill
  • 3
    Be careful, because if there are `nan` values `df.sum().sum()` ignores the `nan` and returns a `float` whereas `df.values.sum()` returns `nan`. So the 2 methods are not equivalent. – Ramon Crehuet Jan 28 '19 at 13:39
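To make the caveat in the comment above concrete, here is a minimal sketch. The frame `df_nan` is a made-up example (not from the question), and `np.nansum` is shown only as one possible way to get NaN-skipping behaviour on the NumPy side.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, for illustration only.
df_nan = pd.DataFrame({'A': [5.0, np.nan, 7.0], 'B': [7.0, 8.0, 9.0]})

pandas_total = df_nan.sum().sum()              # pandas skips NaN by default (skipna=True)
numpy_total = df_nan.to_numpy().sum()          # plain NumPy sum propagates NaN
nan_safe_total = np.nansum(df_nan.to_numpy())  # NumPy sum that treats NaN as zero

print(pandas_total)    # 36.0
print(numpy_total)     # nan
print(nan_safe_total)  # 36.0
```

So if the data may contain missing values, the two approaches only agree when the NumPy side uses something like `np.nansum`.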

2 Answers

70

Updated for Pandas 0.24+

df.to_numpy().sum()

Prior to Pandas 0.24

df.values

is the underlying NumPy array, and

df.values.sum()

calls NumPy's sum method, which is faster than df.sum().sum().

piRSquared
  • Thanks. That's what I thought! – Bill Aug 03 '16 at 02:53
  • 2
    Is it faster purely because one function calls the other or is there some more fundamental difference? – kuanb Feb 10 '17 at 01:08
  • 3
    @kuanb two reasons. One, `df.values.sum()` is a `numpy` operation and most of the time, `numpy` is more performant. Two, `numpy` sums over all elements in an array regardless of dimensionality. `pandas` requires two separate calls to `sum` one for each dimension. – piRSquared Feb 10 '17 at 09:53
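The second point can be seen by looking at the intermediate result. This is only a sketch that reuses the small frame from the question and assumes Pandas 0.24+ for `to_numpy()`:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})

per_column = df.sum()              # first reduction: a Series with one total per column
pandas_total = per_column.sum()    # second reduction: collapses that Series to a scalar
numpy_total = df.to_numpy().sum()  # single reduction over every element of the 2-D array

print(per_column['A'], per_column['B'])  # 18 24
print(pandas_total, numpy_total)         # 42 42
```

The pandas route builds an intermediate Series before producing the scalar, whereas the NumPy route reduces the whole array in one call.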
6

Adding some numbers to support this:

import timeit

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(int(1e6)).reshape(500000, 2), columns=list("ab"))

def pandas_test():
    return df['a'].sum()

def numpy_test():
    return df['a'].to_numpy().sum()

timeit.timeit(numpy_test, number=1000)  # 0.5032469799989485
timeit.timeit(pandas_test, number=1000)  # 0.6035906639990571

So `to_numpy().sum()` is roughly 20% faster than the pandas `sum()` on my machine, and that is just for a single Series summation!

Raven
  • But is `df['a'].sum()` the same as `df['a'].to_numpy().sum()`? I think `df['a'].sum()` only sums the columns doesn't it? – Bill May 28 '20 at 14:07
  • Yeah, this is just a comparison for a single Series summation; I wasn't summing the whole df – Raven May 29 '20 at 12:19
  • 1
    Oh I see. But this question is about summing the whole dataframe, not one series. – Bill May 29 '20 at 19:53
  • Can you report your pandas and numpy versions? I get a much bigger speed difference on your tests with Pandas 0.24.2 and Numpy 1.16.2. – Bill May 29 '20 at 19:59
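Following up on the comments above, a sketch of the same benchmark applied to the whole DataFrame could look like the code below. The function names are made up for illustration, `dtype=np.int64` is my addition so the total cannot overflow on platforms whose default integer is 32-bit, and no timings are quoted because they depend on the machine and on the library versions (which the script prints).

```python
import timeit

import numpy as np
import pandas as pd

# Report versions alongside any timings, since results vary between releases.
print(pd.__version__, np.__version__)

# Same shape as the Series benchmark; int64 is an assumption to avoid overflow.
df = pd.DataFrame(
    np.arange(1_000_000, dtype=np.int64).reshape(500000, 2),
    columns=list("ab"),
)

def pandas_whole_frame():
    # Two reductions: per-column totals, then the total of those totals.
    return df.sum().sum()

def numpy_whole_frame():
    # One reduction over the underlying 2-D array.
    return df.to_numpy().sum()

# Both approaches should agree on an all-integer frame with no NaN values.
assert pandas_whole_frame() == numpy_whole_frame()

pandas_seconds = timeit.timeit(pandas_whole_frame, number=100)
numpy_seconds = timeit.timeit(numpy_whole_frame, number=100)
print(f"pandas: {pandas_seconds:.4f}s  numpy: {numpy_seconds:.4f}s")
```

Running it on your own setup is the only reliable way to see how large the gap is for a full-frame sum.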