I want to find the mean of a pandas DataFrame. So I was using the following mean function which pandas provides by default (link to its doc):
df.mean()
The problem with this function is that if the sum of all the values exceeds the limit of the data type, an overflow occurs. In my case the data is float16 and there are more than 20 million records, so the total of all the records will obviously overflow float16. One approach is to change the dtype to float64, but that would use far more memory than necessary, since each value lies in the range ~1900-2100. So I want to compute the mean iteratively, using the method given here. Here is my implementation for a pandas DataFrame:
```python
import math

def mean_without_overflow(df):
    avgs = []
    for column in df:
        # incremental mean: avg_t = avg_{t-1} + (x_t - avg_{t-1}) / t
        avg, t = 0, 1
        for data in df[column]:
            if not math.isnan(data):
                avg += (data - avg) / t
                t += 1
        avgs.append(avg)
    return avgs
```
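For illustration, a small numpy sketch (values chosen arbitrarily to mimic my data) shows the difference between a plain float16 running sum and the incremental update:

```python
import numpy as np

# float16 tops out at 65504, so a plain running sum of ~2000-valued
# entries overflows after roughly 33 additions.
vals = np.full(40, 2000.0, dtype=np.float16)

total = np.float16(0.0)
for v in vals:
    total = np.float16(total + v)
print(total)  # inf -- the float16 running sum overflowed

# The incremental update keeps the accumulator near the mean itself,
# so it stays comfortably inside the float16 value range.
avg, t = 0.0, 1
for v in vals:
    avg += (float(v) - avg) / t
    t += 1
print(avg)  # 2000.0
```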
In this implementation, for each column I iterate over all the rows, so the total number of iterations is (number of columns) × (number of records). It does not overflow and gives the correct mean of the entire data frame, but it is far slower than the default mean function provided by pandas.
So what am I missing here? How can I optimize this? Or is there any function available in pandas out of the box for finding the mean iteratively?
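One possible optimization (a sketch with a hypothetical helper, not a built-in pandas API): numpy's reductions accept a `dtype` argument for the accumulator, so the float16 data can be reduced with a float64 accumulator without first converting the whole frame to float64:

```python
import numpy as np
import pandas as pd

def mean_with_float64_accumulator(df):
    # Hypothetical helper, assuming all columns are float16.
    # The data stays float16; only np.nanmean's internal accumulator
    # is float64 (via the dtype argument), so no float64 copy of the
    # full frame is kept around. NaNs are skipped, as in df.mean().
    return pd.Series(
        np.nanmean(df.to_numpy(), axis=0, dtype=np.float64),
        index=df.columns,
    )

# Example: 100k float16 values around 2000 would overflow a float16
# sum, but the float64 accumulator returns the exact mean.
df = pd.DataFrame({"a": np.full(100_000, 2000.0, dtype=np.float16)})
print(mean_with_float64_accumulator(df))
```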
Edit:
Overflow seems to be a common problem when calculating a mean. I wonder why the default mean() in pandas is not implemented with such an iterative approach, which would prevent overflow for data types with smaller ranges.
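A middle ground between the slow per-element loop and a full float64 conversion might be chunked accumulation (again just a sketch, with a hypothetical `chunked_mean` helper): process the frame in chunks, accumulating per-column float64 sums and non-NaN counts, so the reductions stay vectorized while at most one chunk is materialized at a time:

```python
import numpy as np
import pandas as pd

def chunked_mean(df, chunk_size=1_000_000):
    # Hypothetical helper, assuming all columns are float16.
    # Per-column float64 sums and non-NaN counts, accumulated one
    # chunk at a time. 20M values around 2100 sum to ~4.2e10, far
    # below float64's limit, so the accumulator cannot overflow.
    sums = np.zeros(df.shape[1], dtype=np.float64)
    counts = np.zeros(df.shape[1], dtype=np.int64)
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].to_numpy()
        sums += np.nansum(chunk, axis=0, dtype=np.float64)
        counts += np.sum(~np.isnan(chunk), axis=0)
    return pd.Series(sums / counts, index=df.columns)
```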