
I want to find the mean of a pandas DataFrame, so I was using the mean function that pandas provides by default. Link to its doc

df.mean()

But the problem with this function is that if the total of all the values exceeds the limit of the data type, an overflow occurs. In my case, the data is float16 and there are more than 20 million records, so the total of all the records will obviously overflow float16. One approach is to change the dtype to float64, but this would use too much extra memory, since each value is only in the range ~1900-2100. So I want to compute the mean iteratively using the method given here. Here is my implementation for a pandas DataFrame:

import math

def mean_without_overflow(df):
    avgs = []
    for column in df:
        avg, t = 0, 1
        for data in df[column]:
            if not math.isnan(data):
                avg += (data - avg) / t  # incremental (running) mean update
                t += 1
        avgs.append(avg)
    return avgs

Here, for each column, I'm iterating over all the rows, so the total number of iterations is (# of columns) × (# of records). This does not overflow and gives the correct mean of the entire data frame, but it's way slower than the default mean function provided by pandas.

So what am I missing here? How can I optimize this? Or is there any function available in pandas out of the box for finding the mean iteratively?

Edit: Overflow seems to be a common problem when calculating a mean. I wonder why the default mean() in pandas is not implemented with such an iterative approach, which would prevent overflow in data types with smaller ranges.
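To make the overflow concrete, here is a minimal NumPy sketch (my own illustration, not from the original question): float16 tops out around 65504, so a running sum of values near 2000 blows past that after only a few dozen elements.

```python
import numpy as np

# 100,000 values around 2000 stored as float16; the true total (~2e8)
# is far beyond float16's maximum of ~65504.
values = np.full(100_000, 2000.0, dtype=np.float16)

# np.sum keeps the accumulator in the input dtype for floats,
# so the running total overflows to inf.
print(values.sum())  # inf
```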


3 Answers


Found the solution myself. The logic is to first normalize all the values by dividing them by the length of the Series (# of records), then use the default df.mean(), and finally multiply the normalized mean by the # of records. This is an improvement from 1 min 37 s to 3.13 s. But I still don't understand why the pandas implementation doesn't use such an optimization.

def mean_without_overflow_fast(col):
    col /= len(col)  # note: this divides in place
    return col.mean() * len(col)

Use this function as follows:

print(df.apply(mean_without_overflow_fast))
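One caveat worth noting (my addition, not part of the original answer): `col /= len(col)` divides in place, so `df.apply` can end up modifying the original DataFrame's data. A non-mutating variant (the `_copy` name is hypothetical):

```python
import pandas as pd

def mean_without_overflow_fast_copy(col):
    # Divide into a new Series instead of in place, leaving the
    # caller's DataFrame untouched; the scaled values sum to the
    # mean itself (~2000 in the question's data), which fits
    # comfortably in the column's dtype.
    scaled = col / len(col)
    return scaled.mean() * len(col)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
print(df.apply(mean_without_overflow_fast_copy))
# df itself still holds its original values afterwards
```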

Looping in pandas is slow, which is why you can use apply instead.

import math

def mean_without_overflow(column):
    avg, t = 0, 1
    for data in column:
        if not math.isnan(data):
            avg += (data - avg) / t  # incremental mean update
            t += 1
    return avg

Then we can compute the mean of the entire DataFrame.

mean_df = np.mean(df.apply(mean_without_overflow))

The script above is equivalent to

mean_df = np.mean(df.apply(np.mean))
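As a quick sanity check (toy data of my own, not from the answer), the NaN-skipping in the loop matches pandas' default `skipna` behavior:

```python
import math
import numpy as np
import pandas as pd

def mean_without_overflow(column):
    avg, t = 0, 1
    for data in column:
        if not math.isnan(data):
            avg += (data - avg) / t
            t += 1
    return avg

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})
print(np.mean(df.apply(mean_without_overflow)))  # 3.5
print(df.mean().mean())                          # 3.5 (built-in, also skips NaN)
```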

Correct me if I'm wrong but I believe:

sum(l) / len(l) = sum(l[:n]) / len(l) + sum(l[n:2*n]) / len(l) + ...

Which means you can np.sum in batches of size n such that n * 2100 < max_float16.
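A sketch of this batching idea (the helper name and defaults are my own; it assumes `batch_size * max(values)` stays representable in the data's dtype, e.g. ~31 for float16 values near 2100):

```python
import numpy as np

def batched_mean(values, batch_size):
    # Sum the data in small batches so each partial np.sum stays
    # within the dtype's range, and divide every batch sum by the
    # total length before accumulating.
    n = len(values)
    total = 0.0
    for start in range(0, n, batch_size):
        total += float(np.sum(values[start:start + batch_size])) / n
    return total

data = np.arange(1000, dtype=np.float64)
print(batched_mean(data, 31))  # ≈ 499.5
```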

  • Your approach is also correct. Let me implement it and observe the run time. – Kaushal28 Oct 25 '19 at 11:45
  • Max value of the `float16` is `65500.0`. So max batch size can be 31 in my case and I don't think this will improve the runtime significantly compared to default `mean()` – Kaushal28 Oct 25 '19 at 11:57