
I want to find the mean of a pandas DataFrame, so I was using the mean function that pandas provides by default. Link to its doc

df.mean()

But the problem with this function is that if the total of all the values exceeds the limit of the data type, an overflow occurs. In my case, the data is float16 and there are more than 20 million records, so the total of all the records will obviously overflow float16. One approach is to change the dtype to float64, but this would use too much extra memory, since each value is only in the range ~1900-2100. So I want to compute the mean iteratively using the method given here. Here is my implementation for a pandas DataFrame:

import math

def mean_without_overflow(df):
    avgs = []
    for column in df:
        avg, t = 0, 1
        for data in df[column]:
            if not math.isnan(data):
                avg += (data - avg) / t  # incremental (running) mean update
                t += 1
        avgs.append(avg)
    return avgs

Here, for each column, I'm iterating over all the rows, so the total number of iterations is (# of columns) × (# of records). This does not overflow and gives the correct mean of the entire data frame, but it's way slower than the default mean function provided by pandas.

So what am I missing here? How can I optimize this? Or is there any function available in pandas out of the box for finding the mean iteratively?

Edit: Overflow seems to be a common problem when calculating a mean. I wonder why the default mean() in pandas is not implemented with such an iterative approach, which would prevent overflow in data types with smaller ranges.
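To make the overflow concrete, here is a minimal NumPy sketch (my own illustration, not from the original question): float16 tops out around 65504, so a running sum of values near 2000 blows past that after only a few dozen elements.

```python
import numpy as np

# 100,000 values around 2000 stored as float16; the true total (~2e8)
# is far beyond float16's maximum of ~65504.
values = np.full(100_000, 2000.0, dtype=np.float16)

# np.sum keeps the accumulator in the input dtype for floats,
# so the running total overflows to inf.
print(values.sum())  # inf
```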


3 Answers


Found the solution myself. The logic is to first normalize all the values by dividing them by the length of the Series (# of records), then use the default df.mean(), and finally multiply the normalized mean by the # of records. This is an improvement from 1 min 37 s to 3.13 s. But I still don't understand why the pandas implementation doesn't use such an optimization.

def mean_without_overflow_fast(col):
    col /= len(col)  # note: this divides in place
    return col.mean() * len(col)

Use this function as follows:

print(df.apply(mean_without_overflow_fast))
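One caveat worth noting (my addition, not part of the original answer): `col /= len(col)` divides in place, so `df.apply` can end up modifying the original DataFrame's data. A non-mutating variant (the `_copy` name is hypothetical):

```python
import pandas as pd

def mean_without_overflow_fast_copy(col):
    # Divide into a new Series instead of in place, leaving the
    # caller's DataFrame untouched; the scaled values sum to the
    # mean itself (~2000 in the question's data), which fits
    # comfortably in the column's dtype.
    scaled = col / len(col)
    return scaled.mean() * len(col)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
print(df.apply(mean_without_overflow_fast_copy))
# df itself still holds its original values afterwards
```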

Looping in pandas is slow, which is why you can use apply instead.

import math

def mean_without_overflow(column):
    avg, t = 0, 1
    for data in column:
        if not math.isnan(data):
            avg += (data - avg) / t  # incremental mean update
            t += 1
    return avg

Then we can compute the mean of the entire DataFrame.

mean_df = np.mean(df.apply(mean_without_overflow))

The script above is equivalent to

mean_df = np.mean(df.apply(np.mean))
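As a quick sanity check (toy data of my own, not from the answer), the NaN-skipping in the loop matches pandas' default `skipna` behavior:

```python
import math
import numpy as np
import pandas as pd

def mean_without_overflow(column):
    avg, t = 0, 1
    for data in column:
        if not math.isnan(data):
            avg += (data - avg) / t
            t += 1
    return avg

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})
print(np.mean(df.apply(mean_without_overflow)))  # 3.5
print(df.mean().mean())                          # 3.5 (built-in, also skips NaN)
```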

Correct me if I'm wrong but I believe:

sum(l) / len(l) = sum(l[:n]) / len(l) + sum(l[n:2*n]) / len(l) + ...

Which means you can np.sum in batches of size n such that n * 2100 < max_float16.
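A sketch of this batching idea (the helper name and defaults are my own; it assumes `batch_size * max(values)` stays representable in the data's dtype, e.g. ~31 for float16 values near 2100):

```python
import numpy as np

def batched_mean(values, batch_size):
    # Sum the data in small batches so each partial np.sum stays
    # within the dtype's range, and divide every batch sum by the
    # total length before accumulating.
    n = len(values)
    total = 0.0
    for start in range(0, n, batch_size):
        total += float(np.sum(values[start:start + batch_size])) / n
    return total

data = np.arange(1000, dtype=np.float64)
print(batched_mean(data, 31))  # ≈ 499.5
```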

  • Your approach is also correct. Let me implement it and observe the run time. – Kaushal28 Oct 25 '19 at 11:45
  • Max value of the `float16` is `65500.0`. So max batch size can be 31 in my case and I don't think this will improve the runtime significantly compared to default `mean()` – Kaushal28 Oct 25 '19 at 11:57