Mean of a series contained within a DataFrame

Question

The following DataFrame (df) has 3 columns : Event, Counts and Length where the Counts column contains series of numbers. I would like to estimate the mean of each Counts row and display it on a fourth one.

df = 
Event                                                 Counts Length  
0     [111240.0, 110530.0, 108540.0, 108540.0, 10838...       5                                                      
7     [117240.0, 119160.0, 120160.0, 121950.0, 12195...       6
13    [180420.0, 180420.0, 183840.0, 183840.0, 18384...       6
16    [160260.0, 159430.0, 159430.0, 157720.0, 15955...      14

I have tested the following expression without success.

 df['mean'] = df['counts'].mean(axis = 1)

and other variations based on this line.

If I understand correctly, the problem due to the fact that the Count column contains more than one element.

I appreciate your help for this apparently easy problem.

try this: df['mean'] = df.Counts.map(lambda counts:pd.Series(counts).mean()) — hkyi, Feb 19 '18 at 17:22

piRSquared · Accepted Answer · 2018-02-19T17:28:17.463

2

You can use Numpy assuming they are all of the same length

df['Mean'] = np.mean(df['Counts'].tolist(), axis=1)
df

   Event                                           Counts  Length      Mean
0      0  [111240.0, 110530.0, 108540.0, 108540.0, 10838]       5   89937.6
1      7  [117240.0, 119160.0, 120160.0, 121950.0, 12195]       6   98141.0
2     13  [180420.0, 180420.0, 183840.0, 183840.0, 18384]       6  149380.8
3     16  [160260.0, 159430.0, 159430.0, 157720.0, 15955]      14  130559.0

Otherwise

df['Mean'] = df['Counts'].apply(np.mean)
df

   Event                                           Counts  Length      Mean
0      0  [111240.0, 110530.0, 108540.0, 108540.0, 10838]       5   89937.6
1      7  [117240.0, 119160.0, 120160.0, 121950.0, 12195]       6   98141.0
2     13  [180420.0, 180420.0, 183840.0, 183840.0, 18384]       6  149380.8
3     16  [160260.0, 159430.0, 159430.0, 157720.0, 15955]      14  130559.0

edited Feb 19 '18 at 17:28

answered Feb 19 '18 at 17:22

piRSquared

285,575
57
475
624

The first option you proposed gave me this error: `IndexError: tuple index out of range`. But the second option worked. Thanks a lot – The_rookie Feb 19 '18 at 17:53
@jp_data_analysis I'm not always trying to optimize OP's code. Sometimes (especially when they don't ask for it) I give them what seems intuitive. – piRSquared Feb 19 '18 at 17:58
@The_rookie that is almost certainly because your list/arrays in `Counts` are not of the same length and won't fit into a rectangular array. Which is why I offered the second solution as well. I'm glad it worked out. – piRSquared Feb 19 '18 at 18:00
Thanks a lot for both options. As implied by my nickname, I'm a rookie in the Python/Pandas world. – The_rookie Feb 19 '18 at 18:01
@The_rookie there is no shame in being a rookie. One only has to look at my first questions to know I was right there with you at one time. – piRSquared Feb 19 '18 at 18:03
@jp_data_analysis I promise you I'm not getting defensive. Tone is lost in comments and chat and other such medium. When I comment to people I don't know, I try to be as matter of fact as I can. Among those I'm more familiar, I may attempt humor. But if you read my comment again, take out the tone and all that's left is me being honest about my choices. I suspect I'll see you answering quite a few questions. I'd rather we understand each other and together we all can be more helpful to the community... Til the next question (-: – piRSquared Feb 19 '18 at 18:07
np, all good :) btw I found a comprehensive answer [here](https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists). – jpp Feb 19 '18 at 18:13

score 2 · Answer 2 · answered Feb 19 '18 at 17:30

2

Here is one way which takes advantage of numpy array performance:

import pandas as pd

df = pd.DataFrame([[[111240.0, 110530.0, 108540.0, 108540.0]],
                   [[117240.0, 119160.0, 120160.0, 121950.0]],
                   [[180420.0, 180420.0, 183840.0, 183840.0]],
                   [[160260.0, 159430.0, 159430.0, 157720.0]]], columns=['Counts'])

df['mean'] = np.mean(np.vstack(df['Counts'].values), axis=1)

#                                      Counts      mean
# 0  [111240.0, 110530.0, 108540.0, 108540.0]  109712.5
# 1  [117240.0, 119160.0, 120160.0, 121950.0]  119627.5
# 2  [180420.0, 180420.0, 183840.0, 183840.0]  182130.0
# 3  [160260.0, 159430.0, 159430.0, 157720.0]  159210.0

answered Feb 19 '18 at 17:30

jpp

159,742
34
281
339

1

The following error message was displayed : ValueError: all the input array dimensions except for the concatenation axis must match exactly I guess it is because each row has different lengths. I wanted to share this with you. The problem was solved using the previous post. Thanks anyway! – The_rookie Feb 19 '18 at 17:58
The error is due to the fact that not all list/arrays in `Counts` are the same length. Therefore, Numpy can't make it a rectangular ndarray. It, instead, turns it into a one dimensional array of lists which borks on `axis=1`. That said, the call to `np.vstack` is additional overhead that isn't needed. For small data, this is more performant `np.mean(df['Counts'].values.tolist(), axis=1)`. It's good to get the performance perspective as well. +1 from me. – piRSquared Feb 19 '18 at 18:13

Mean of a series contained within a DataFrame

2 Answers2