2

The following DataFrame (df) has 3 columns : Event, Counts and Length where the Counts column contains series of numbers. I would like to estimate the mean of each Counts row and display it on a fourth one.

df = 
Event                                                 Counts Length  
0     [111240.0, 110530.0, 108540.0, 108540.0, 10838...       5                                                      
7     [117240.0, 119160.0, 120160.0, 121950.0, 12195...       6
13    [180420.0, 180420.0, 183840.0, 183840.0, 18384...       6
16    [160260.0, 159430.0, 159430.0, 157720.0, 15955...      14

I have tested the following expression without success.

 df['mean'] = df['counts'].mean(axis = 1) 

and other variations based on this line.

If I understand correctly, the problem due to the fact that the Count column contains more than one element.

I appreciate your help for this apparently easy problem.

jpp
  • 159,742
  • 34
  • 281
  • 339
The_rookie
  • 35
  • 4

2 Answers2

2

You can use Numpy assuming they are all of the same length

df['Mean'] = np.mean(df['Counts'].tolist(), axis=1)
df

   Event                                           Counts  Length      Mean
0      0  [111240.0, 110530.0, 108540.0, 108540.0, 10838]       5   89937.6
1      7  [117240.0, 119160.0, 120160.0, 121950.0, 12195]       6   98141.0
2     13  [180420.0, 180420.0, 183840.0, 183840.0, 18384]       6  149380.8
3     16  [160260.0, 159430.0, 159430.0, 157720.0, 15955]      14  130559.0

Otherwise

df['Mean'] = df['Counts'].apply(np.mean)
df

   Event                                           Counts  Length      Mean
0      0  [111240.0, 110530.0, 108540.0, 108540.0, 10838]       5   89937.6
1      7  [117240.0, 119160.0, 120160.0, 121950.0, 12195]       6   98141.0
2     13  [180420.0, 180420.0, 183840.0, 183840.0, 18384]       6  149380.8
3     16  [160260.0, 159430.0, 159430.0, 157720.0, 15955]      14  130559.0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • The first option you proposed gave me this error: `IndexError: tuple index out of range`. But the second option worked. Thanks a lot – The_rookie Feb 19 '18 at 17:53
  • @jp_data_analysis I'm not always trying to optimize OP's code. Sometimes (especially when they don't ask for it) I give them what seems intuitive. – piRSquared Feb 19 '18 at 17:58
  • @The_rookie that is almost certainly because your list/arrays in `Counts` are not of the same length and won't fit into a rectangular array. Which is why I offered the second solution as well. I'm glad it worked out. – piRSquared Feb 19 '18 at 18:00
  • Thanks a lot for both options. As implied by my nickname, I'm a rookie in the Python/Pandas world. – The_rookie Feb 19 '18 at 18:01
  • @The_rookie there is no shame in being a rookie. One only has to look at my first questions to know I was right there with you at one time. – piRSquared Feb 19 '18 at 18:03
  • @jp_data_analysis I promise you I'm not getting defensive. Tone is lost in comments and chat and other such medium. When I comment to people I don't know, I try to be as matter of fact as I can. Among those I'm more familiar, I may attempt humor. But if you read my comment again, take out the tone and all that's left is me being honest about my choices. I suspect I'll see you answering quite a few questions. I'd rather we understand each other and together we all can be more helpful to the community... Til the next question (-: – piRSquared Feb 19 '18 at 18:07
  • np, all good :) btw I found a comprehensive answer [here](https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists). – jpp Feb 19 '18 at 18:13
2

Here is one way which takes advantage of numpy array performance:

import pandas as pd

df = pd.DataFrame([[[111240.0, 110530.0, 108540.0, 108540.0]],
                   [[117240.0, 119160.0, 120160.0, 121950.0]],
                   [[180420.0, 180420.0, 183840.0, 183840.0]],
                   [[160260.0, 159430.0, 159430.0, 157720.0]]], columns=['Counts'])

df['mean'] = np.mean(np.vstack(df['Counts'].values), axis=1)

#                                      Counts      mean
# 0  [111240.0, 110530.0, 108540.0, 108540.0]  109712.5
# 1  [117240.0, 119160.0, 120160.0, 121950.0]  119627.5
# 2  [180420.0, 180420.0, 183840.0, 183840.0]  182130.0
# 3  [160260.0, 159430.0, 159430.0, 157720.0]  159210.0
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 1
    The following error message was displayed : ValueError: all the input array dimensions except for the concatenation axis must match exactly I guess it is because each row has different lengths. I wanted to share this with you. The problem was solved using the previous post. Thanks anyway! – The_rookie Feb 19 '18 at 17:58
  • The error is due to the fact that not all list/arrays in `Counts` are the same length. Therefore, Numpy can't make it a rectangular ndarray. It, instead, turns it into a one dimensional array of lists which borks on `axis=1`. That said, the call to `np.vstack` is additional overhead that isn't needed. For small data, this is more performant `np.mean(df['Counts'].values.tolist(), axis=1)`. It's good to get the performance perspective as well. +1 from me. – piRSquared Feb 19 '18 at 18:13