I have a DataFrame which contains the results of multiple aggregation functions applied to multiple columns, for example:

import numpy as np
import pandas as pd

bar = pd.DataFrame([
    {'a': 1, 'b': 2, 'grp': 0}, {'a': 3, 'b': 8, 'grp': 0},
    {'a': 2, 'b': 2, 'grp': 1}, {'a': 4, 'b': 5, 'grp': 1}
])
# Apply two aggregations to every column within each group
bar.groupby('grp').agg([np.mean, np.std])

           a               b
     mean       std  mean       std
grp
0       2  1.414214   5.0  4.242641
1       3  1.414214   3.5  2.121320

I want to combine the aggregation results for each column into lists (or tuples):

grp        a                 b  
0   [2, 1.414214]     [5.0, 4.242641]
1   [3, 1.414214]     [3.5, 2.121320]

What would be the proper way to do this?

Thanks in advance!

Dmitrij
  • From a comment: `But further on I need to push the results into a database table which is created in exactly this manner (holds arrays of summary statistics).` Really, this is an [XY problem](https://en.wikipedia.org/wiki/XY_problem). The fix you're looking for isn't the best solution to your problem. You should ask a separate question with your core problem. – jpp Sep 06 '18 at 11:04
  • Perhaps, thank you for your input. – Dmitrij Sep 06 '18 at 11:09

2 Answers

If you have to use lists in columns, you can do:

In [60]:  bar.groupby('grp').agg(lambda x: [x.mean(), x.std()])
Out[60]:
                             a                          b
grp
0    [2.0, 1.4142135623730951]   [5.0, 4.242640687119285]
1    [3.0, 1.4142135623730951]  [3.5, 2.1213203435596424]

Storing data like this in pandas is not recommended, though.
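If the aggregated frame with two-level columns already exists, as in the question, the statistics can also be collapsed after the fact instead of re-running the groupby. The following is only a sketch under that assumption; the names `agg_df`, `top_level` and `combined` are made up for illustration.

```python
import pandas as pd

bar = pd.DataFrame([
    {'a': 1, 'b': 2, 'grp': 0}, {'a': 3, 'b': 8, 'grp': 0},
    {'a': 2, 'b': 2, 'grp': 1}, {'a': 4, 'b': 5, 'grp': 1},
])

# The already-aggregated frame; its columns are (column, statistic) pairs.
agg_df = bar.groupby('grp').agg(['mean', 'std'])

# Names on the first column level, in their original order ('a', 'b').
top_level = agg_df.columns.get_level_values(0).unique()

# For each original column, turn its statistic columns into one
# [mean, std] list per group.
combined = pd.DataFrame(
    {col: agg_df[col].to_numpy().tolist() for col in top_level},
    index=agg_df.index,
)
print(combined)
```

The resulting columns have object dtype, so the same caveat about storing lists in a DataFrame applies.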

Zero

> What would be the proper way to do this?

There is no proper way. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.

The main reason holding lists in series is not recommended is that you lose all the vectorised functionality that comes with numeric series backed by NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers to Python objects. You will lose the benefits in terms of both memory and performance.
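To make the dtype difference concrete, here is a small sketch that compares the flat aggregation result with a nested column built from it. The names `flat`, `nested`, `flat_cv` and `nested_cv` are illustrative, and the ratio std / mean is just an arbitrary example computation.

```python
import pandas as pd

bar = pd.DataFrame([
    {'a': 1, 'b': 2, 'grp': 0}, {'a': 3, 'b': 8, 'grp': 0},
    {'a': 2, 'b': 2, 'grp': 1}, {'a': 4, 'b': 5, 'grp': 1},
])

# Flat layout: one float64 column per (column, statistic) pair.
flat = bar.groupby('grp').agg(['mean', 'std'])

# Nested layout: one object column holding a [mean, std] list per group.
nested = pd.Series(flat['a'].to_numpy().tolist(), index=flat.index, name='a')

print(flat.dtypes)   # float64 for every column
print(nested.dtype)  # object

# On the flat layout the ratio is a single vectorised operation.
flat_cv = flat[('a', 'std')] / flat[('a', 'mean')]

# On the nested layout each row needs a Python-level function call.
nested_cv = nested.apply(lambda pair: pair[1] / pair[0])
```

Both computations give the same numbers; the difference is that the second one runs a Python function per row rather than operating on a NumPy array.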

See also "What are the advantages of NumPy over regular Python lists?" The arguments in favour of Pandas are the same as for NumPy.

jpp
  • I understand that this is not at all what pandas dataframes are supposed to do. But further on I need to push the results into a database table which is created in exactly this manner (holds arrays of summary statistics). Sure, I can rebuild it with a couple loops, but I thought there could be a better way. – Dmitrij Sep 06 '18 at 11:01
  • @Dmitrij, Great, then it would be useful to give a *convincing* reason in your question for such a step. So far, I have never found such a reason. This is almost certainly an [XY problem](https://en.wikipedia.org/wiki/XY_problem). You are looking for a fix which isn't necessary. If you shared the whole problem (in a new question), you might find a better way. – jpp Sep 06 '18 at 11:02