pandas groupby objects, combining and plotting

Question

I probably don't really understand when or how to use the groupby function of pandas.DataFrame. In the example below I want to bin my dataframe by petal length and calculate the number of entries, the mean and the spread for each bin. I can do that with three groupby calls, but then I have the answers in three separate objects, so I concat them afterwards. Now I have one object, but all columns are called "sepal width (cm)"; passing names to concat did not work for me. I would also like to get the bin and mean values, e.g. for plotting, but I do not know how to do that.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
data = pd.DataFrame(iris.data)
data.columns = iris.feature_names
data["bin"] = pd.cut(data["petal length (cm)"], 5)

g0 = data.groupby(["bin"])["sepal width (cm)"].count()
g1 = data.groupby(["bin"])["sepal width (cm)"].mean()
g2 = data.groupby(["bin"])["sepal width (cm)"].std()

# how to get better names?
g = pd.concat([g0, g1, g2], axis=1)
print(g)

# how to extract bin and mean e.g. for plotting?
#plt.plot(g.bin, g.mean)

Use `data.groupby('bin')['sepal width (cm)'].agg(['count', 'mean', 'std'])` — Shubham Sharma, Aug 22 '20 at 09:27
you can also use this after concat `g.columns = ['count' , 'mean' , 'std']` — Ran Cohen, Aug 22 '20 at 09:29
very nice, both options give the same answer. Thanks for that. Any ideas about the second part? Like, how do I get the values of "bin" to e.g. plot mean width vs the center of the bin? A simple g.bin does not work, "AttributeError: 'DataFrame' object has no attribute 'bin'" — Fuegon, Aug 22 '20 at 10:49
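
The suggestions in the comments above can be sketched as follows. This is a minimal illustration that assumes a small synthetic DataFrame in place of the iris data (the random values are invented for the example, and `flat` is a hypothetical name); only the column names match the question. It also shows why `g.bin` raises an AttributeError: after the groupby, "bin" is the index of the result, not a column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris columns (assumption: values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "petal length (cm)": rng.uniform(1.0, 7.0, size=150),
    "sepal width (cm)": rng.normal(3.0, 0.4, size=150),
})
data["bin"] = pd.cut(data["petal length (cm)"], 5)

# One groupby call, three statistics; the columns are named after the functions.
g = data.groupby("bin", observed=False)["sepal width (cm)"].agg(["count", "mean", "std"])

# "bin" is the index of g, so use g.index, or turn it into a regular column.
flat = g.reset_index()
print(flat)
```

After `reset_index()`, `flat["bin"]` holds the interval of each bin and `flat["mean"]` the mean width in that bin.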

Ran Cohen · Accepted Answer · 2020-08-22T11:32:07.127

About the second part of your question: you can use string manipulation to parse the bin edges out of the interval labels.
If I understand correctly, you can use this:

a = data['bin']
a1 = a.astype(str).str.strip('([])').str.split(',').str[0].astype(float) 
a2 = a.astype(str).str.strip('([])').str.split(',').str[1].astype(float)

data['bin_center'] = (a1+a2)/2
g = data.groupby('bin_center')['sepal width (cm)'].agg(['count', 'mean', 'std'])

plt.plot(g.index, g['mean'])

By the way, if you don't really need the bin center and just want to see the plot with the bins,
you can use the built-in pandas plot:

g = data.groupby('bin')['sepal width (cm)'].agg(['count', 'mean', 'std'])
print(g)
g['mean'].plot()

edited Aug 22 '20 at 11:32

answered Aug 22 '20 at 11:18


super, that helps a lot. I only found the string parsing for the bin center kind of complicated, but found that pd.cut can return the bin edges, and calculated the center using them. – Fuegon Aug 22 '20 at 12:14
here - https://stackoverflow.com/a/48163924/9350669 :) – Ran Cohen Aug 22 '20 at 12:29
that's awesome :) – Fuegon Aug 22 '20 at 12:44
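
The bin-edge approach mentioned in the comments above might look like the sketch below. It assumes the same kind of synthetic data as a stand-in for iris (the values are invented for the example) and uses the `retbins=True` option of `pd.cut`; the names `edges` and `centers` are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris columns (assumption: values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "petal length (cm)": rng.uniform(1.0, 7.0, size=150),
    "sepal width (cm)": rng.normal(3.0, 0.4, size=150),
})

# retbins=True makes pd.cut also return the array of bin edges.
data["bin"], edges = pd.cut(data["petal length (cm)"], 5, retbins=True)

# The center of each bin is the midpoint of two neighbouring edges.
centers = (edges[:-1] + edges[1:]) / 2

# observed=False keeps all five bins, in the same order as the edges.
g = data.groupby("bin", observed=False)["sepal width (cm)"].agg(["count", "mean", "std"])
g["bin_center"] = centers
print(g)
```

With this in place, `plt.plot(g["bin_center"], g["mean"])` plots the mean width against the bin center without any string parsing.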
