I don't think I fully understand when or how to use the groupby function of pandas.DataFrame. In the example below I want to bin my dataframe by petal length and calculate the number of entries, the mean, and the spread of sepal width for each bin. I can do that with three groupby calls, but then the results are in three separate objects, so I concat them afterwards. Now I have one object, but all columns are called "sepal width (cm)"; passing names to concat did not work for me. I would also like to get the bin and the mean values, e.g. for plotting, but I do not know how to do that.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
data = pd.DataFrame(iris.data)
data.columns = iris.feature_names
data["bin"] = pd.cut(data["petal length (cm)"], 5)

g0 = data.groupby(["bin"])["sepal width (cm)"].count()
g1 = data.groupby(["bin"])["sepal width (cm)"].mean()
g2 = data.groupby(["bin"])["sepal width (cm)"].std()

# how to get better names?
g = pd.concat([g0, g1, g2], axis=1)
print(g)

# how to extract bin and mean e.g. for plotting?
#plt.plot(g.bin, g.mean)
Fuegon
  • Use `data.groupby('bin')['sepal width (cm)'].agg(['count', 'mean', 'std'])` – Shubham Sharma Aug 22 '20 at 09:27
  • You can also use this after concat: `g.columns = ['count', 'mean', 'std']` – Ran Cohen Aug 22 '20 at 09:29
  • Very nice, both options give the same answer. Thanks for that. Any ideas about the second part? How do I get the values of "bin", e.g. to plot mean width vs. the center of the bin? A simple `g.bin` does not work: "AttributeError: 'DataFrame' object has no attribute 'bin'" – Fuegon Aug 22 '20 at 10:49

1 Answer

For the second part of your question, you can parse the bin edges out of the interval labels with string manipulation.
If I understand correctly, you can use this:

# Turn each interval label, e.g. "(0.994, 2.18]", into its left and right edges
a = data['bin']
a1 = a.astype(str).str.strip('([])').str.split(',').str[0].astype(float)
a2 = a.astype(str).str.strip('([])').str.split(',').str[1].astype(float)

# The bin center is the midpoint of the two edges
data['bin_center'] = (a1+a2)/2
g = data.groupby('bin_center')['sepal width (cm)'].agg(['count', 'mean', 'std'])

plt.plot(g.index, g['mean'])
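
As an alternative to parsing the labels as strings, the bin centers can probably be read straight from the interval objects that pd.cut produces, since each pandas Interval has a `mid` attribute. The following is only a sketch: it uses synthetic random data in place of the iris dataset, and the names `df`, `stats` and `centers` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two iris columns (assumption: not the real data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "petal length (cm)": rng.uniform(1.0, 7.0, 150),
    "sepal width (cm)": rng.normal(3.0, 0.4, 150),
})
df["bin"] = pd.cut(df["petal length (cm)"], 5)

# observed=False keeps every bin in the result, even an empty one
stats = df.groupby("bin", observed=False)["sepal width (cm)"].agg(["count", "mean", "std"])

# Each index entry is a pandas Interval; .mid is the midpoint of its edges
centers = np.array([interval.mid for interval in stats.index])
```

With that, `plt.plot(centers, stats["mean"])` should give the mean width against the bin centers, without any string handling.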

By the way, if you don't really need the bin centers and just want to plot against the bins themselves,
you can use the DataFrame plot method:

g = data.groupby('bin')['sepal width (cm)'].agg(['count', 'mean', 'std'])
print(g)
g['mean'].plot()
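
On why `g.bin` fails: after the groupby, "bin" is the index of the result, not a column. A minimal sketch of one way around this is to call `reset_index()`, here combined with named aggregation to pick the column names directly. This assumes pandas 0.25 or later, uses synthetic data instead of iris, and the names `n`, `mean_width` and `std_width` are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two iris columns (assumption: not the real data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "petal length (cm)": rng.uniform(1.0, 7.0, 150),
    "sepal width (cm)": rng.normal(3.0, 0.4, 150),
})
df["bin"] = pd.cut(df["petal length (cm)"], 5)

summary = (
    df.groupby("bin", observed=False)
      .agg(
          n=("sepal width (cm)", "count"),          # entries per bin
          mean_width=("sepal width (cm)", "mean"),  # mean per bin
          std_width=("sepal width (cm)", "std"),    # spread per bin
      )
      .reset_index()  # move "bin" from the index back into a regular column
)
```

After this, `summary["bin"]` and `summary["mean_width"]` are both ordinary columns of one DataFrame.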

Ran Cohen