
I'm aware that a similar question has already been asked; however, I'm looking for further clarification to better understand .groupby, if possible. Data used

I want exactly the same result as this, but using .groupby():

df.pivot(columns='survived').age.plot.hist()


So I tried:

df.groupby('age')['survived'].count().plot.hist()


The x-axis doesn't look right. Is there any way I can get the same result as .pivot() does using the pure .groupby() method? Thank you.

Graphite
    Somewhat equivalent would be `(df['survived'].groupby(pd.cut(df.age, bins=10)).value_counts().unstack().plot.bar(width=0.4))`. – Quang Hoang Mar 19 '21 at 02:27

2 Answers


Expanding on Quang's comment, you would want to bin the ages rather than grouping on every single age (which is what df.groupby('age') does).

One method is to cut the ages into bins:

df['age group'] = pd.cut(df.age, bins=range(0, 100, 10), right=False)

Then group by those bins and make a bar plot of survived.value_counts():

(df.groupby('age group').survived.value_counts()
   .unstack().plot.bar(width=1, stacked=True))
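To check the numbers without plotting, here is a minimal sketch of the same idea on toy data. The DataFrame is made up for illustration (it is not the asker's dataset), and it counts with .size() over two keys instead of .value_counts(), which should give the same table. The np.histogram call is only there to show that the binned counts match what a histogram of the raw ages would compute.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's dataset
df = pd.DataFrame({
    "age": [2, 8, 15, 22, 22, 35, 35, 41, 58, 63],
    "survived": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})

# Left-closed 10-year bins: [0, 10), [10, 20), ..., [80, 90)
bins = list(range(0, 100, 10))
age_group = pd.cut(df["age"], bins=bins, right=False).rename("age group")

# Count rows per (age bin, survived) pair, then spread survived into columns.
# observed=True keeps only bins that actually contain data.
counts = (
    df.groupby([age_group, "survived"], observed=True)
      .size()
      .unstack(fill_value=0)
)

# Histogram of the raw ages of survivors, using the same bin edges
hist_survived, _ = np.histogram(df.loc[df["survived"] == 1, "age"], bins=bins)

print(counts)
print(hist_survived)
```

Appending .plot.bar(width=1, stacked=True) to counts should then draw the stacked bars, one per occupied bin.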

I noticed that in the link you posted, all the histograms look a little different. I think that's due to slight differences in how each method bins the data. One advantage of cutting your own bins is that you can clearly see the exact bin boundaries:

[Figure: histogram of survival by age]

tdy

I upvoted this question because there's a very subtle difference between pivot and groupby. I think you're looking for something similar to this:

import matplotlib.pyplot as plt

df.groupby('age').size().plot.bar(width=1)
plt.show()

However, I do not think there's a reasonable way to get the same result by grouping, because hist() needs the observations in their raw form, while groupby is designed to be followed by a function that aggregates or transforms the data (such as count, min, mean, etc.).

To see this, notice that after grouping by age and calling count, you no longer have the raw array of ages. For instance, there are 13 observations of people who are 40 years of age. The raw data looks like (40, 40, ... , 40, 40), while the grouped count looks like:

age  count
 40     13

This is not what the data should look like for a histogram. Another key difference is the binning. As you can see, the first plot counts all the observations of people with ages between 0 and 10 in a single bin. By grouping by age, you would have 11 bins inside this bin: one for people aged 0, one for people aged 1, one for people aged 2, etc.
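A small sketch of why the x-axis in the question looks off, using made-up ages (only the 13 people aged 40 come from the example above). A histogram of the grouped result bins the counts themselves, not the ages:

```python
import numpy as np
import pandas as pd

# Hypothetical raw ages: 13 people aged 40 plus a few others
ages = pd.Series([40] * 13 + [3, 3, 7, 25, 25, 25], name="age")

# One row per distinct age, holding how many times it occurs
per_age = ages.groupby(ages).size()

# Histogram of the raw observations: the x-axis is age
raw_hist, _ = np.histogram(ages, bins=[0, 10, 20, 30, 40, 50])

# Histogram of the grouped counts: the x-axis is now "number of people
# sharing an age", which is what .count().plot.hist() ends up drawing
count_hist, _ = np.histogram(per_age, bins=[0, 5, 10, 15])

print(per_age)
print(raw_hist)
print(count_hist)
```

So the second histogram answers "how many ages occur between 0 and 5 times?", which is a different question from the age distribution.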

To summarize, groupby expects a function that will transform the original data, but in order to plot a histogram, you need the data in its raw state. For this reason, pivot is the go-to solution for this kind of task, as it also splits the data by survived, but does not apply any function to the data.
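As a rough illustration on toy data (the DataFrame below is hypothetical), this is what pivot hands to hist(): every raw age is still there, just moved into the column for its survived value, with NaN in the other column.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's dataset
df = pd.DataFrame({
    "age": [2, 8, 15, 22, 35, 41],
    "survived": [1, 0, 1, 0, 1, 0],
})

# One column per survived value; each row keeps its original age
# in exactly one of the columns and NaN in the other
wide = df.pivot(columns="survived")["age"]

print(wide)
```

Calling wide.plot.hist() would then bin each column separately, which is how the first plot in the question is produced.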

Arturo Sbr