Pandas: How to calculate the average of a groupby

Question

I have a csv file containing few attributes and one of them is the star ratings of different restaurants etoiles (means star in french). Here annee means the year when the rating was made.

note: I dont know how to share a Jupyter notebook table output in here, i tried different command lines but the format was always ugly. If someone could help with that.

And what i want to do is pretty simple (I think).. I want to add a new column that represents the standard deviation of the mean of stars per year of a restaurant. So I must estimate the average stars rating per year. Then, calculate the standard deviation on these values. But, I dont really know the syntax using pandas that will allow me to calculate the average star rating of a restaurant per year. Any suggestions?

I understand that I need to group the restaurants by year with .groupby('restaurant_id')['annee'] and then take the average of the stars rating of the restaurant during that year but i dont know how to write it.


# does not work
avis['newColumn'] = (
avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean().std()
)

Here are some hints: 1) convert your dates to datetime, if you haven't already 2) group by year and take the mean 3) take the standard deviation of that. If you haven't seen Jake Van der Plas' book on how to use pandas, it should help you understand more about how to use dataframes for these kinds of things. — szeitlin, Sep 23 '20 at 17:11
@szeitlin I extracted the year of my dates (annee) because i needed it to do this task. For 2) group by year and take the mean, i could do this with .groupby('annes')['etoiles'].mean(), but i need to calculate that for each restaurant. So i would have a new column that represents the standard deviation of the mean of stars per year of a restaurant. I dont know if i expressed it correctly. — Lynn, Sep 23 '20 at 17:24
That sounds like the right approach to me. You want to group by restaurant and year, and then take two aggregations. So it would be something like `groupby(['restaurant', 'annes']).agg({'etoiles':['mean', 'stdev']})` (you may have to fiddle with the syntax, but you can do multiple aggregations from the same source column). — szeitlin, Sep 23 '20 at 17:31
@szeitlin im not sure to get what you mean. what .agg({'etoiles':['mean', 'stdev']} does here? I understand the part where i have to group the restaurant_id with a year, and then what do i do to create a column with the avg star rating per year of those restaurants? — Lynn, Sep 23 '20 at 17:42
@szeitlin i tried something, i added a code block in my post but the std are the exact same for every restaurants so it makes no sens — Lynn, Sep 23 '20 at 19:28
did you manage to do it? If no, I can have a look. Also, would be nice if you can add some data in the text form. — Grayrigel, Sep 23 '20 at 19:50
@Grayrigel no i'm stuck, the last thing i tried was the last code block in my post. I would like to add data but i dont know how to copy past my .head() output in here it's all messy — Lynn, Sep 23 '20 at 19:58

Grayrigel · Accepted Answer · 2020-09-23T23:31:01.877

Here is a potential solution with groupby:

#generating test data
dates = pd.date_range('20130101', periods=36, freq='M')
year = dates.strftime('%Y')
df = pd.DataFrame([np.random.randint(1,10) for x in range(36)],columns=['Rating'])
df['restaurants'] = ['R_{}'.format(i) for i in range(4)]*9
df['date'] = dates
df['year'] = year
print(df)

 rating restaurants  date    year
0   8   R_0     2013-01-31  2013
1   7   R_1     2013-02-28  2013
2   1   R_2     2013-03-31  2013
3   6   R_3     2013-04-30  2013
4   4   R_0     2013-05-31  2013
5   8   R_1     2013-06-30  2013
6   7   R_2     2013-07-31  2013
7   5   R_3     2013-08-31  2013
8   4   R_0     2013-09-30  2013
9   5   R_1     2013-10-31  2013
10  4   R_2     2013-11-30  2013
11  8   R_3     2013-12-31  2013
12  9   R_0     2014-01-31  2014
13  6   R_1     2014-02-28  2014
14  3   R_2     2014-03-31  2014
15  6   R_3     2014-04-30  2014
16  2   R_0     2014-05-31  2014
17  8   R_1     2014-06-30  2014
18  1   R_2     2014-07-31  2014
19  5   R_3     2014-08-31  2014
20  1   R_0     2014-09-30  2014
21  7   R_1     2014-10-31  2014
22  3   R_2     2014-11-30  2014
23  4   R_3     2014-12-31  2014
24  2   R_0     2015-01-31  2015
25  4   R_1     2015-02-28  2015
26  8   R_2     2015-03-31  2015
27  7   R_3     2015-04-30  2015
28  3   R_0     2015-05-31  2015
29  1   R_1     2015-06-30  2015
30  2   R_2     2015-07-31  2015
31  8   R_3     2015-08-31  2015
32  7   R_0     2015-09-30  2015
33  5   R_1     2015-10-31  2015
34  3   R_2     2015-11-30  2015
35  3   R_3     2015-12-31  2015

#df['date'] = pd.to_datetime(df['date']) #more versatile
#df.set_index('dates') #more versatile
#df.groupby([pd.Grouper(freq='1Y'),'restraunts'])['Rating'].mean() #more versatile

df = df.groupby(['year','restaurants']).agg({'Rating':[np.mean,np.std]})
print(df)

Output:

                  Rating   Rating
year  restaurants  mean      std
2013    R_0     5.333333    2.309401
        R_1     6.666667    1.527525
        R_2     4.000000    3.000000
        R_3     6.333333    1.527525
2014    R_0     4.000000    4.358899
        R_1     7.000000    1.000000
        R_2     2.333333    1.154701
        R_3     5.000000    1.000000
2015    R_0     4.000000    2.645751
        R_1     3.333333    2.081666
        R_2     4.333333    3.214550
        R_3     6.000000    2.645751

EDIT:

Renaming columns:

df.columns = ['Mean','STD']
df.reset_index(inplace=True)
    year   restaurant Mean     STD
0   2013    R_0     1.333333    0.577350
1   2013    R_1     5.333333    3.511885
2   2013    R_2     1.333333    0.577350
3   2013    R_3     4.333333    2.886751
4   2014    R_0     3.000000    1.000000
5   2014    R_1     3.666667    2.886751
6   2014    R_2     4.333333    4.041452
7   2014    R_3     5.333333    2.081666
8   2015    R_0     6.000000    2.645751
9   2015    R_1     6.333333    3.785939
10  2015    R_2     6.333333    3.785939
11  2015    R_3     5.666667    3.055050

So, i im not sure to really understand the output. First, concerning the data there should be multiple restaurants with the same id because people can give different ratings for the same restaurant. Then, i was expecting the avg of the ratings for the same restaurant per year, and this for each restaurant. Am i clear enough? — Lynn, Sep 23 '20 at 21:02
there should be multiple restaurants with the same id! for example multiple R_0 with different ratings through different years — Lynn, Sep 23 '20 at 21:12
Its hard to get a sample of your data. I have updated it. Please check now. Now, what I have calculated is average rating of each restaurant each year. — Grayrigel, Sep 23 '20 at 21:16
ohh yes that makes more sens, now my final step is to calculate the standard deviation of the mean of stars per year of a restaurant in a new column and map it to the restaurants id's. Is it possible? — Lynn, Sep 23 '20 at 21:19
Okay. If I understand it correctly, that's straightforward to do with `.agg`. I have updated my answer. — Grayrigel, Sep 23 '20 at 21:30
Keeping the date is pretty much impossible also irrelevant in the `groupby` because we are dealing with year aggregation. I have edited answer showing to rename. — Grayrigel, Sep 23 '20 at 21:54
i need to add std column to df according to the restaurants id's. what is the syntax to do that? — Lynn, Sep 23 '20 at 21:56
Std column I show in the my answer is according to restaurants. It is std of the rating for each year for each restaurant. — Grayrigel, Sep 23 '20 at 21:59
Please remember to accept/upvote the answer if it helped you in anyway. — Grayrigel, Sep 23 '20 at 22:20
Ofcourse, i'm still trying to figure out how to add the new std column to the initial dataframe..df.columns = ['Mean','STD'] is giving me errors — Lynn, Sep 23 '20 at 23:09
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221978/discussion-between-grayrigel-and-hazel). — Grayrigel, Sep 23 '20 at 23:26

score 0 · Answer 2 · answered Sep 23 '20 at 17:15

0

You can calculate the standard deviation of the mean of stars per year by:

df.groupby('annes')['etoiles'].mean().std()

Let me know if it worked.

answered Sep 23 '20 at 17:15

mrozsmann

73
1
6

hmm I'm not sure because i need to calculate that for each restaurant. So i would have a new column that represents the standard deviation of the mean of stars per year of a restaurant – Lynn Sep 23 '20 at 17:21
actually it would make more sens to group retaurant_id with a year no? – Lynn Sep 23 '20 at 17:43

Pandas: How to calculate the average of a groupby

2 Answers2