4

I have a csv file containing few attributes and one of them is the star ratings of different restaurants etoiles (means star in french). Here annee means the year when the rating was made.

note: I dont know how to share a Jupyter notebook table output in here, i tried different command lines but the format was always ugly. If someone could help with that.

enter image description here

And what i want to do is pretty simple (I think).. I want to add a new column that represents the standard deviation of the mean of stars per year of a restaurant. So I must estimate the average stars rating per year. Then, calculate the standard deviation on these values. But, I dont really know the syntax using pandas that will allow me to calculate the average star rating of a restaurant per year. Any suggestions?

I understand that I need to group the restaurants by year with .groupby('restaurant_id')['annee'] and then take the average of the stars rating of the restaurant during that year but i dont know how to write it.


# does not work
avis['newColumn'] = (
avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean().std()
)
Lynn
  • 121
  • 8
  • 25
  • Here are some hints: 1) convert your dates to datetime, if you haven't already 2) group by year and take the mean 3) take the standard deviation of that. If you haven't seen Jake Van der Plas' book on how to use pandas, it should help you understand more about how to use dataframes for these kinds of things. – szeitlin Sep 23 '20 at 17:11
  • @szeitlin I extracted the year of my dates (annee) because i needed it to do this task. For 2) group by year and take the mean, i could do this with .groupby('annes')['etoiles'].mean(), but i need to calculate that for each restaurant. So i would have a new column that represents the standard deviation of the mean of stars per year of a restaurant. I dont know if i expressed it correctly. – Lynn Sep 23 '20 at 17:24
  • 1
    That sounds like the right approach to me. You want to group by restaurant and year, and then take two aggregations. So it would be something like `groupby(['restaurant', 'annes']).agg({'etoiles':['mean', 'stdev']})` (you may have to fiddle with the syntax, but you can do multiple aggregations from the same source column). – szeitlin Sep 23 '20 at 17:31
  • @szeitlin im not sure to get what you mean. what .agg({'etoiles':['mean', 'stdev']} does here? I understand the part where i have to group the restaurant_id with a year, and then what do i do to create a column with the avg star rating per year of those restaurants? – Lynn Sep 23 '20 at 17:42
  • @szeitlin i tried something, i added a code block in my post but the std are the exact same for every restaurants so it makes no sens – Lynn Sep 23 '20 at 19:28
  • did you manage to do it? If no, I can have a look. Also, would be nice if you can add some data in the text form. – Grayrigel Sep 23 '20 at 19:50
  • @Grayrigel no i'm stuck, the last thing i tried was the last code block in my post. I would like to add data but i dont know how to copy past my .head() output in here it's all messy – Lynn Sep 23 '20 at 19:58
  • 1
    Okay. I will give it a try with a test data. – Grayrigel Sep 23 '20 at 20:17
  • I have posted an answer. Let's see if works for you. – Grayrigel Sep 23 '20 at 20:48

2 Answers2

5

Here is a potential solution with groupby:

#generating test data
dates = pd.date_range('20130101', periods=36, freq='M')
year = dates.strftime('%Y')
df = pd.DataFrame([np.random.randint(1,10) for x in range(36)],columns=['Rating'])
df['restaurants'] = ['R_{}'.format(i) for i in range(4)]*9
df['date'] = dates
df['year'] = year
print(df)

 rating restaurants  date    year
0   8   R_0     2013-01-31  2013
1   7   R_1     2013-02-28  2013
2   1   R_2     2013-03-31  2013
3   6   R_3     2013-04-30  2013
4   4   R_0     2013-05-31  2013
5   8   R_1     2013-06-30  2013
6   7   R_2     2013-07-31  2013
7   5   R_3     2013-08-31  2013
8   4   R_0     2013-09-30  2013
9   5   R_1     2013-10-31  2013
10  4   R_2     2013-11-30  2013
11  8   R_3     2013-12-31  2013
12  9   R_0     2014-01-31  2014
13  6   R_1     2014-02-28  2014
14  3   R_2     2014-03-31  2014
15  6   R_3     2014-04-30  2014
16  2   R_0     2014-05-31  2014
17  8   R_1     2014-06-30  2014
18  1   R_2     2014-07-31  2014
19  5   R_3     2014-08-31  2014
20  1   R_0     2014-09-30  2014
21  7   R_1     2014-10-31  2014
22  3   R_2     2014-11-30  2014
23  4   R_3     2014-12-31  2014
24  2   R_0     2015-01-31  2015
25  4   R_1     2015-02-28  2015
26  8   R_2     2015-03-31  2015
27  7   R_3     2015-04-30  2015
28  3   R_0     2015-05-31  2015
29  1   R_1     2015-06-30  2015
30  2   R_2     2015-07-31  2015
31  8   R_3     2015-08-31  2015
32  7   R_0     2015-09-30  2015
33  5   R_1     2015-10-31  2015
34  3   R_2     2015-11-30  2015
35  3   R_3     2015-12-31  2015
#df['date'] = pd.to_datetime(df['date']) #more versatile
#df.set_index('dates') #more versatile
#df.groupby([pd.Grouper(freq='1Y'),'restraunts'])['Rating'].mean() #more versatile

df = df.groupby(['year','restaurants']).agg({'Rating':[np.mean,np.std]})
print(df)

Output:

                  Rating   Rating
year  restaurants  mean      std
2013    R_0     5.333333    2.309401
        R_1     6.666667    1.527525
        R_2     4.000000    3.000000
        R_3     6.333333    1.527525
2014    R_0     4.000000    4.358899
        R_1     7.000000    1.000000
        R_2     2.333333    1.154701
        R_3     5.000000    1.000000
2015    R_0     4.000000    2.645751
        R_1     3.333333    2.081666
        R_2     4.333333    3.214550
        R_3     6.000000    2.645751

EDIT:

Renaming columns:

df.columns = ['Mean','STD']
df.reset_index(inplace=True)
    year   restaurant Mean     STD
0   2013    R_0     1.333333    0.577350
1   2013    R_1     5.333333    3.511885
2   2013    R_2     1.333333    0.577350
3   2013    R_3     4.333333    2.886751
4   2014    R_0     3.000000    1.000000
5   2014    R_1     3.666667    2.886751
6   2014    R_2     4.333333    4.041452
7   2014    R_3     5.333333    2.081666
8   2015    R_0     6.000000    2.645751
9   2015    R_1     6.333333    3.785939
10  2015    R_2     6.333333    3.785939
11  2015    R_3     5.666667    3.055050
Grayrigel
  • 3,474
  • 5
  • 14
  • 32
  • 1
    So, i im not sure to really understand the output. First, concerning the data there should be multiple restaurants with the same id because people can give different ratings for the same restaurant. Then, i was expecting the avg of the ratings for the same restaurant per year, and this for each restaurant. Am i clear enough? – Lynn Sep 23 '20 at 21:02
  • there should be multiple restaurants with the same id! for example multiple R_0 with different ratings through different years – Lynn Sep 23 '20 at 21:12
  • Its hard to get a sample of your data. I have updated it. Please check now. Now, what I have calculated is average rating of each restaurant each year. – Grayrigel Sep 23 '20 at 21:16
  • ohh yes that makes more sens, now my final step is to calculate the standard deviation of the mean of stars per year of a restaurant in a new column and map it to the restaurants id's. Is it possible? – Lynn Sep 23 '20 at 21:19
  • Okay. If I understand it correctly, that's straightforward to do with `.agg`. I have updated my answer. – Grayrigel Sep 23 '20 at 21:30
  • but how do i name the new column? – Lynn Sep 23 '20 at 21:41
  • there should be rating, restaurants, date , year and std – Lynn Sep 23 '20 at 21:43
  • Keeping the date is pretty much impossible also irrelevant in the `groupby` because we are dealing with year aggregation. I have edited answer showing to rename. – Grayrigel Sep 23 '20 at 21:54
  • i need to add std column to df according to the restaurants id's. what is the syntax to do that? – Lynn Sep 23 '20 at 21:56
  • Std column I show in the my answer is according to restaurants. It is std of the rating for each year for each restaurant. – Grayrigel Sep 23 '20 at 21:59
  • Please remember to accept/upvote the answer if it helped you in anyway. – Grayrigel Sep 23 '20 at 22:20
  • Ofcourse, i'm still trying to figure out how to add the new std column to the initial dataframe..df.columns = ['Mean','STD'] is giving me errors – Lynn Sep 23 '20 at 23:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221978/discussion-between-grayrigel-and-hazel). – Grayrigel Sep 23 '20 at 23:26
0

You can calculate the standard deviation of the mean of stars per year by:

df.groupby('annes')['etoiles'].mean().std()

Let me know if it worked.

mrozsmann
  • 73
  • 1
  • 6
  • hmm I'm not sure because i need to calculate that for each restaurant. So i would have a new column that represents the standard deviation of the mean of stars per year of a restaurant – Lynn Sep 23 '20 at 17:21
  • actually it would make more sens to group retaurant_id with a year no? – Lynn Sep 23 '20 at 17:43