I'm trying to compute a series of Bayesian averages, one per row of a dataframe.
For example, say I have a series of (0 to 1) user ratings of candy bars, stored in a dataframe like so:
            User1  User2  User3
Snickers     0.01    NaN    0.7
Mars Bars    0.25    0.4    0.1
Milky Way    0.9     1.0    NaN
Almond Joy   NaN     NaN    NaN
Babe Ruth    0.5     0.1    0.3
I'd like to create a column in a different DF which represents each candy bar's Bayesian Average from the above data.
To calculate the Bayesian average, I'm using the equation presented here, S = wR + (1 - w)C, where:
- S = score of the candy bar
- R = average of user ratings for the candy bar
- C = average of user ratings for all candy bars
- w = weight assigned to R, computed as v/(v+m), where v is the number of user ratings for that candy bar and m is the average number of ratings per candy bar.
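To make the definitions concrete, here is a hand calculation for a single candy bar (Snickers) from the sample table, in plain Python. It assumes the equation is S = wR + (1 - w)C (which is what my function returns) and that m only counts candy bars with at least one rating (which is what my `dropna(how='all')` does). The dictionary layout is just for illustration.

```python
# Worked example for one candy bar (Snickers), using the sample table.
# Assumptions: S = w*R + (1 - w)*C, and m counts only bars with >= 1 rating.
ratings = {
    "Snickers": [0.01, 0.7],
    "Mars Bars": [0.25, 0.4, 0.1],
    "Milky Way": [0.9, 1.0],
    "Almond Joy": [],
    "Babe Ruth": [0.5, 0.1, 0.3],
}

all_ratings = [r for rs in ratings.values() for r in rs]
rated_bars = [rs for rs in ratings.values() if rs]

C = sum(all_ratings) / len(all_ratings)   # average of all ratings
m = len(all_ratings) / len(rated_bars)    # average number of ratings per rated bar

snickers = ratings["Snickers"]
v = len(snickers)                         # number of ratings for Snickers
R = sum(snickers) / v                     # average rating for Snickers
w = v / (v + m)                           # weight given to R
S = w * R + (1 - w) * C                   # Bayesian average

print(round(C, 3), m, round(w, 3), round(S, 3))  # 0.426 2.5 0.444 0.394
```

Since w is between 0 and 1, S is a weighted average of R and C, so with ratings in the 0 to 1 range I would expect it to stay in that range too.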
I've translated that into Python as follows:
def bayesian_average(df):
    """Given a dataframe, return a series of Bayesian averages (one per row)."""
    v = df.count(axis=1)                     # number of ratings per candy bar
    R = df.mean(axis=1)                      # mean rating per candy bar (NaNs skipped)
    C = df.sum(axis=1).sum() / v.sum()       # mean of all ratings
    m = v.sum() / len(df.dropna(how='all'))  # mean number of ratings per rated candy bar
    w = v / (v + m)                          # weight assigned to R
    return w * R + (1 - w) * C

other_df['bayesian_avg'] = bayesian_average(ratings_df)
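For reference, here is a minimal self-contained reproduction using the sample table. It assumes pandas and NumPy are installed; `ratings_df` and `other_df` are built here as stand-ins for my real data, and the function performs the same calculation as mine, written with named intermediates.

```python
import numpy as np
import pandas as pd


def bayesian_average(df):
    """Given a dataframe, return a series of Bayesian averages (one per row)."""
    v = df.count(axis=1)                     # ratings per candy bar
    R = df.mean(axis=1)                      # mean rating per candy bar
    C = df.sum(axis=1).sum() / v.sum()       # mean of all ratings
    m = v.sum() / len(df.dropna(how="all"))  # mean ratings per rated candy bar
    w = v / (v + m)
    return w * R + (1 - w) * C


# Stand-in for the real ratings data: the sample table from the question.
ratings_df = pd.DataFrame(
    {
        "User1": [0.01, 0.25, 0.9, np.nan, 0.5],
        "User2": [np.nan, 0.4, 1.0, np.nan, 0.1],
        "User3": [0.7, 0.1, np.nan, np.nan, 0.3],
    },
    index=["Snickers", "Mars Bars", "Milky Way", "Almond Joy", "Babe Ruth"],
)

# Stand-in for the target dataframe, sharing the same index.
other_df = pd.DataFrame(index=ratings_df.index)
other_df["bayesian_avg"] = bayesian_average(ratings_df)
print(other_df)
```

On this small sample every computed value falls between 0 and 1. Almond Joy, which has no ratings, comes out as NaN because its R is NaN.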
However, my calculation seems to be off: as the number of User columns in my initial dataframe grows, the calculated Bayesian averages grow as well, eventually exceeding 1.
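One way to narrow this down is a check along the following lines. It is only a sketch: it feeds the same calculation synthetic ratings in the 0 to 1 range with an increasing number of hypothetical user columns (the sizes, seed, and 30% missing rate are arbitrary choices) and records the largest Bayesian average for each width. If the maximum stays at or below 1 here but not on the real data, that would suggest something other than 0-to-1 ratings (for example a non-rating column) is entering the calculation, though that is an assumption I have not confirmed.

```python
import numpy as np
import pandas as pd


def bayesian_average(df):
    """Given a dataframe, return a series of Bayesian averages (one per row)."""
    v = df.count(axis=1)
    R = df.mean(axis=1)
    C = df.sum(axis=1).sum() / v.sum()
    m = v.sum() / len(df.dropna(how="all"))
    w = v / (v + m)
    return w * R + (1 - w) * C


rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
max_by_width = {}
for n_users in (3, 30, 300):
    # Synthetic ratings in [0, 1) for 50 items, with roughly 30% missing.
    values = rng.random((50, n_users))
    values[rng.random((50, n_users)) < 0.3] = np.nan
    df = pd.DataFrame(values, columns=[f"User{i + 1}" for i in range(n_users)])
    # Series.max() skips NaN, so rows without any ratings are ignored.
    max_by_width[n_users] = bayesian_average(df).max()

print(max_by_width)
```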
Is this a problem with the fundamental equation I'm using, or with how I've translated it into Python? Or is there an easier way to handle this in general (e.g. a preexisting package or function)?
Thanks!