
I'm attempting to compute a series of Bayesian averages from a dataframe (one average per row).

For example, say I have a series of (0 to 1) user ratings of candy bars, stored in a dataframe like so:

            User1   User2   User3
Snickers    0.01    NaN     0.7
Mars Bars   0.25    0.4     0.1
Milky Way   0.9     1.0     NaN
Almond Joy  NaN     NaN     NaN
Babe Ruth   0.5     0.1     0.3

I'd like to create a column in a different DF which represents each candy bar's Bayesian Average from the above data.

To calculate the BA, I'm using the standard weighted-mean form of the Bayesian average:

    S = w*R + (1 - w)*C

where:

  • S = score of the candy bar
  • R = average of user ratings for the candy bar
  • C = average of user ratings for all candy bars
  • w = weight assigned to R, computed as v/(v+m), where v is the number of user ratings for that candy bar and m is the average number of ratings across all candy bars.

I've translated that into python as such:

def bayesian_average(df):
    """Given a ratings dataframe, return a series of Bayesian averages."""
    R = df.mean(axis=1)                                # per-bar mean rating
    C = df.sum(axis=1).sum() / df.count(axis=1).sum()  # global mean rating
    v = df.count(axis=1)                               # ratings per bar
    m = v.sum() / len(df.dropna(how='all'))            # mean ratings per rated bar
    w = v / (v + m)
    return (w * R) + ((1 - w) * C)

other_df['bayesian_avg'] = bayesian_average(ratings_df)

However, my calculation seems to be off: as the number of User columns in my initial dataframe grows, the calculated Bayesian averages grow as well, eventually exceeding 1.

Is this a problem with the fundamental equation I'm using, or with how I've translated it into python? Or is there an easier way to handle this in general (e.g. a preexisting package/function)?
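For reference, here is a fully self-contained version of my setup (imports plus the example data, with the candy bars as the index), in case it helps with reproducing. On this small frame the results do stay within [0, 1]; the blow-up only shows up for me with many more columns:

```python
import numpy as np
import pandas as pd

def bayesian_average(df):
    """Given a ratings dataframe, return a series of Bayesian averages."""
    R = df.mean(axis=1)                                # per-bar mean rating
    C = df.sum(axis=1).sum() / df.count(axis=1).sum()  # global mean rating
    v = df.count(axis=1)                               # ratings per bar
    m = v.sum() / len(df.dropna(how='all'))            # mean ratings per rated bar
    w = v / (v + m)
    return (w * R) + ((1 - w) * C)

# The example ratings, with candy bars as the index
ratings_df = pd.DataFrame(
    {
        'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
        'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
        'User3': [0.7, 0.1, np.nan, np.nan, 0.3],
    },
    index=['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
)

print(bayesian_average(ratings_df).round(4))
```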

Thanks!

CaptainPlanet
  • I tested your code with a dataset of 1000 columns and it did not return numbers greater than 1, though I may be missing something here as well! – johnchase Jan 25 '19 at 03:09
  • @johnchase It's very possible the issue is arising from a bug or other issue in my code (this candy bar calculation is just an abstracted version). In which case, even just knowing that my baseline approach is correct is extremely helpful for troubleshooting the real issue :) – CaptainPlanet Jan 25 '19 at 18:32

1 Answer


I began with the dataframe you gave as an example:

import numpy as np
import pandas as pd

d = {
    'Bar': ['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
    'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
    'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
    'User3': [0.7, 0.1, np.nan, np.nan, 0.3]
}

df = pd.DataFrame(data=d)

Which looks like this:

    Bar         User1   User2    User3
0   Snickers     0.01     NaN      0.7
1   Mars Bars    0.25     0.4      0.1
2   Milky Way    0.90     1.0      NaN
3   Almond Joy    NaN     NaN      NaN
4   Babe Ruth    0.50     0.1      0.3

The first thing I did was create a list of all columns that had user reviews:

user_cols = [col for col in df.columns if 'User' in col]

Next, I found it most straightforward to create each variable of the Bayesian Average equation either as a column in the dataframe, or as a standalone variable:

  1. Calculate the value of v for each bar:

    df['v'] = df[user_cols].count(axis=1)

  2. Calculate the value of m (equals 2.0 in this example):

    m = np.mean(df['v'])

  3. Calculate the value of w for each bar:

    df['w'] = df['v']/(df['v'] + m)

  4. And calculate the value of R for each bar:

    df['R'] = np.mean(df[user_cols], axis=1)

  5. Finally, get the value of C (equals 0.426 in this example):

    C = np.nanmean(df[user_cols].values.flatten())

And now we're ready to calculate the Bayesian Average score, S, for each candy bar:

df['S'] = df['w']*df['R'] + (1 - df['w'])*C

This gives us a dataframe that looks like this:

    Bar        User1    User2    User3   v    w      R       S
0   Snickers    0.01      NaN      0.7   2  0.5  0.355  0.3905
1   Mars Bars   0.25      0.4      0.1   3  0.6  0.250  0.3204
2   Milky Way   0.90      1.0      NaN   2  0.5  0.950  0.6880
3   Almond Joy  NaN       NaN      NaN   0  0.0    NaN     NaN
4   Babe Ruth   0.50      0.1      0.3   3  0.6  0.300  0.3504

The final column S contains the S-scores for the candy bars. If you want, you can then drop the temporary v, w, and R columns with df = df.drop(['v', 'w', 'R'], axis=1):

    Bar        User1    User2    User3        S
0   Snickers    0.01      NaN      0.7   0.3905
1   Mars Bars   0.25      0.4      0.1   0.3204
2   Milky Way   0.90      1.0      NaN   0.6880
3   Almond Joy  NaN       NaN      NaN      NaN
4   Babe Ruth   0.50      0.1      0.3   0.3504
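For convenience, the five steps above can be collected into one self-contained function (a sketch; passing the rating columns explicitly, and the column names themselves, are carried over from the example):

```python
import numpy as np
import pandas as pd

def add_bayesian_scores(df, user_cols):
    """Return a copy of df with the v, w, R, and S columns from steps 1-5."""
    out = df.copy()
    out['v'] = out[user_cols].count(axis=1)          # step 1: ratings per bar
    m = out['v'].mean()                              # step 2: mean number of ratings
    out['w'] = out['v'] / (out['v'] + m)             # step 3: weight on each bar's own mean
    out['R'] = out[user_cols].mean(axis=1)           # step 4: per-bar mean rating
    C = np.nanmean(out[user_cols].values.flatten())  # step 5: global mean rating
    out['S'] = out['w'] * out['R'] + (1 - out['w']) * C
    return out

d = {
    'Bar': ['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
    'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
    'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
    'User3': [0.7, 0.1, np.nan, np.nan, 0.3],
}
df = add_bayesian_scores(pd.DataFrame(data=d), ['User1', 'User2', 'User3'])
print(df[['Bar', 'S']])
```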
James Dellinger
  • I believe that you do not want to drop null values when calculating C; this would have the effect of dropping rows and columns that contain null values – johnchase Jan 25 '19 at 03:04
  • Thanks so much. I updated my formula for C to leave in the null values in order to prevent removal of non-null values: `np.nanmean(df[user_cols].values.flatten())`. Also updated final results for S. – James Dellinger Jan 25 '19 at 03:41
  • Thanks! Seems like I was on the right track with my code, but there are some calculations (m & C) where I could tighten up the logic. Not sure if my methods are different enough to introduce the errors I'm seeing, but with the additions of your logic, I can at least be confident that my baseline BA algorithm is correct. – CaptainPlanet Jan 25 '19 at 18:34