
I have a dataframe with columns user, score, times, where each user's distinct scores and the number of times they received each score are listed:

user1, 1, 4
user1, 7, 2
user2, 3, 1
user2, 10, 2

and so on. I'd like to calculate the median score for each user. For that I guess I should create a row-duplicated df, such as -

user1,1
user1,1
user1,1
user1,1
user1,7
user1,7
user2,3
user2,10
user2,10

and then use groupby and apply to calculate the median somehow?

My questions -

  1. Is this the correct approach? My df is very large, so the solution has to be time-efficient.
  2. If this is indeed the way to go - can you please advise how? It keeps failing for me whatever I try.
  • I believe you may be interested in weighted median http://stackoverflow.com/questions/26102867/python-weighted-median-algorithm-with-pandas – Ilya V. Schurov Dec 26 '15 at 20:02
  • For your 2nd, row duplicated dataframe you could do `df.groupby('user').agg(np.median)` which will give you `1` and `10` – Anton Protopopov Dec 26 '15 at 20:57
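
A minimal sketch of the row-duplication plus groupby approach suggested in the comment above (not from the original post; it assumes the question's column names user, score, times and pandas' Index.repeat):

import pandas as pd

df = pd.DataFrame([['user1', 1, 4], ['user1', 7, 2],
                   ['user2', 3, 1], ['user2', 10, 2]],
                  columns=['user', 'score', 'times'])

# repeat each row 'times' times to build the row-duplicated frame,
# then take a plain groupby median
expanded = df.loc[df.index.repeat(df['times']), ['user', 'score']]
print(expanded.groupby('user')['score'].median())
# user
# user1     1.0
# user2    10.0
# Name: score, dtype: float64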

2 Answers


I believe you need a weighted median. I used the weighted_median function from here; you can also try wquantile's weighted.median, but it interpolates in a slightly different way, so you may get unexpected results:

import numpy as np
import pandas as pd

# from here: https://stackoverflow.com/a/32921444/3025981, CC BY-SA by Afshin @ SE
def weighted_median(values, weights):
    '''Compute the weighted median of a list of values. The weighted
    median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding
       value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming weights
    are probabilities. Sorted values = [0, 1, 3] and corresponding sorted
    weights = [0.6, 0.1, 0.3]; the 0.5 point on the weights corresponds to
    the first item, which is 0, so the weighted median is 0.'''

    #convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w*1.0)/sum_weights for w in weights])
    #sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted  = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    #select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()

    return values_sorted[median_index]

# end from

def wmed(group):
    return weighted_median(group['score'], group['times'])

df = pd.DataFrame([
        ['user1', 1, 4],
        ['user1', 7, 2],
        ['user2', 3, 1],
        ['user2', 10, 2]
        ], columns = ['user', 'score', 'times'])
groups = df.groupby('user')
groups.apply(wmed)

# user
# user1     1
# user2    10
# dtype: int64
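
If the explicit loop becomes a bottleneck on a very large frame, a vectorized variant can be built on NumPy's cumsum and searchsorted. This is a sketch of my own, not part of the quoted function; the name weighted_median_np and the exact handling of the 0.5 tie are assumptions:

import numpy as np

def weighted_median_np(values, weights):
    # sort by value, accumulate normalized weights, and find where the
    # cumulative weight first reaches 0.5
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cw = np.cumsum(w) / w.sum()
    idx = np.searchsorted(cw, 0.5)
    if np.isclose(cw[idx], 0.5):
        # cumulative weight hits exactly 0.5: average the two straddling
        # values, mirroring the loop version above
        return 0.5 * (v[idx] + v[idx + 1])
    return v[idx]

df.groupby('user').apply(lambda g: weighted_median_np(g['score'], g['times']))
# user
# user1     1.0
# user2    10.0
# dtype: float64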
– Ilya V. Schurov

import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})

# Create dictionary of empty lists keyed on user.
scores = {user: [] for user in df.user.unique()}

# Expand list of scores for each user using a list comprehension.
_ = [scores[row.user].extend([row.score] * row.times) for row in df.itertuples()]

>>> scores
{'user1': [1, 1, 1, 1, 7, 7], 'user2': [3, 10, 10]}

# Now you can use a dictionary comprehension to calculate the median score of each user.
>>> {user: np.median(scores[user]) for user in scores}
{'user1': 1.0, 'user2': 10.0}
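
If a pandas object is preferred downstream, the resulting dictionary converts directly into a Series (a small usage note, not part of the original answer):

>>> pd.Series({user: np.median(s) for user, s in scores.items()})
user1     1.0
user2    10.0
dtype: float64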
– Alexander